Re: altering a column's collation leaves an invalid foreign key
On Thu, Oct 17, 2024 at 7:14 PM Peter Eisentraut wrote:
>
> So I took the v5 patch you had posted and started working from there.
> The rule that you had picked isn't quite what we want, I think. It's
> okay to have nondeterministic collations on foreign keys, as long as the
> collation is the same on both sides. That's what I have implemented.
> See attached.
>
> This approach also allows cleaning up a bunch of hackiness in
> ri_triggers.c, which feels satisfying.

yech, I missed the case where the FK and PK are both nondeterministic but share the same collation OID. Your approach is neater.

However, the FK and PK sharing the same nondeterministic collation OID has implications for ri_KeysEqual. ri_KeysEqual definitely deserves some comments: when rel_is_pk is true, the equality check is collation-agnostic; when rel_is_pk is false, the equality check is collation-aware.

For example:

DROP TABLE IF EXISTS fktable, pktable;
CREATE TABLE pktable (x text COLLATE case_insensitive PRIMARY KEY);
CREATE TABLE fktable (x text collate case_insensitive REFERENCES pktable on update restrict on delete restrict);
INSERT INTO pktable VALUES ('A'), ('Å');
INSERT INTO fktable VALUES ('a');

update pktable set x = 'a' where x = 'A';
ERROR:  update or delete on table "pktable" violates foreign key constraint "fktable_x_fkey" on table "fktable"
DETAIL:  Key (x)=(A) is still referenced from table "fktable".

This should not happen? If so, the change below can solve the problem.

@@ -2930,6 +2915,16 @@ ri_KeysEqual(Relation rel, TupleTableSlot *oldslot, TupleTableSlot *newslot,
 			 */
 			Form_pg_attribute att = TupleDescAttr(oldslot->tts_tupleDescriptor, attnums[i] - 1);

+			Oid		collation = RIAttCollation(rel, attnums[i]);
+			if (OidIsValid(collation) && !get_collation_isdeterministic(collation))
+			{
+				Oid		eq_opr;
+				bool	result;
+				eq_opr = riinfo->pp_eq_oprs[i];
+				result = ri_CompareWithCast(eq_opr, RIAttType(rel, attnums[i]),
+											collation, newvalue, oldvalue);
+				return result;
+			}
 			if (!datum_image_eq(oldvalue, newvalue, att->attbyval, att->attlen))
 				return false;

The above change makes the ri_KeysEqual equality check collation-aware regardless of rel_is_pk's value. To see the effect, we can test BEFORE and AFTER applying the above ri_KeysEqual change with the attached sql script.

Attachment: pk_fk_inderministic_collation.sql
Re: altering a column's collation leaves an invalid foreign key
> So for the moment this is a master-only patch.  I think once we have
> tied down the behavior we want for the future

 /*
  * ri_GenerateQualCollation --- add a COLLATE spec to a WHERE clause
  *
- * At present, we intentionally do not use this function for RI queries that
- * compare a variable to a $n parameter.  Since parameter symbols always have
- * default collation, the effect will be to use the variable's collation.
- * Now that is only strictly correct when testing the referenced column, since
- * the SQL standard specifies that RI comparisons should use the referenced
- * column's collation.  However, so long as all collations have the same
- * notion of equality (which they do, because texteq reduces to bitwise
- * equality), there's no visible semantic impact from using the referencing
- * column's collation when testing it, and this is a good thing to do because
- * it lets us use a normal index on the referencing column.  However, we do
- * have to use this function when directly comparing the referencing and
- * referenced columns, if they are of different collations; else the parser
- * will fail to resolve the collation to use.
+ * We only have to use this function when directly comparing the referencing
+ * and referenced columns, if they are of different collations; else the
+ * parser will fail to resolve the collation to use.  We don't need to use
+ * this function for RI queries that compare a variable to a $n parameter.
+ * Since parameter symbols always have default collation, the effect will be
+ * to use the variable's collation.
+ *
+ * Note that we require that the collations of the referencing and the
+ * referenced column have the some notion of equality: Either they have to
+ * both be deterministic or else they both have to be the same.
  */

"some notion of equality" -- should it be "same notion of equality"?
Maybe we can also add "see ATAddForeignKeyConstraint also" at the end of the whole comment.

 /*
  * transformColumnNameList - transform list of column names
  *
  * Lookup each name and return its attnum and, optionally, type OID
  *
  * Note: the name of this function suggests that it's general-purpose,
  * but actually it's only used to look up names appearing in foreign-key
  * clauses.  The error messages would need work to use it in other cases,
  * and perhaps the validity checks as well.
  */
 static int
 transformColumnNameList(Oid relId, List *colList,
 						int16 *attnums, Oid *atttypids, Oid *attcollids)

Do these comments need a slight adjustment? Do the comments in transformFkeyGetPrimaryKey also need a slight change?

For ri_CompareWithCast, we can also add a comment saying the output is collation-aware.

+	/*
+	 * This shouldn't be possible, but better check to make sure we have a
+	 * consistent state for the check below.
+	 */
+	if ((pkcoll && !fkcoll) || (!pkcoll && fkcoll))
+		elog(ERROR, "key columns are not both collatable");
+
+	if (pkcoll && fkcoll)

Cosmetic: pkcoll can be changed to OidIsValid(pkcoll), and fkcoll to OidIsValid(fkcoll).

 				ereport(ERROR,
 						(errcode(ERRCODE_COLLATION_MISMATCH),
 						 errmsg("foreign key constraint \"%s\" cannot be implemented",
 								fkconstraint->conname),
 						 errdetail("Key columns \"%s\" and \"%s\" have incompatible collations: \"%s\" and \"%s\"."
 								   "If either collation is nondeterministic, then both collations have to be the same.",
 								   strVal(list_nth(fkconstraint->fk_attrs, i)),
 								   strVal(list_nth(fkconstraint->pk_attrs, i)),
 								   get_collation_name(fkcoll),
 								   get_collation_name(pkcoll))));

ERROR:  foreign key constraint "a_fk_col1_fkey" cannot be implemented
DETAIL:  Key columns "col1" and "col1" have incompatible collations: "default" and "case_insensitive". If either collation is nondeterministic, then both collations have to be the same.

The 'DETAIL: Key columns "col1" and "col1"' part may be confusing. We could be more verbose, like:
DETAIL:  Key columns "col1" in relation "X" and "col1" in relation "Y" have incompatible collations...
(just an idea, I don't have much of a preference)

 			old_check_ok = (new_pathtype == old_pathtype &&
 							new_castfunc == old_castfunc &&
 							(!IsPolymorphicType(pfeqop_right) ||
 							 new_fktype == old_fktype) &&
 							(new_fkcoll == old_fkcoll ||
 							 (get_collation_isdeterministic(old_fkcoll) &&
 							  get_collation_isdeterministic(new_fkcoll))));

I was wondering about the case where one of new_fkcoll and old_fkcoll is zero and the other is not; it turns out that wouldn't happen. Before "if (old_check_ok)" we already did extensive checks that new_fkcoll is OK for the job. In ATAddForeignKeyConstraint, we can leave the old_check_ok part as is.

What do you think of the following:

 			old_check_ok = (new_pathtype == old_pathtype &&
 							new_castfunc == old_castfunc &&
 							(!IsPolymorphicType(pfeqop_right) ||
 							 new_fktype == old_fktype));

 			if (OidIsValid(new_fkcoll) && OidIsValid(old_fkcoll) &&
 				new_fkcoll != old_fkcoll)
 			{
 				if(!get_collat
Re: Pgoutput not capturing the generated columns
On Thu, Oct 24, 2024 at 8:50 PM vignesh C wrote: > > The v42 version patch attached at [1] has the changes for the same. > Some more comments: 1. @@ -1017,7 +1089,31 @@ pgoutput_column_list_init(PGOutputData *data, List *publications, { ListCell *lc; bool first = true; + Bitmapset *relcols = NULL; Relation relation = RelationIdGetRelation(entry->publish_as_relid); + TupleDesc desc = RelationGetDescr(relation); + MemoryContext oldcxt = NULL; + bool collistpubexist = false; + + pgoutput_ensure_entry_cxt(data, entry); + + oldcxt = MemoryContextSwitchTo(entry->entry_cxt); + + /* + * Prepare the columns that will be published for FOR ALL TABLES and + * FOR TABLES IN SCHEMA publication. + */ + for (int i = 0; i < desc->natts; i++) + { + Form_pg_attribute att = TupleDescAttr(desc, i); + + if (att->attisdropped || (att->attgenerated && !entry->pubgencols)) + continue; + + relcols = bms_add_member(relcols, att->attnum); + } + + MemoryContextSwitchTo(oldcxt); This code is unnecessary for cases when the table's publication has a column list. So, I suggest to form this list only when required. Also, have an assertion that pubgencols value for entry and publication matches. 2. @@ -1115,10 +1186,17 @@ pgoutput_column_list_init(PGOutputData *data, List *publications, ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot use different column lists for table \"%s.%s\" in different publications", -get_namespace_name(RelationGetNamespace(relation)), -RelationGetRelationName(relation))); + get_namespace_name(RelationGetNamespace(relation)), + RelationGetRelationName(relation))); Is there a reason to make the above change? It appears to be a spurious change. 3. + /* Check if there is any generated column present */ + for (int i = 0; i < desc->natts; i++) + { + Form_pg_attribute att = TupleDescAttr(desc, i); + if (att->attgenerated) Add one empty line between the above two lines. 4. + else if (entry->pubgencols != pub->pubgencols) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot use different values of publish_generated_columns for table \"%s.%s\" in different publications", + get_namespace_name(RelationGetNamespace(relation)), + RelationGetRelationName(relation))); The last two lines are not aligned. -- With Regards, Amit Kapila.
Re: [PoC] Partition path cache
Bykov Ivan writes:
> Our customers use databases (1-10 TB) with big tables. Often these tables
> are big and split into sections. For example, we have tables with almost a
> thousand sections. In most cases, those sections have a similar set of
> indexes and contain similar data. Often this partitioned table has a
> multilevel structure (for example, a per-year section has per-quarter
> sections, and a per-quarter section in turn has per-month simple relation
> sections).
>
> During the analysis of the planning procedure, we found that the planner
> in PostgreSQL 15.7 spends a lot of time building access paths.
> ..
> We backported the new access path build algorithms from PostgreSQL 17
> (which optimize match_pathkeys_to_index()) and it takes effect: the
> planner spent 1090 ms planning the query the first time and 320 ms the
> second time.
>
> But we still think that the planner does unnecessary work when building
> all types of paths for every section. So we implemented a feature named
> “partition path cache” (see next section), and now the planner spends
> 970 ms planning the query the first time and 240 ms the second time.
>
> Partition path cache
>
> The partition path cache aims to speed up planning for partition scan paths.
>
> The path cache doesn't copy and transform Path nodes directly due to the
> absence of Path node copy functions. The main aim of this patch is to
> prove the assumption that partitions of the same relation that have
> similar sets of indexes may use similar access path types.

This sounds like an interesting idea. I like it because it omits the need
for a "global statistics" effort for partitioned tables, since it just uses
the first partition it knows about. Of course it has the drawback that the
"first" partition can't represent the other partitions.

One of the arguments against this patch might be "What if other partitions
have pretty different statistics from the first partition?". If I were you,
I might check all the statistics used at this stage and try to find a
similar algorithm to prove that the best path would be similar too. This
could happen once, when the statistics are gathered. However, this might
not be easy.

--
Best Regards
Andy Fan
Re: Changing the default random_page_cost value
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian wrote:
> I am not a fan of this patch.  I don't see why _removing_ the magnetic
> part helps because you then have no logic for why 1.2 was chosen.

Okay, but we have no documented logic on why 4.0 was chosen either. :)

> I would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it.

Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly
standard practice, but I'm fine with a recommendation of 4. Partly because
I doubt anyone will use it much.

> Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.

Good point, I withdraw that part.

Cheers,
Greg
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Fri, Oct 18, 2024 at 4:38 AM Daniel Gustafsson wrote:
> In validate() it seems to me we should clear out ret->authn_id on failure to
> pair belts with suspenders.  Fixed by calling explicit_bzero on it in the error
> path.

The new hunk says:

> cleanup:
>     /*
>      * Clear and free the validation result from the validator module once
>      * we're done with it to avoid accidental re-use.
>      */
>     if (ret->authn_id != NULL)
>     {
>         explicit_bzero(ret->authn_id, strlen(ret->authn_id));
>         pfree(ret->authn_id);
>     }
>     pfree(ret);

But I'm not clear on what's being protected against. Which code would
reuse this result?

Thanks,
--Jacob
Re: Refactor to use common function 'get_publications_str'.
On Fri, Oct 25, 2024 at 09:28:47AM +1100, Peter Smith wrote:
> I've attached the patch v4.

Looks OK to me. Thanks. I'll see to get that done through the day.
--
Michael
Re: Add isolation test template in injection_points for wait/wakeup/detach
On Mon, Oct 21, 2024 at 11:13:59AM +0900, Michael Paquier wrote:
> Used "basic" as test name at the end, tweaked the comment close to
> what you have suggested, and applied the result.

I have spotted a race condition with this test in some CF bot runs. Here
are some of them:
https://github.com/michaelpq/postgres/runs/31984161747
https://cirrus-ci.com/task/6741385749987328
https://cirrus-ci.com/task/6056918689513472
https://cirrus-ci.com/task/5863397261049856
https://cirrus-ci.com/task/5565235765968896
https://cirrus-ci.com/task/6238594598174720

All of them show up as follows in regression.diffs:

diff -U3 /tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out /tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
--- /tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out	2024-10-24 02:36:24.06859 +
+++ /tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out	2024-10-24 02:40:09.179695000 +
@@ -13,16 +13,16 @@
 (1 row)
 
-step wait1: <... completed>
-injection_points_run
---------------------
-
-(1 row)
-
 step detach2: SELECT injection_points_detach('injection-points-wait');
 injection_points_detach
 -----------------------
 
 (1 row)
+
+step wait1: <... completed>
+injection_points_run
+--------------------
+
+(1 row)

This is in the first permutation of the test done with "wait1 wakeup2
detach2", and the diff means that the backend running the "wait" callback
is reported as finished after the detach is done, injection_points_run
being only used for the waits. Hence the wait is so slow to finish that the
detach has time to complete and finish, breaking the output. And here comes
the puzzling part: all of the failures involve FreeBSD 13 in the CI.

Reproducing this failure would not be difficult, I thought at first; we can
add a hardcoded pg_usleep() to delay the end of injection_wait() so as to
make sure that the backend doing the wait reports later than the detach. Or
just do the same at the end of injection_points_run() once the callback
exits. I've sure done that, placing some strategic pg_usleep() calls on
Linux to make the paths that matter in the wait slower, but the test
remains stable. The CI on Linux is stable as well: the 3rd and 4th columns
of the cfbot are green, and I did not spot any failures related to this
isolation test in injection_points. Only the second column about FreeBSD is
going rogue on a periodic basis.

One fix would be to remove this first permutation test, still that's hiding
a problem rather than solving it, and something looks wrong with
conditional variables, specific to FreeBSD? Hence, I'm puzzled by that.

I am going to remove the first permutation anyway to stabilize the CI
reports. But it feels like we have a different problem here, and I am not
sure what. Any thoughts or comments?
--
Michael
Re: pgbench: Improve result outputs related to failed transactinos
Hello Alexander,

> Hello Tatsuo-san,
>
> 11.10.2024 07:54, Tatsuo Ishii wrote:
> ...
> -     number of transactions actually pocessed: 1 (tps = 410.846343)
> ...
>> Patch pushed.
>
> Please consider fixing a typo sneaked in that commit: pocessed ->
> processed?

Thank you for the report! I have pushed a fix.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
RE: Conflict detection for update_deleted in logical replication
Dear Hou,

> Here is the V5 patch set which addressed above comments.

Thanks for updating the patch! While reviewing yours, I found a corner case
where a recently deleted tuple cannot be detected when an index scan is
chosen. This can happen when indices are re-built during the replication.
Unfortunately, I don't have any solutions for it.

Found issue
===========
When indices are built with the CONCURRENTLY option, a standard MVCC
snapshot is used to list up the tuples of the table, which means the new
index ignores recently deleted tuples. This can cause the update_deleted to
be wrongly detected as update_missing.

Assuming that we have a bidirectional cluster like case 1 [1] (id is a
primary key), and REINDEX CONCURRENTLY happens after executing the DELETE
but before receiving the UPDATE. The primary key of t will be re-built by
the REINDEX command, but the dead tuples from the DELETE will be missed
because they are invisible to the transaction. Then, the apply worker
receives the UPDATE and recognizes that the target tuple has been removed.
It scans with SnapshotAny to find the deleted tuple via the index, but it
fails. This event is reported as update_missing.

Reproduce steps
===============
This can be reproduced with the v5 patch set:

1. constructed a 2-way replication by running the attached script.
2. stopped the apply worker on node2.
3. executed `UPDATE tab SET a = 2;` on node1.
4. executed `DELETE FROM tab;` on node2.
5. executed `REINDEX TABLE CONCURRENTLY tab;` on node2.
6. resumed the stopped apply worker.
7. the worker should detect update_deleted, but it detected update_missing:

```
LOG:  conflict detected on relation "public.tab": conflict=update_missing
DETAIL:  Could not find the row to be updated.
Remote tuple (2); replica identity (a)=(1).
```

[1]: https://www.postgresql.org/message-id/OS0PR01MB5716A78EE3D6F2F4754E1209946C2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: test_1025.sh
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
Alexander Lakhin 于2024年10月24日周四 22:00写道: > Hello Alvaro, > > 22.10.2024 17:32, Alvaro Herrera wrote: > > Yeah. I pushed these patches finally, thanks! > > Please look at a new anomaly introduced with 53af9491a. When running the > following script: > CREATE TABLE t (a int, b int, PRIMARY KEY (a, b)); > CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b)) > PARTITION BY LIST (a); > > CREATE TABLE tp1 (x int, a int, b int); > ALTER TABLE tp1 DROP COLUMN x; > > ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1); > > ALTER TABLE pt DETACH PARTITION tp1; > > I get a memory access error detected by Valgrind: > 2024-10-24 12:05:04.645 UTC [1079077] LOG: statement: ALTER TABLE pt > DETACH PARTITION tp1; > ==00:00:00:07.887 1079077== Invalid read of size 2 > ==00:00:00:07.887 1079077==at 0x4A61DD: DetachPartitionFinalize > (tablecmds.c:19545) > ==00:00:00:07.887 1079077==by 0x4A5C11: ATExecDetachPartition > (tablecmds.c:19386) > ==00:00:00:07.887 1079077==by 0x48561E: ATExecCmd (tablecmds.c:5540) > ==00:00:00:07.887 1079077==by 0x4845DE: ATRewriteCatalogs > (tablecmds.c:5203) > ==00:00:00:07.887 1079077==by 0x4838EC: ATController (tablecmds.c:4758) > ==00:00:00:07.887 1079077==by 0x4834F1: AlterTable (tablecmds.c:4404) > ==00:00:00:07.887 1079077==by 0x7D6D52: ProcessUtilitySlow > (utility.c:1318) > ==00:00:00:07.887 1079077==by 0x7D65F7: standard_ProcessUtility > (utility.c:1067) > ==00:00:00:07.887 1079077==by 0x7D54F7: ProcessUtility (utility.c:523) > ==00:00:00:07.887 1079077==by 0x7D3D70: PortalRunUtility > (pquery.c:1158) > ==00:00:00:07.887 1079077==by 0x7D3FE7: PortalRunMulti (pquery.c:1316) > ==00:00:00:07.887 1079077==by 0x7D3431: PortalRun (pquery.c:791) > > Reproduced on REL_15_STABLE .. master. > Sorry, I can't reproduce this leak on master. -- Thanks, Tender Wang
cache lookup failed when \d t concurrent with DML change column data type
hi.
I think I found a bug.

PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by gcc-14.1.0, 64-bit
commit at 45188c2ea2.
Ubuntu 22.04.4 LTS

setup:
drop table t cascade;
create table t(a int PRIMARY key);

IN session1:
step "change data type" {begin; alter table t alter column a set data type int4;}
step "s1" {commit;}

IN session2:
step "psql_another_session" {\d t}

permutation "change data type" "psql_another_session" "s1"

\set ECHO_HIDDEN on

/******** QUERY *********/
SELECT c2.relname, i.indisprimary, i.indisunique, i.indisclustered,
       i.indisvalid, pg_catalog.pg_get_indexdef(i.indexrelid, 0, true),
       pg_catalog.pg_get_constraintdef(con.oid, true), contype,
       condeferrable, condeferred, i.indisreplident, c2.reltablespace,
       con.conperiod
FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i
  LEFT JOIN pg_catalog.pg_constraint con ON (conrelid = i.indrelid AND conindid = i.indexrelid AND contype IN ('p','u','x'))
WHERE c.oid = '34405' AND c.oid = i.indrelid AND i.indexrelid = c2.oid
ORDER BY i.indisprimary DESC, c2.relname;
/**************************/

ERROR:  cache lookup failed for attribute 1 of relation 34418
Re: general purpose array_sort
Hi, > I can accept this outcome though an optional three-valued boolean sort order > (ascending and descending only) I'd argue is worth keeping. null value > placement too I guess, three-valued boolean (nulls_first). Perhaps these optional arguments deserve separate discussions. I suggest merging something everyone agrees on first. This will simplify the review process and allow us to deliver value to the users quickly. Arguments like `reverse => true` and `nulls_first => true` can always be implemented and added as separate patches. -- Best regards, Aleksander Alekseev
Re: Enhancing Memory Context Statistics Reporting
On 2024-10-24 14:59, Rahila Syed wrote: Hi Torikoshia, Thank you for reviewing the patch! On Wed, Oct 23, 2024 at 9:28 AM torikoshia wrote: On 2024-10-22 03:24, Rahila Syed wrote: Hi, PostgreSQL provides following capabilities for reporting memory contexts statistics. 1. pg_get_backend_memory_contexts(); [1] 2. pg_log_backend_memory_contexts(pid); [2] [1] provides a view of memory context statistics for a local backend, while [2] prints the memory context statistics of any backend or auxiliary process to the PostgreSQL logs. Although [1] offers detailed statistics, it is limited to the local backend, restricting its use to PostgreSQL client backends only. On the other hand, [2] provides the statistics for all backends but logs them in a file, which may not be convenient for quick access. I propose enhancing memory context statistics reporting by combining these capabilities and offering a view of memory statistics for all PostgreSQL backends and auxiliary processes. Thanks for working on this! I originally tried to develop something like your proposal in [2], but there were some difficulties and settled down to implement pg_log_backend_memory_contexts(). Yes. I am revisiting this problem :) Attached is a patch that implements this functionality. It introduces a SQL function that takes the PID of a backend as an argument, returning a set of records, each containing statistics for a single memory context. The underlying C function sends a signal to the backend and waits for it to publish its memory context statistics before returning them to the user. The publishing backend copies these statistics during the next CHECK_FOR_INTERRUPTS call. I remember waiting for dumping memory contexts stats could cause trouble considering some erroneous cases. For example, just after the target process finished dumping stats, pg_get_remote_backend_memory_contexts() caller is terminated before reading the stats, calling pg_get_remote_backend_memory_contexts() has no response any more: [session1]$ psql (40699)=# $ kill -s SIGSTOP 40699 [session2] psql (40866)=# select * FROM pg_get_remote_backend_memory_contexts('40699', false); -- waiting $ kill -s SIGSTOP 40866 $ kill -s SIGCONT 40699 [session3] psql (47656) $ select pg_terminate_backend(40866); $ kill -s SIGCONT 40866 -- session2 terminated [session3] (47656)=# select * FROM pg_get_remote_backend_memory_contexts('47656', false); -- no response It seems the reason is memCtxState->in_use is now and memCtxState->proc_id is 40699. We can continue to use pg_get_remote_backend_memory_contexts() after specifying 40699, but it'd be hard to understand for users. Thanks for testing and reporting. While I am not able to reproduce this problem, I think this may be happening because the requesting backend/caller is terminated before it gets a chance to mark memCtxState->in_use as false. Yeah, when I attached a debugger to 47656 when it was waiting on pg_get_remote_backend_memory_contexts('47656', false), memCtxState->in_use was true as you suspected: (lldb) p memCtxState->in_use (bool) $1 = true (lldb) p memCtxState->proc_id (int) $2 = 40699 (lldb) p pid (int) $3 = 47656 In this case memCtxState->in_use should be marked as 'false' possibly during the processing of ProcDiePending in ProcessInterrupts(). This approach facilitates on-demand publication of memory statistics for a specific backend, rather than collecting them at regular intervals. Since past memory context statistics may no longer be relevant, there is little value in retaining historical data. 
Any collected statistics can be discarded once read by the client backend. A fixed-size shared memory block, currently accommodating 30 records, is used to store the statistics. This number was chosen arbitrarily, as it covers all parent contexts at level 1 (i.e., direct children of the top memory context) based on my tests. Further experiments are needed to determine the optimal number for summarizing memory statistics. Any additional statistics that exceed the shared memory capacity are written to a file per backend in the PG_TEMP_FILES_DIR. The client backend first reads from the shared memory, and if necessary, retrieves the remaining data from the file, combining everything into a unified view. The files are cleaned up automatically if a backend crashes or during server restarts. The statistics are reported in a breadth-first search order of the memory context tree, with parent contexts reported before their children. This provides a cumulative summary before diving into the details of each child context's consumption. The rationale behind the shared memory chunk is to ensure that the majority of contexts which are the direct children of TopMemoryContext, fit into memory This allows a client to request a summary of memory statistics, which can be served from memory without the overhead of file access, unless necessary. A publishing back
Re: cache lookup failed when \d t concurrent with DML change column data type
On 10/24/24 22:30, jian he wrote: hi. I think I found a bug. PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by gcc-14.1.0, 64-bit commit at 45188c2ea2. Ubuntu 22.04.4 LTS setup: drop table t cascade; create table t(a int PRIMARY key); IN session1: step "change data type" {begin; alter table t alter column a set data type int4;} step "s1" {commit;} IN session2: step "psql_another_session" {\d t} permutation "change data type" "psql_another_session" "s1" ERROR: cache lookup failed for attribute 1 of relation 34418 Yes, it looks like a bug existing for a long time, at least since PG11 (I didn't trace further down). It seems that the backend didn't apply invalidation messages before touching system caches. Backtrace: in get_attoptions (relid=16388, attnum=1) at lsyscache.c:982 in pg_get_indexdef_worker (indexrelid=16388, colno=0, excludeOps=0x0, attrsOnly=false, keysOnly=false, showTblSpc=false, inherits=false, prettyFlags=7, missing_ok=true) at ruleutils.c:1458 in pg_get_indexdef_ext (fcinfo=0x55a15acc1c18) at ruleutils.c:1202 in ExecInterpExpr (state=0x55a15acc2a10, econtext=0x55a15ac62930, isnull=0x7fffd66a5bcf) at execExprInterp.c:770 in ExecInterpExprStillValid (state=0x55a15acc2a10, econtext=0x55a15ac62930, isNull=0x7fffd66a5bcf) at execExprInterp.c:2035 in ExecEvalExprSwitchContext (state=0x55a15acc2a10, econtext=0x55a15ac62930, isNull=0x7fffd66a5bcf) at ../../../src/include/executor/executor.h:367 in ExecProject (projInfo=0x55a15acc2a08) at ../../../src/include/executor/executor.h:401 -- regards, Andrei Lepikhov
Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails
Alexander Lakhin 于2024年10月24日周四 22:00写道:
> Hello Alvaro,
>
> 22.10.2024 17:32, Alvaro Herrera wrote:
> > Yeah.  I pushed these patches finally, thanks!
>
> Please look at a new anomaly introduced with 53af9491a. When running the
> following script:
> CREATE TABLE t (a int, b int, PRIMARY KEY (a, b));
> CREATE TABLE pt (a int, b int, FOREIGN KEY (a, b) REFERENCES t(a, b))
> PARTITION BY LIST (a);
>
> CREATE TABLE tp1 (x int, a int, b int);
> ALTER TABLE tp1 DROP COLUMN x;
>
> ALTER TABLE pt ATTACH PARTITION tp1 FOR VALUES IN (1);
>
> ALTER TABLE pt DETACH PARTITION tp1;
>
> I get a memory access error detected by Valgrind:
> 2024-10-24 12:05:04.645 UTC [1079077] LOG:  statement: ALTER TABLE pt DETACH PARTITION tp1;
> ==00:00:00:07.887 1079077== Invalid read of size 2
> ==00:00:00:07.887 1079077==    at 0x4A61DD: DetachPartitionFinalize (tablecmds.c:19545)
> ==00:00:00:07.887 1079077==    by 0x4A5C11: ATExecDetachPartition (tablecmds.c:19386)
> ==00:00:00:07.887 1079077==    by 0x48561E: ATExecCmd (tablecmds.c:5540)
> ==00:00:00:07.887 1079077==    by 0x4845DE: ATRewriteCatalogs (tablecmds.c:5203)
> ==00:00:00:07.887 1079077==    by 0x4838EC: ATController (tablecmds.c:4758)
> ==00:00:00:07.887 1079077==    by 0x4834F1: AlterTable (tablecmds.c:4404)
> ==00:00:00:07.887 1079077==    by 0x7D6D52: ProcessUtilitySlow (utility.c:1318)
> ==00:00:00:07.887 1079077==    by 0x7D65F7: standard_ProcessUtility (utility.c:1067)
> ==00:00:00:07.887 1079077==    by 0x7D54F7: ProcessUtility (utility.c:523)
> ==00:00:00:07.887 1079077==    by 0x7D3D70: PortalRunUtility (pquery.c:1158)
> ==00:00:00:07.887 1079077==    by 0x7D3FE7: PortalRunMulti (pquery.c:1316)
> ==00:00:00:07.887 1079077==    by 0x7D3431: PortalRun (pquery.c:791)
>
> Reproduced on REL_15_STABLE .. master.

Thanks for reporting. I can reproduce this invalid memory access on HEAD.
After debugging the code, I found the root cause. In
DetachPartitionFinalize(), the code below:

    att = TupleDescAttr(RelationGetDescr(partRel),
                        attmap->attnums[conkey[i] - 1] - 1);

should use confkey[i] - 1, not conkey[i] - 1. In this case, the length of
attmap is 2, conkey = {2,3}, confkey = {1,2}, so attmap->attnums[conkey[1] - 1]
triggers an invalid memory access. In this block we process the referenced
side, so we should use confkey.

I attach a patch to fix this.

--
Thanks,
Tender Wang

Attachment: 0001-Fix-memory-invalid-access-due-to-using-wrong-conkey.patch
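For reference, a sketch of how the corrected lookup described above would presumably read (swapping conkey for confkey, matching the description of the attached patch):

    att = TupleDescAttr(RelationGetDescr(partRel),
                        attmap->attnums[confkey[i] - 1] - 1);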
Useless field ispartitioned in CreateStmtContext
Hi! When looking at the partition-related code, I found that the ispartitioned field in CreateStmtContext is not used. It looks like we can safely remove it and avoid invalid assignment logic. Here's a very simple fix, any suggestion? diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 1e15ce10b48..6dea85cc2dc 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -89,7 +89,6 @@ typedef struct List *alist; /* "after list" of things to do after creating * the table */ IndexStmt *pkey; /* PRIMARY KEY index, if any */ - bool ispartitioned; /* true if table is partitioned */ PartitionBoundSpec *partbound; /* transformed FOR VALUES */ bool ofType; /* true if statement contains OF typename */ } CreateStmtContext; @@ -246,7 +245,6 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) cxt.blist = NIL; cxt.alist = NIL; cxt.pkey = NULL; - cxt.ispartitioned = stmt->partspec != NULL; cxt.partbound = stmt->partbound; cxt.ofType = (stmt->ofTypename != NULL); @@ -3401,7 +3399,6 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt, cxt.blist = NIL; cxt.alist = NIL; cxt.pkey = NULL; - cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); cxt.partbound = NULL; cxt.ofType = false; -- Best regards, hugozhang
Re: execute prepared statement passing parameter expression with COLLATE clause
jian he 于2024年10月24日周四 16:56写道:
> hi.
>
> $Subject setup
>
> CREATE COLLATION case_insensitive (provider = icu, locale =
> '@colStrength=secondary', deterministic = false);
> CREATE COLLATION ignore_accents (provider = icu, locale =
> '@colStrength=primary;colCaseLevel=yes', deterministic = false);
> DROP TABLE IF EXISTS pktable cascade;
> CREATE TABLE pktable (x text COLLATE case_insensitive);
> INSERT INTO pktable VALUES ('A');
> DEALLOCATE q6;
> PREPARE q6 AS SELECT * FROM pktable WHERE x = $1;
>
> select * from pktable where x = 'Å' collate ignore_accents;
> --return one row
>
> execute q6('Å' collate ignore_accents);
> --return zero rows
>
> not sure return zero rows is desired.

postgres=# explain execute q6('Å' collate ignore_accents);
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on pktable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (x = 'Å'::text)
(2 rows)

postgres=# explain select * from pktable where x = 'Å' collate ignore_accents;
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on pktable  (cost=0.00..27.00 rows=7 width=32)
   Filter: (x = 'Å'::text COLLATE ignore_accents)
(2 rows)

The filter expression in the two queries is different, and when I debugged
texteq, the collid was also different. So the results of the two queries
differ. I haven't looked into EXECUTE in more detail.

--
Thanks,
Tender Wang
Re: Fix for consume_xids advancing XIDs incorrectly
On 2024/10/24 5:23, Masahiko Sawada wrote: if (xids_left > 2000 && consumed - last_reported_at < REPORT_INTERVAL && MyProc->subxidStatus.overflowed) { int64 consumed_by_shortcut = consume_xids_shortcut(); if (consumed_by_shortcut > 0) { consumed += consumed_by_shortcut; continue; } } By the way, this isn't directly related to the proposed patch, but while reading the xid_wraparound code, I noticed that consume_xids_common() could potentially return an unexpected XID if consume_xids_shortcut() returns a value greater than 2000. Based on the current logic, consume_xids_common() should always return a value less than 2000, so the issue I'm concerned about shouldn't occur. Good point. Yes, the function doesn't return a value greater than 2000 as long as we do "distance = Min(distance, COMMIT_TS_XACTS_PER_PAGE - rem);". But it's true with <= 16KB block sizes. Still, would it be worth adding an assertion to ensure that consume_xids_common() never returns a value greater than 2000? While adding an assertion makes sense to me, another idea is to set last_xid even in the shortcut path. That way, it works even with 32KB block size. Agreed on making xid_wraparound compatible with a 32k block size. As you pointed out, with 32k blocks, XidSkip() can return values greater than 2000. So, the first step is to adjust the following if-condition by increasing "2000" to a higher value. Since XidSkip() can return up to 3276 (based on COMMIT_TS_XACTS_PER_PAGE (BLCKSZ / 10) with 32k blocks), we could, for instance, update "2000" to "4000." if (xids_left > 2000 && consumed - last_reported_at < REPORT_INTERVAL && MyProc->subxidStatus.overflowed) To ensure lastxid is set even in the shortcut case, what about modifying XidSkip() so it returns "distance - 1" instead of "distance"? This change would make consume_xids_common() run at least one more loop cycle, triggering GetNewTransactionId() and setting lastxid correctly. consumed = XidSkip(nextXid); if (consumed > 0) TransamVariables->nextXid.value += (uint64) consumed; BTW, the code snippet above in consume_xids_shortcut() could potentially set the next XID to a value below FirstNormalTransactionId? If yes, we should account for FirstNormalFullTransactionId when increasing the next XID, similar to how FullTransactionIdAdvance() handles it. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Better error reporting from extension scripts (Was: Extend ALTER OPERATOR)
In the no-good-deed-goes-unpunished department: buildfarm member hamerkop doesn't like this patch [1]. The diffs look like @@ -77,7 +77,7 @@ ERROR: syntax error at or near "FUNCTIN" LINE 1: CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE S... ^ -QUERY: CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL +QUERY: CREATE FUNCTIN my_erroneous_func(int) RETURNS int LANGUAGE SQL AS $$ SELECT $1 + 1 $$; CONTEXT: extension script file "test_ext7--2.0--2.1bad.sql", near line 10 alter extension test_ext7 update to '2.2bad'; It's hard to be totally sure from the web page, but I suppose what is happening is that the newline within the quoted query fragment is represented as "\r\n" not just "\n". (I wonder why the cfbot failed to detect this; there must be more moving parts involved than just "it's Windows".) The reason why this is happening seems fairly clear: extension.c's read_whole_file() opens the script file with PG_BINARY_R, preventing Windows' libc from hiding DOS-style newlines from us, even though the git checkout on that machine is evidently using DOS newlines. That seems like a rather odd choice. Elsewhere in the same module, parse_extension_control_file() opens control files with plain "r", so that those act differently. I checked the git history and did not learn much. The original extensions commit d9572c4e3 implemented reading with a call to read_binary_file(), and we seem to have just carried that behavioral decision forward through various refactorings. I don't recall if there was an intentional choice to use binary read or that was just a random decision to use an available function. So what I'd like to do to fix this is to change - if ((file = AllocateFile(filename, PG_BINARY_R)) == NULL) + if ((file = AllocateFile(filename, "r")) == NULL) The argument against that is it creates a nonzero chance of breaking somebody's extension script file on Windows. But there's a counter-argument that it might *prevent* bugs on Windows, by making script behavior more nearly identical to what it is on not-Windows. So I think that's kind of a wash. Other approaches we could take with perhaps-smaller blast radii include making script_error_callback() trim \r's out of the quoted text (ugly) or creating a variant expected-file (hard to maintain, and I wonder if it'd survive git-related newline munging). Thoughts? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2024-10-23%2011%3A00%3A37
Re: Useless field ispartitioned in CreateStmtContext
Hi! On 24.10.2024 16:07, hugo wrote: Hi! When looking at the partition-related code, I found that the ispartitioned field in CreateStmtContext is not used. It looks like we can safely remove it and avoid invalid assignment logic. Here's a very simple fix, any suggestion? diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c index 1e15ce10b48..6dea85cc2dc 100644 --- a/src/backend/parser/parse_utilcmd.c +++ b/src/backend/parser/parse_utilcmd.c @@ -89,7 +89,6 @@ typedef struct List *alist; /* "after list" of things to do after creating * the table */ IndexStmt *pkey; /* PRIMARY KEY index, if any */ - bool ispartitioned; /* true if table is partitioned */ PartitionBoundSpec *partbound; /* transformed FOR VALUES */ bool ofType; /* true if statement contains OF typename */ } CreateStmtContext; @@ -246,7 +245,6 @@ transformCreateStmt(CreateStmt *stmt, const char *queryString) cxt.blist = NIL; cxt.alist = NIL; cxt.pkey = NULL; - cxt.ispartitioned = stmt->partspec != NULL; cxt.partbound = stmt->partbound; cxt.ofType = (stmt->ofTypename != NULL); @@ -3401,7 +3399,6 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt, cxt.blist = NIL; cxt.alist = NIL; cxt.pkey = NULL; - cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE); cxt.partbound = NULL; cxt.ofType = false; I absolutely agree with you. I found that ispartitioned parameter has been defined but it is never used during optimization. I also noticed that its local copy is being redefined in the ATExecDropIdentity, ATExecSetIdentity and ATExecAddIdentity functions. So, I'm willing to agree with you. BTW, the regression tests were successful. -- Regards, Alena Rybakina Postgres Professional
Re: execute prepared statement passing parameter expression with COLLATE clause
jian he writes:
> select * from pktable where x = 'Å' collate ignore_accents;
> --return one row
> execute q6('Å' collate ignore_accents);
> --return zero rows
> not sure return zero rows is desired.

The parameter symbol just represents a value, which does not carry any
collation information. The collation to use was determined when the
prepared statement was parsed, and is not going to change on the basis of
what you write in EXECUTE.

We could have a discussion about whether this is desirable, but it's pretty
much moot, because this is how the SQL committee designed SQL's collation
feature. It's not going to change.

			regards, tom lane
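To make the consequence concrete, a minimal sketch (assuming the pktable and ignore_accents setup from upthread): since the collation is fixed when the statement is parsed, the COLLATE clause has to be written into the prepared statement itself rather than into the EXECUTE argument.

    PREPARE q7 (text) AS
      SELECT * FROM pktable WHERE x = $1 COLLATE ignore_accents;
    EXECUTE q7('Å');   -- compares under ignore_accents, matching the plain SELECT upthread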
Re: general purpose array_sort
"David G. Johnston" writes: > Composing function calls here seems quite undesirable from a performance > standpoint, but maybe I over-estimate the cost of > exploding-manipulating-freezing an array datum. Also, while I'm not in a > good position to judge the challenge of implementing the sort parameters I > would rather have them than not since the order by API has them (plus > performance). I also, maybe unreasonably, believe that our extensible type > system has already limited the maintenance burden. Oh! I had not actually looked at the details of what was being proposed here. I imagined "array_sort(anyarray)" and the sort would use the default sort order for the array's element type. This business with a textual representation of a sort clause seems like over-engineering ... except that it's also under-engineered, because the parsing is lame and incomplete. (No USING option, and the fact that collation comes from somewhere else seems impossibly confusing.) Let's drop that. As already noted, if you need a non-default sort order you can build what you want with a sub-select. The ambition here should be to make easy cases easy. regards, tom lane
vacuumdb --analyze-only (e.g., after a major upgrade) vs. partitioned tables: pg_statistic missing stats for the partitioned table itself
I just learned that vacuumdb --analyze-only doesn't update stats for the
partitioned table itself, taking care only of individual partitions (DDL
doesn't matter here):

# vacuumdb --analyze-only -U postgres test --verbose
...
INFO:  analyzing "public.measurement_2023_01"
INFO:  "measurement_2023_01": scanned 6370 of 6370 pages, containing 100 live rows and 0 dead rows; 3 rows in sample, 100 estimated total rows
INFO:  "measurement_2023_02": scanned 6257 of 6257 pages, containing 982279 live rows and 0 dead rows; 3 rows in sample, 982279 estimated total rows
INFO:  analyzing "public.measurement_2023_03"
INFO:  "measurement_2023_03": scanned 6483 of 6483 pages, containing 1017721 live rows and 0 dead rows; 3 rows in sample, 1017721 estimated total rows
...

test=# select starelid::regclass, count(*) from pg_statistic where starelid::regclass::text ~ 'measurement' group by 1 order by 1;
      starelid       | count
---------------------+-------
 measurement_2023_01 |     4
 measurement_2023_02 |     4
 measurement_2023_03 |     4
(3 rows)

While for the single-threaded SQL-level ANALYZE:

test=# analyze verbose measurement;
...
test=# select starelid::regclass, count(*) from pg_statistic where starelid::regclass::text ~ 'measurement' group by 1 order by 1;
      starelid       | count
---------------------+-------
 measurement         |     4
 measurement_2023_01 |     4
 measurement_2023_02 |     4
 measurement_2023_03 |     4
(4 rows)

This means that if, after running pg_upgrade, we use vacuumdb to update
stats faster, some stats may be missing, potentially leading to suboptimal
performance.

Additionally, it doesn't help that pg_stat_all_tables doesn't show
counters/timestamps for the partitioned table, even after SQL-level ANALYZE:

test=# select relname, analyze_count, autoanalyze_count, last_analyze, last_autoanalyze from pg_stat_user_tables where relname ~ 'measurement';
       relname       | analyze_count | autoanalyze_count |         last_analyze          | last_autoanalyze
---------------------+---------------+-------------------+-------------------------------+------------------
 measurement_2023_01 |             2 |                 0 | 2024-10-24 21:25:47.979958+00 |
 measurement_2023_02 |             2 |                 0 | 2024-10-24 21:25:48.070355+00 |
 measurement_2023_03 |             2 |                 0 | 2024-10-24 21:25:48.154613+00 |
(3 rows)

This is also discussed in
https://www.postgresql.org/message-id/flat/CADofcAXVbD0yGp_EaC9chmzsOoSai3jcfBCnyva3j0RRdRvMVA%40mail.gmail.com

I propose considering 3 fixes:

1) vacuumdb --analyze / --analyze-only to update stats for the partitioned
   table, so people using pg_upgrade are not in trouble

2) present the ANALYZE metadata for partitioned tables in pg_stat_all_tables

3) for old versions, either backpatch with fix (1) OR just add to the docs
   (and maybe to the final words pg_upgrade prints), suggesting something
   like this in addition to vacuumdb --analyze-only:

-- psql snippet
select format(
  'analyze verbose %I.%I;',
  relnamespace::oid::regnamespace,
  oid::regclass
) as vacuum_command
from pg_class
where relkind = 'p' \gexec

Additionally, I do like the idea of ANALYZE ONLY from the -general
discussion above (though there might be confusion with the logic of
--analyze and --analyze-only in vacuumdb).

Does it make sense?

Nik
Re: Using read_stream in index vacuum
Hi Andrey, I ran the following test with v7-0001-Prototype-B-tree-vacuum-streamlineing.patch to measure the performance improvement. --Table size of approx 2GB (Fits in RAM) postgres=# create unlogged table x_big as select i from generate_series(1,6e7) i; SELECT 6000 postgres=# create index on x_big(i); CREATE INDEX -- Perform updates to create dead tuples. postgres=# do $$ declare var int := 0; begin for counter in 1 .. 1e7 loop var := (SELECT floor(random() * (1e7 - 1 + 1) * 1)); UPDATE x_big SET i = i + 5 WHERE i = var; end loop; end; $$; postgres=# CREATE EXTENSION pg_buffercache; CREATE EXTENSION -- Evict Postgres buffer cache for this relation. postgres=# SELECT DISTINCT pg_buffercache_evict(bufferid) FROM pg_buffercache WHERE relfilenode = pg_relation_filenode('x_big'); pg_buffercache_evict -- t (1 row) postgres=# \timing on Timing is on. postgres=# vacuum x_big; VACUUM The timing does not seem to have improved with the patch. Timing with the patch: Time: 9525.696 ms (00:09.526) Timing without the patch: Time: 9468.739 ms (00:09.469) While writing this email, I realized I evicted buffers for the table and not the index. I will perform the test again. However, I would like to know your opinion on whether this looks like a valid test. Thank you, Rahila Syed On Thu, Oct 24, 2024 at 4:45 PM Andrey M. Borodin wrote: > > > > On 24 Oct 2024, at 10:15, Andrey M. Borodin > wrote: > > > > I've also added GiST vacuum to the patchset. > > I decided to add up a SP-GiST while having launched on pgconf.eu. > > > Best regards, Andrey Borodin. >
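As a follow-up to the note above about having evicted the table's buffers instead of the index's, a minimal sketch of the intended eviction step, assuming the index kept the default name (x_big_i_idx) that "CREATE INDEX ON x_big(i)" would pick; adjust the name if it differs:

    -- evict the B-tree index's buffers, mirroring the query used for the table
    SELECT DISTINCT pg_buffercache_evict(bufferid)
    FROM pg_buffercache
    WHERE relfilenode = pg_relation_filenode('x_big_i_idx');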
Re: Conflict detection for update_deleted in logical replication
Hi Hou-san, here are my review comments for patch v5-0001. == General 1. Sometimes in the commit message and code comments the patch refers to "transaction id" and other times to "transaction ID". The patch should use the same wording everywhere. == Commit message. 2. "While for concurrent remote transactions with earlier timestamps,..." I think this means: "But, for concurrent remote transactions with earlier timestamps than the DELETE,..." Maybe expressed this way is clearer. ~~~ 3. ... the resolution would be to convert the update to an insert. Change this to uppercase like done elsewhere: "... the resolution would be to convert the UPDATE to an INSERT. == doc/src/sgml/protocol.sgml 4. + +Primary WAL status update (B) + + + + Byte1('s') + + + Identifies the message as a primary WAL status update. + + + I felt it would be better if this is described as just a "Primary status update" instead of a "Primary WAL status update". Doing this makes it more flexible in case there is a future requirement to put more status values in here which may not be strictly WAL related. ~~~ 5. + +Standby WAL status request (F) + + + + Byte1('W') + + + Identifies the message as a request for the WAL status on the primary. + + + + + + 5a. Ditto the previous comment #4. Perhaps you should just call this a "Primary status request". ~ 5b. Also, The letter 'W' also seems chosen because of WAL. But it might be more flexible if those identifiers are more generic. e.g. 's' = the request for primary status update 'S' = the primary status update == src/backend/replication/logical/worker.c 6. + else if (c == 's') + { + TimestampTz timestamp; + + remote_lsn = pq_getmsgint64(&s); + timestamp = pq_getmsgint64(&s); + + maybe_advance_nonremovable_xid(&remote_lsn, &phase); + UpdateWorkerStats(last_received, timestamp, false); + } Since there's no equivalent #define or enum value, IMO it is too hard to know the intent of this code without already knowing the meaning of the magic letter 's'. At least there could be a comment here to explain that this is for handling an incoming "Primary status update" message. ~~~ maybe_advance_nonremovable_xid: 7. + * The oldest_nonremovable_xid is maintained in shared memory to prevent dead + * rows from being removed prematurely when the apply worker still needs them + * to detect update-delete conflicts. /update-delete/update_deleted/ ~ 8. + * applied and flushed locally. The process involves: + * + * DTR_REQUEST_WALSENDER_WAL_POS - Call GetOldestActiveTransactionId() to get + * the candidate xmin and send a message to request the remote WAL position + * from the walsender. + * + * DTR_WAIT_FOR_WALSENDER_WAL_POS - Wait for receiving the WAL position from + * the walsender. + * + * DTR_WAIT_FOR_LOCAL_FLUSH - Advance the non-removable transaction ID if the + * current flush location has reached or surpassed the received WAL position. 8a. This part would be easier to read if those 3 phases were indented from the rest of this function comment. ~ 8b. /Wait for receiving/Wait to receive/ ~ 9. + * Retaining the dead tuples for this period is sufficient for ensuring + * eventual consistency using last-update-wins strategy, which involves + * converting an UPDATE to an INSERT and applying it if remote transactions The commit message referred to a "latest_timestamp_wins". I suppose that is the same as what this function comment called "last-update-wins". The patch should use consistent terminology. 
It would be better if the commit message and (parts of) this function comment were just cut/pasted to be identical. Currently, they seem to be saying the same thing, but using slightly different wording. ~ 10. + static TimestampTz xid_advance_attemp_time = 0; + static FullTransactionId candidate_xid; typo in var name - "attemp" ~ 11. + *phase = DTR_WAIT_FOR_LOCAL_FLUSH; + + /* + * Do not return here because the apply worker might have already + * applied all changes up to remote_wal_pos. Proceeding to the next + * phase to check if we can immediately advance the transaction ID. + */ 11a. IMO this comment should be above the *phase assignment. 11b. /Proceeding to the next phase to check.../Instead, proceed to the next phase to check.../ ~ 12. + /* + * Advance the non-removable transaction id if the remote wal position + * has been received, and all transactions up to that position on the + * publisher have been applied and flushed locally. + */ Some minor re-wording would help clarify this comment. SUGGESTION Reaching here means the remote wal position has been received, and all transactions up to that position on the publisher have been applied and flushed locally. So, now we can advance the
Re: Inconsistent use of relpages = -1
On Thu, 2024-10-24 at 08:03 -0700, Jeff Davis wrote: > On Thu, 2024-10-24 at 05:01 +0300, Laurenz Albe wrote: > > What you write above indicates that "relpages" = 0 and "reltuples" > > > 0 > > would also be acceptable. > > As Tom pointed out, that creates a risk that it's interpreted as > infinite tuple denisity. > > The only functional change in my patch is to create the partitioned > table with relpages=-1 to be more consistent with the value after it's > analyzed. Great, then it cannot be wrong. Sorry for the noise! Yours, Laurenz Albe
Re: Commutation of array SOME/ANY and ALL operators
> Inventing commutator operators for LIKE etc could be a path of
> much less resistance (unless the operator names get bikeshedded
> to death).  Are there really that many that people need?
> A quick query of pg_operator suggests that the LIKE/regex family
> is the bulk of the problem for real-world cases.

Sadly, I was afraid that might be the case. It would be so nice if it
wasn't, but so be it if we have to go the path of least resistance.

We'd need a standard for the reversed version of a command that didn't
intersect with any other command or SQL keyword. We would also need a
standard for the reversed version of operators that don't have a
commutation that we want to support the opposite of. The only
commands/operators that I think are probably really wanted in that
category would be:

* LIKE
* ILIKE
* ~ (regex match)
* ~* (regex case insensitive match)
* !~ (not regex match)
* !~* (not regex case insensitive match)

Proposed (prefix the command/operator with ~ and put it before the ! if
it's a "not" expression):

* EKIL (LIKE backwards) or RLIKE or CLIKE or ~LIKE
* EKILI (ILIKE backwards) or RILIKE or CILIKE or ~ILIKE
* ~~
* ~~* or ~*
* !~~
* !~~* or !~* or *~!

On Wed, Oct 23, 2024 at 6:57 PM Tom Lane wrote:
> Matthew Morrissette Vance writes:
> > If instead, PostgreSQL could support the commutation of the `SOME/ANY` and
> > `ALL` operators so that the `ANY(array)` could be on both sides of the
> > provided operator, it would allow for this kind of searching natively.
>
> > Firstly, would a PR that enhanced PostgreSQL in this manner be accepted?
>
> My gut feeling is you'll run into insurmountable grammar-ambiguity
> problems.  I might be wrong, but I have an idea that this has
> already been tried and failed on that point.
>
> Inventing commutator operators for LIKE etc could be a path of
> much less resistance (unless the operator names get bikeshedded
> to death).  Are there really that many that people need?
> A quick query of pg_operator suggests that the LIKE/regex family
> is the bulk of the problem for real-world cases.
>
> 			regards, tom lane
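For what it's worth, here is a minimal sketch of what the "commutator operator for LIKE" idea could look like in plain SQL; the operator name (~~~) and the function name are placeholders chosen purely for illustration, since the naming is exactly what is being bikeshedded above:

    -- reverse_like(pattern, string) returns (string LIKE pattern)
    CREATE FUNCTION reverse_like(text, text) RETURNS boolean
        AS 'SELECT $2 LIKE $1'
        LANGUAGE sql IMMUTABLE STRICT;

    -- ~~ is the operator underlying LIKE, so it becomes the commutator
    CREATE OPERATOR ~~~ (
        LEFTARG = text,
        RIGHTARG = text,
        FUNCTION = reverse_like,
        COMMUTATOR = ~~
    );

    -- "pattern ~~~ string" is then equivalent to "string LIKE pattern"
    SELECT 'f%' ~~~ 'foo';   -- true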
Re: Can rs_cindex be < 0 for bitmap heap scans?
Thanks for the reply, Dilip! On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar wrote: > > On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman > wrote: >> >> HeapScanDescData->rs_cindex (the current index into the array of >> visible tuples stored in the heap scan descriptor) is used for >> multiple scan types, but for bitmap heap scans, AFAICT, it should >> never be < 0. > > You are right it can never be < 0 in this case at least. Cool, thanks for confirming. I will commit this change then. > In fact you don't need to explicitly set it to 0 in initscan[1], because > before calling heapam_scan_bitmap_next_tuple() we must call > heapam_scan_bitmap_next_block() and this function is initializing this to 0 > (hscan->rs_cindex = 0;). Anyway no object even if you prefer to initialize > in initscan(), just the point is that we are already doing it for each block > before fetching tuples from the block. I am inclined to still initialize it to 0 in initscan(). I was refactoring the bitmap heap scan code to use the read stream API and after moving some things around, this field was no longer getting initialized in heapam_scan_bitmap_next_block(). While that may not be a good reason to change it in this patch, most of the other fields in the heap scan descriptor (like rs_cbuf and rs_cblock) are re-initialized in initscan(), so it seems okay to do that there even though it isn't strictly necessary on master. - Melanie
Re: Pgoutput not capturing the generated columns
On Thu, Oct 24, 2024 at 12:15 PM vignesh C wrote: > > The attached v41 version patch has the changes for the same. > Please find comments for the new version as follows: 1. + Generated columns may be skipped during logical replication according to the + CREATE PUBLICATION option + + include_generated_columns. The above statement doesn't sound to be clear. Can we change it to: "Generated columns are allowed to be replicated during logical replication according to the CREATE PUBLICATION option .."? 2. static void publication_invalidation_cb(Datum arg, int cacheid, uint32 hashvalue); -static void send_relation_and_attrs(Relation relation, TransactionId xid, - LogicalDecodingContext *ctx, - Bitmapset *columns); static void send_repl_origin(LogicalDecodingContext *ctx, ... ... static RelationSyncEntry *get_rel_sync_entry(PGOutputData *data, Relation relation); +static void send_relation_and_attrs(Relation relation, TransactionId xid, + LogicalDecodingContext *ctx, + RelationSyncEntry *relentry); Why the declaration of this function is changed? 3. + /* + * Skip publishing generated columns if the option is not specified or + * if they are not included in the column list. + */ + if (att->attgenerated && !relentry->pubgencols && !columns) In the comment above, shouldn't "specified or" be "specified and"? 4. +pgoutput_pubgencol_init(PGOutputData *data, List *publications, + RelationSyncEntry *entry) { ... + foreach(lc, publications) + { + Publication *pub = lfirst(lc); + + /* No need to check column list publications */ + if (is_column_list_publication(pub, entry->publish_as_relid)) Are we ignoring column_list publications because for such publications the value of column_list prevails and we ignore 'publish_generated_columns' value? If so, it is not clear from the comments. 5. /* Initialize the column list */ pgoutput_column_list_init(data, rel_publications, entry); + + /* Initialize publish generated columns value */ + pgoutput_pubgencol_init(data, rel_publications, entry); + + /* + * Check if there is conflict with the columns selected for the + * publication. + */ + check_conflicting_columns(rel_publications, entry); } It looks odd to check conflicting column lists among publications twice once in pgoutput_column_list_init() and then in check_conflicting_columns(). Can we merge those? -- With Regards, Amit Kapila.
Re: Add support to TLS 1.3 cipher suites and curves lists
> On 16 Oct 2024, at 17:30, Jacob Champion > wrote: > Other than that, LGTM! Thanks for all the review work, I went ahead and pushed this patch series today after a little bit more polishing of comments and docs. So far plover has failed, which was expected due to the raised OpenSSL/LibreSSL requirement; I will reach out to BF animal owners for upgrades. -- Daniel Gustafsson
Re: Using Expanded Objects other than Arrays from plpgsql
I wrote: > ... I'm still writing up > details, but right now I'm envisioning completely separate sets of > rules for the prosupport case versus the no-prosupport case. So here is the design I've come up with for optimizing R/W expanded object updates in plpgsql without any special knowledge from a prosupport function. AFAICS this requires no assumptions at all about the behavior of called functions, other than the bare minimum "you can't corrupt the object to the point where it wouldn't be cleanly free-able". In particular that means it can work for user-written called functions in plpgsql, SQL, or whatever, not only for C-coded functions. There are two requirements to apply the optimization: * If the assignment statement is within a BEGIN ... EXCEPTION block, its target variable must be declared inside the most-closely-nested such block. This ensures that if an error is thrown from within the assignment statement's expression, we do not care about the value of the target variable, except to the extent of being able to clean it up. * The target variable must be referenced exactly once within the RHS expression. This avoids aliasing hazards such as we discussed upthread. But unlike the existing rule, that reference can be anywhere in the RHS --- it doesn't have to be an argument of the topmost function. So for example we can optimize foo := fee(fi(fo(fum(foo), other_variable), ...)); While I've not tried to write any code yet, I think both of these conditions should be reasonably easy to verify. Given that those conditions are met and the current value of the assignment target variable is a R/W expanded pointer, we can execute the assignment as follows: * Provide the R/W expanded pointer as the value of the Param representing the sole reference. At the same time, apply TransferExpandedObject to reparent the object under the transient eval_mcontext memory context that's being used to evaluate the RHS expression, and then set the target variable's value to NULL. (At this point, the R/W object has exactly the same logical status as any intermediate calculation result that's palloc'd in the eval_mcontext.) * At successful completion of the RHS, assign the result to the target variable normally. This includes, if it's an R/W expanded object, reparenting it under the calling function's main context. If the RHS expression winds up returning the same expanded object (probably mutated), then the outer function regains ownership of it after no more than a couple of TransferExpandedObject calls, which are cheap. If the RHS returns something different, then either the original expanded object got discarded during RHS evaluation, or it will be cleaned up when we reset the eval_mcontext, so that it won't be leaked. I didn't originally foresee the need to transfer the object into the transient memory context. But this design avoids any possibility of double-free attempts, which would be likely to happen if we allow the outer function's variable to hold onto a reference to the object while the RHS is evaluated. A function receiving an R/W reference is entitled to assume that that value is not otherwise referenced and can be freed when it's no longer needed, so it might well get freed during RHS evaluation. By converting the original R/W object into (effectively) a temporary value within the RHS evaluation, we make that assumption valid. So, while this design greatly expands the set of cases we can optimize, it does lose some cases that the old approach could support. 
I envision addressing that by allowing a prosupport function attached to the RHS' topmost function to "bless" other cases as safe, using reasoning similar to the old rules. (Or different rules, even, but it's on the prosupport function to be sure it's safe.) I don't have a detailed design in mind, but I'm thinking along the lines of just passing the whole RHS expression to the prosupport function and letting it decide what's safe. In any case, we don't need to even call the prosupport function unless there's an exception block or multiple RHS references to the target variable. regards, tom lane
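A hedged PL/pgSQL illustration of the two conditions described above, using ordinary array functions; whether the planned optimization would actually fire for this exact function is an assumption, the point is the required structure:

CREATE FUNCTION demo() RETURNS int[] LANGUAGE plpgsql AS $$
DECLARE
    other_variable int := 42;
BEGIN
    DECLARE
        foo int[] := ARRAY[1,2,3];   -- declared inside the innermost block
    BEGIN
        -- qualifies: foo is referenced exactly once on the RHS, and the
        -- reference need not be an argument of the topmost function
        foo := array_append(array_remove(foo, 2), other_variable);

        -- would NOT qualify: foo is referenced twice on the RHS
        -- foo := array_cat(foo, foo);

        RETURN foo;
    EXCEPTION
        WHEN others THEN
            RETURN NULL;   -- foo's old value is not needed if the RHS throws
    END;
END;
$$;

SELECT demo();   -- {1,3,42}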
Re: Inconsistent use of relpages = -1
On Wed, 2024-10-23 at 10:05 -0700, Jeff Davis wrote: > On Wed, 2024-10-23 at 04:47 +0200, Laurenz Albe wrote: > > On Tue, 2024-10-22 at 10:41 -0700, Jeff Davis wrote: > > > I attached a patch that creates partitioned tables with relpages=- > > > 1, > > > and updates the docs. > > > > Does this need any changes for pg_upgrade? > > pg_upgrade would go through the same DDL path, so I expect it would be > set to -1 in the upgraded cluster anyway. Are you seeing something > different? No; I was too lazy to check, so I asked. > In any case, the change is just meant to improve consistency and > documentation. I didn't intend to create a hard guarantee that it would > always be -1, so perhaps I should make the documentation more vague > with "may be" instead of "is"? Reading the thread and Tom's comments again, I get the impression that there are two states that we consider OK: - "relpages" and "reltuples" are both 0 - "relpages" is -1 and "reltuples" is greater than 0 What you write above indicates that "relpages" = 0 and "reltuples" > 0 would also be acceptable. As long as the code does not mistakenly assume that a partitioned table is empty, and as long as the documentation is not confusing, anything is fine with me. Currently, I am still a bit confused. Yours, Laurenz Albe
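A quick way to inspect which of these states a given cluster shows for its partitioned tables (relkind 'p'); the comment only restates the combinations being debated above, it does not assert which one is correct:

SELECT relname, relpages, reltuples
FROM pg_class
WHERE relkind = 'p'      -- partitioned tables only
ORDER BY relname;
-- The thread discusses which combinations should be considered normal,
-- e.g. relpages = 0 with reltuples = 0, or relpages = -1 with reltuples > 0.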
Re: general purpose array_sort
Hi, > It's hardly "general purpose" if it randomly refuses to > sort certain types. I would say it should be able to sort > anything that ORDER BY will handle --- and that certainly > includes the cases shown here. I wonder how useful / convenient the new function will be considering that we already have CTEs and can do: SELECT array_agg(x ORDER BY x) FROM unnest(ARRAY[5,1,3,2,4]) AS x; Perhaps there are use cases I didn't consider? -- Best regards, Aleksander Alekseev
Re: general purpose array_sort
Hi David, >> > It's hardly "general purpose" if it randomly refuses to >> > sort certain types. I would say it should be able to sort >> > anything that ORDER BY will handle --- and that certainly >> > includes the cases shown here. >> >> I wonder how useful / convenient the new function will be considering >> that we already have CTEs and can do: >> >> SELECT array_agg(x ORDER BY x) FROM unnest(ARRAY[5,1,3,2,4]) AS x; >> >> Perhaps there are use cases I didn't consider? >> > > Succinctness of expression. Plus I'm under the impression that a function > doing this is going to be somewhat faster than composing two functions > together within a multi-node subtree. > > I feel like the same observation could have been made for array_shuffle but > we added that. This function being added feels to me like just completing > the set. Right. To clarify, I'm not opposed to array_sort(). In fact personally I would prefer using it instead of array_agg() + unnest(). Just making an observation / thinking out loud that the requirement to support everything ORDER BY handles / supports (and will handle / support?) might make this function impractical to maintain. array_shuffle() or a recently proposed array_reverse() [1] are rather simple since they just move the array elements without looking at them. array_sort() does look at the elements and thus is very different. Particularly, does the function really need to support dir => asc | desc | asc nulls first | etc... ? Maybe array_sort() + array_reverse( array_sort() ) will handle most of the practical cases? I don't know. [1]: https://commitfest.postgresql.org/50/5314/ -- Best regards, Aleksander Alekseev
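For reference, the workaround that is runnable today versus the proposed function; the array_sort() and array_reverse() signatures are assumptions based on the proposals in this and the linked thread, neither is in core:

-- works on any released version:
SELECT array_agg(x ORDER BY x) FROM unnest(ARRAY[5,1,3,2,4]) AS x;
--> {1,2,3,4,5}

-- proposed equivalents (assumed signatures, not yet committed):
-- SELECT array_sort(ARRAY[5,1,3,2,4]);
-- SELECT array_reverse(array_sort(ARRAY[5,1,3,2,4]));   -- descending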
Re: Changing the default random_page_cost value
On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane wrote: > > On Mon, Oct 14, 2024 at 10:20 PM David Rowley wrote: >> >> Yeah, I think any effort to change the default value for this setting would >> require some analysis to prove that the newly proposed default >> is a more suitable setting than the current default. I mean, why 1.2 and not >> 1.1 or 1.3? Where's the evidence that 1.2 is the best value >> for this? > > As I said, I was just throwing that 1.2 number out there. It felt right, > although perhaps a tad high (which seems right as we keep things very > conservative). I agree we should make a best effort to have an accurate, > defendable default. We all know (I hope) that 4.0 is wrong for SSDs. I don't think we're going to find the correct new value for this setting by throwing randomly chosen numbers at each other on an email thread. Unfortunately, someone is going to have to do some work to figure out what the number should be, and then hopefully someone else can verify that work to check that person is correct. I'm not trying to be smart or funny here, but I just am failing to comprehend why you think you offering a number without any information about how you selected that number to set as the new default random_page_cost would be acceptable. Are you expecting someone else to go and do the work to prove that your selected number is the correct one? It's been 4 weeks since your first email and nobody has done that yet, so maybe you might need to consider other ways to achieve your goal. >> I don't think just providing evidence that random read times are closer to >> sequential read times on SSDs are closer than they are with >> HDDs is going to be enough. > > ... >> >> It would be nice to have this as a script so that other people could easily >> run it on their hardware to ensure that random_page_cost we >> choose as the new default is representative of the average hardware. > > > Heh, this is starting to feel like belling the cat (see > https://fablesofaesop.com/belling-the-cat.html) I don't see the similarity. Changing the default random_page_cost requires analysis to find what the new default should be. The execution of the actual change in default is dead simple. With belling the cat, it seems like the execution is the hard part and nobody is debating the idea itself. > Remember this is still just a default, and we should encourage people to > tweak it themselves based on their own workloads. I just want people to start > in the right neighborhood. I'll see about working on some more research / > generating a script, but help from others is more than welcome. You might be mistakenly thinking that the best random_page_cost is an exact ratio of how much slower a random seek and read is from a sequential read. There are unfortunately many other factors to consider. The correct setting is going to be the one where the chosen plan uses the scan method that's the fastest and knowing the answer to that is going to take some benchmarks on PostgreSQL. Our cost model simply just isn't perfect enough for you to assume that I/O is the only factor that changes between an Index Scan and a Seq Scan. I'd say it's not overly difficult to come up with test cases that go to prove the value you select is "correct". I've done this before for CPU-related costs. I think with I/O the main difference will be that your tests should be much larger, and doing that will mean getting the results takes much more time. 
Here's a link to some analysis I did to help solve a problem relating to partition-wise aggregates [1]. Maybe you can use a similar method to determine random_page_cost. David [1] https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com
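A rough sketch of the kind of benchmark being asked for: build a table larger than RAM, time the same query with each scan type forced, then check which random_page_cost value makes the unforced planner pick the faster plan. The table name, size and selectivity here are illustrative assumptions, not a recommendation:

CREATE TABLE rpc_test (id int, pad text);
INSERT INTO rpc_test
  SELECT g, repeat('x', 200) FROM generate_series(1, 10000000) g;
CREATE INDEX ON rpc_test (id);
VACUUM ANALYZE rpc_test;

-- time each scan type explicitly
SET enable_indexscan = off; SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM rpc_test WHERE id < 500000;
RESET enable_indexscan; RESET enable_bitmapscan; SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM rpc_test WHERE id < 500000;
RESET enable_seqscan;

-- then see where the planner's unforced choice flips
SET random_page_cost = 4.0;
EXPLAIN SELECT count(*) FROM rpc_test WHERE id < 500000;
SET random_page_cost = 1.2;
EXPLAIN SELECT count(*) FROM rpc_test WHERE id < 500000;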
Re: POC: make mxidoff 64 bits
Hi Maxim Orlov, > After a bit of thought, I've realized that to be conservative here is the way to go. >We can reuse a maximum of existing logic. I mean, we can remove offset wraparound "error logic" and reuse "warning logic". But set the threshold for "warning >logic" to a much higher value. For now, I choose 2^32-1. In other words, legit logic, in my view, here would be to trigger autovacuum if the number of offsets (i.e. >difference nextOffset - oldestOffset) exceeds 2^32-1. PFA patch set. Good point, couldn't agree with you more. xid64 is the solution to the wraparound problem, so the previous error log is no longer meaningful. But we might want to refine the output warning log a little (for example, checking the underlying reasons why the age has been increasing), though we don't have to worry about xid wraparound. +/* + * Multixact members warning threshold. + * + * If difference bettween nextOffset and oldestOffset exceed this value, we + * trigger autovacuum in order to release the disk space if possible. + */ +#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0x) Can we refine this annotation a bit? For example: +/* + * Multixact members warning threshold. + * + * If the difference between nextOffset and oldestOffset exceeds this value, we + * trigger autovacuum in order to release the disk space and reduce table bloat if possible. + */ +#define MULTIXACT_MEMBER_AUTOVAC_THRESHOLD UINT64CONST(0x) Thanks. Maxim Orlov wrote on Wed, 23 Oct 2024 at 23:55: > After a bit of thought, I've realized that to be conservative here is the > way to go. > > We can reuse a maximum of existing logic. I mean, we can remove offset > wraparound "error logic" and reuse "warning logic". But set the threshold > for "warning logic" to a much higher value. For now, I choose 2^32-1. In > other words, legit logic, in my view, here would be to trigger autovacuum > if the number of offsets (i.e. difference nextOffset - oldestOffset) > exceeds 2^32-1. PFA patch set. > > -- > Best regards, > Maxim Orlov. >
Re: Changing the default random_page_cost value
Hi Greg Sabino Mullane, Another thing: simply changing the configuration template is not effective; you also need to modify the DEFAULT_RANDOM_PAGE_COST value: { {"random_page_cost", PGC_USERSET, QUERY_TUNING_COST, gettext_noop("Sets the planner's estimate of the cost of a " "nonsequentially fetched disk page."), NULL, GUC_EXPLAIN }, &random_page_cost, DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX, NULL, NULL, NULL }, src/include/optimizer/cost.h /* defaults for costsize.c's Cost parameters */ /* NB: cost-estimation code should use the variables, not these constants! */ /* If you change these, update backend/utils/misc/postgresql.conf.sample */ #define DEFAULT_SEQ_PAGE_COST 1.0 #define DEFAULT_RANDOM_PAGE_COST 4.0 #define DEFAULT_CPU_TUPLE_COST 0.01 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005 #define DEFAULT_CPU_OPERATOR_COST 0.0025 #define DEFAULT_PARALLEL_TUPLE_COST 0.1 #define DEFAULT_PARALLEL_SETUP_COST 1000.0 Thanks
Re: Changing the default random_page_cost value
On Mon, Oct 14, 2024 at 10:20 PM David Rowley wrote: > Yeah, I think any effort to change the default value for this setting > would require some analysis to prove that the newly proposed default > is a more suitable setting than the current default. I mean, why 1.2 and > not 1.1 or 1.3? Where's the evidence that 1.2 is the best value > for this? > As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems right as we keep things very conservative). I agree we should make a best effort to have an accurate, defendable default. We all know (I hope) that 4.0 is wrong for SSDs. > I don't think just providing evidence that random read times are closer to > sequential read times on SSDs are closer than they are with > HDDs is going to be enough. ... > It would be nice to have this as a script so that other people could > easily run it on their hardware to ensure that random_page_cost we > choose as the new default is representative of the average hardware. Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html) Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome. Cheers, Greg
Re: Missing installation of Kerberos.pm and AdjustUpgrade.pm
On Thu, Oct 24, 2024 at 02:51:05PM +0900, Michael Paquier wrote: > While catching up with some other thing, I've been reminded that this > has not been addressed. Any objections if I were to apply the > attached to add install and uninstall rules for Kerberos.pm and > AdjustUpgrade.pm? > > If people would prefer a backpatch, please let me know. Hearing nothing, applied this one on HEAD for now. -- Michael
Re: pgbench: Improve result outputs related to failed transactinos
Hello Tatsuo-san, 11.10.2024 07:54, Tatsuo Ishii wrote: ... - number of transactions actually pocessed: 1 (tps = 410.846343) ... Patch pushed. Please consider fixing a typo sneaked in that commit: pocessed -> processed? Best regards, Alexander
Re: Refactor to use common function 'get_publications_str'.
On Fri, Oct 25, 2024 at 10:00 AM Michael Paquier wrote: > > On Fri, Oct 25, 2024 at 09:28:47AM +1100, Peter Smith wrote: > > I've attached the patch v4. > > Looks OK to me. Thanks. I'll see to get that done through the day. > -- Thanks for pushing! == Kind Regards, Peter Smith. Fujitsu Australia
Re: proposal: schema variables
st 23. 10. 2024 v 6:11 odesílatel Laurenz Albe napsal: > I have gone over patch 3 from the set and worked on the comments. > > Apart from that, I have modified your patch as follows: > > > +/* > > + * pg_session_variables - designed for testing > > + * > > + * This is a function designed for testing and debugging. It returns > the > > + * content of sessionvars as-is, and can therefore display entries about > > + * session variables that were dropped but for which this backend didn't > > + * process the shared invalidations yet. > > + */ > > +Datum > > +pg_session_variables(PG_FUNCTION_ARGS) > > +{ > > +#define NUM_PG_SESSION_VARIABLES_ATTS 8 > > + > > + elog(DEBUG1, "pg_session_variables start"); > > I don't think that message is necessary, particularly with DEBUG1. > I have removed this message and the "end" message as well. > removed > > > + while ((svar = (SVariable) hash_seq_search(&status)) != > NULL) > > + { > > + Datum > values[NUM_PG_SESSION_VARIABLES_ATTS]; > > + bool > nulls[NUM_PG_SESSION_VARIABLES_ATTS]; > > + HeapTuple tp; > > + boolvar_is_valid = false; > > + > > + memset(values, 0, sizeof(values)); > > + memset(nulls, 0, sizeof(nulls)); > > Instead of explicitly zeroing out the arrays, I have used an empty > initializer > in the definition, like > > bool nulls[NUM_PG_SESSION_VARIABLES_ATTS] = {}; > > That should have the same effect. > If you don't like that, I have no real problem with your original code. > I prefer the original way - minimally it is a common pattern. I didn't find any usage of `= {} ` in code > > + values[0] = ObjectIdGetDatum(svar->varid); > > + values[3] = ObjectIdGetDatum(svar->typid); > > You are using the type ID without checking if it exists in the catalog. > I think that is a bug. > The original idea was using typid as hint identification of deleted variables. The possibility that this id will not be consistent for the current catalogue was expected. And it is a reason why the result type is just Oid and not regtype. Without it, pg_session_variables shows just empty rows (except oid) for dropped not yet purged variables. > > There is a dependency between the variable and the type, but if a > concurrent > session drops both the variable and the data type, the non-existing type ID > would still show up in the function output. > Even worse, the OID could have been reused for a different type since. > I agree so usage `select typid::regtype, ... from pg_session_variables(.. ` can be dangerous, but it is the reason why we have there typname column. Showing typid has some information value, but I don't think it is absolutely necessary. I see some possible changes: 1. no change 2. remove typid column 3. show typid only when variable is valid, and using regtype as output type, remove typname What do you prefer? I'll send the attachment with merging changes for patch 4 Regards Pavel > > I am attaching just patch number 3 and leave you to adapt the patch set, > but I don't think any of the other patches should be affected. > The comments from your patch are merged > > Yours, > Laurenz Albe >
Re: Docs Build in CI failing with "failed to load external entity"
On Fri, Oct 25, 2024 at 4:44 AM Tom Lane wrote: > Melanie Plageman writes: > > I know in the past docs builds failing with "failed to load external > > entity" have happened on macos. But, recently I've noticed this > > failure for docs build on CI (not on macos) -- docs build is one of > > the jobs run under the "Compiler Warnings" task. > > It looks to me like a broken docbook installation on (one of?) > the CI machines. Note that the *first* complaint is > > [19:23:20.590] file:///etc/xml/catalog:1: parser error : Document is empty Yeah. That CI job runs on a canned Debian image that is rebuilt and republished every couple of days to make sure it's using up to date packages and kernel etc. Perhaps the package installation silently corrupted /etc/xml/catalog, given that multiple packages probably mess with it, though I don't have a specific theory for how that could happen, given that package installation seems to be serial... The installation log doesn't seem to show anything suspicious. https://github.com/anarazel/pg-vm-images/ https://cirrus-ci.com/github/anarazel/pg-vm-images https://cirrus-ci.com/build/5427240429682688 https://api.cirrus-ci.com/v1/task/6621385303261184/logs/build_image.log I tried simply reinstalling docbook-xml in my own github account (which is showing the problem), and it cleared the error: setup_additional_packages_script: | -#apt-get update -#DEBIAN_FRONTEND=noninteractive apt-get -y install ... +apt-get update +DEBIAN_FRONTEND=noninteractive apt-get -y install --reinstall docbook-xml https://cirrus-ci.com/task/6458406242877440 I wonder if this will magically fix itself when the next CI image build cron job kicks off. I have no idea what time zone this page is showing but it should happen in another day or so, unless Andres is around to kick it sooner: https://cirrus-ci.com/github/anarazel/pg-vm-images
Re: Retire support for OpenSSL 1.1.1 due to raised API requirements
This has now been committed via the TLS 1.3 ciphersuite patchset. -- Daniel Gustafsson
Re: Docs Build in CI failing with "failed to load external entity"
Melanie Plageman writes: > I know in the past docs builds failing with "failed to load external > entity" have happened on macos. But, recently I've noticed this > failure for docs build on CI (not on macos) -- docs build is one of > the jobs run under the "Compiler Warnings" task. It looks to me like a broken docbook installation on (one of?) the CI machines. Note that the *first* complaint is [19:23:20.590] file:///etc/xml/catalog:1: parser error : Document is empty I suspect that the subsequent "failed to load external entity" complaints happen because the XML processor doesn't find any DTD objects in the local catalog, so it tries to go out to the net for them, and is foiled by the --no-net switch we use. regards, tom lane
Re: Wrong results with grouping sets
On Thu, Oct 10, 2024 at 6:51 PM Richard Guo wrote: > On Thu, Oct 10, 2024 at 4:06 PM Richard Guo wrote: > > While we can fix this issue by propagating the hasGroupRTE mark from > > the EXISTS subquery to the parent, a better fix might be to remove the > > subquery's RTE_GROUP entry, since we have dropped the subquery's > > groupClause before the pull-up (see simplify_EXISTS_query). > > Here is the patch. I've pushed this patch with minor tweaks. Thanks again for the report! Thanks Richard
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 7:11 PM Melanie Plageman wrote: > Thanks for the reply, Dilip! > > On Thu, Oct 24, 2024 at 12:50 AM Dilip Kumar > wrote: > > > > On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman < > melanieplage...@gmail.com> wrote: > >> > >> HeapScanDescData->rs_cindex (the current index into the array of > >> visible tuples stored in the heap scan descriptor) is used for > >> multiple scan types, but for bitmap heap scans, AFAICT, it should > >> never be < 0. > > > > You are right it can never be < 0 in this case at least. > > Cool, thanks for confirming. I will commit this change then. > +1 > > > In fact you don't need to explicitly set it to 0 in initscan[1], because > before calling heapam_scan_bitmap_next_tuple() we must call > heapam_scan_bitmap_next_block() and this function is initializing this to 0 > (hscan->rs_cindex = 0;). Anyway no object even if you prefer to initialize > in initscan(), just the point is that we are already doing it for each > block before fetching tuples from the block. > > I am inclined to still initialize it to 0 in initscan(). I was > refactoring the bitmap heap scan code to use the read stream API and > after moving some things around, this field was no longer getting > initialized in heapam_scan_bitmap_next_block(). While that may not be > a good reason to change it in this patch, most of the other fields in > the heap scan descriptor (like rs_cbuf and rs_cblock) are > re-initialized in initscan(), so it seems okay to do that there even > though it isn't strictly necessary on master. > make sense -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: cache lookup failed when \d t concurrent with DML change column data type
On 10/25/24 10:05, Andrei Lepikhov wrote: On 10/24/24 22:30, jian he wrote: hi. I think I found a bug. PostgreSQL 18devel_debug_build_45188c2ea2 on x86_64-linux, compiled by gcc-14.1.0, 64-bit commit at 45188c2ea2. Ubuntu 22.04.4 LTS setup: drop table t cascade; create table t(a int PRIMARY key); IN session1: step "change data type" {begin; alter table t alter column a set data type int4;} step "s1" {commit;} IN session2: step "psql_another_session" {\d t} permutation "change data type" "psql_another_session" "s1" ERROR: cache lookup failed for attribute 1 of relation 34418 Yes, it looks like a bug that has existed for a long time, at least since PG11 (I didn't trace further down). It seems that the backend didn't apply invalidation messages before touching system caches. Backtrace: After a short investigation, I found the origin: pg_get_indexdef takes an index OID and gets all the stuff it needs just by looking up syscaches. But it wants to build a list of relation column names at a specific moment and opens the heap relation. After that operation, we already have the syscaches updated and the old index OID replaced with the new one. It might have made sense to lock the rows of the replaced index in pg_class and pg_index until the transaction that altered it is committed. But, because ALTER TABLE is not fully MVCC-safe, it may be expected (or acceptable) behaviour. -- regards, Andrei Lepikhov
Re: Retire support for OpenSSL 1.1.1 due to raised API requirements
On Thu, Oct 24, 2024 at 07:00:52PM +0200, Daniel Gustafsson wrote: > This has now been committed via the TLS 1.3 ciphersuite patchset. Nice, thanks for 6c66b7443ceb! -- Michael
execute prepared statement passing parameter expression with COLLATE clause
hi. $Subject setup: CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false); CREATE COLLATION ignore_accents (provider = icu, locale = '@colStrength=primary;colCaseLevel=yes', deterministic = false); DROP TABLE IF EXISTS pktable cascade; CREATE TABLE pktable (x text COLLATE case_insensitive); INSERT INTO pktable VALUES ('A'); DEALLOCATE q6; PREPARE q6 AS SELECT * FROM pktable WHERE x = $1; select * from pktable where x = 'Å' collate ignore_accents; --returns one row execute q6('Å' collate ignore_accents); --returns zero rows I'm not sure returning zero rows is desired.
Re: proposal: schema variables
... and here is a review for patch 4 I didn't change any code, just added the odd article to a comment. While running the regression tests with "make installcheck", I noticed two problems: --- /home/laurenz/postgresql/src/test/regress/expected/session_variables.out 2024-10-24 11:14:06.717663613 +0300 +++ /home/laurenz/postgresql/src/test/regress/results/session_variables.out 2024-10-24 11:15:37.999286228 +0300 @@ -30,6 +30,7 @@ GRANT ALL ON SCHEMA svartest TO regress_variable_owner; CREATE VARIABLE svartest.var1 AS int; CREATE ROLE regress_variable_reader; +ERROR: role "regress_variable_reader" already exists I suggest that patch 0001 should drop role "regress_variable_reader" again. @@ -107,7 +108,7 @@ CONTEXT: SQL function "sqlfx" statement 1 SELECT plpgsqlfx(20); ERROR: permission denied for session variable var1 -CONTEXT: SQL expression "$1 + var1" +CONTEXT: PL/pgSQL expression "$1 + var1" That looks like bit rot from your commit 4af123ad45. Yours, Laurenz Albe From 027fb062ecc0840bc5c2d135ebbc8ddc6b046f96 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Thu, 24 Oct 2024 11:17:12 +0300 Subject: [PATCH v20241024] DISCARD VARIABLES Implementation of DISCARD VARIABLES commands by removing hash table with session variables and resetting related memory context. --- doc/src/sgml/ref/discard.sgml | 13 +++- src/backend/commands/discard.c| 6 ++ src/backend/commands/session_variable.c | 28 - src/backend/parser/gram.y | 6 ++ src/backend/tcop/utility.c| 3 + src/bin/psql/tab-complete.in.c| 2 +- src/include/commands/session_variable.h | 2 + src/include/nodes/parsenodes.h| 1 + src/include/tcop/cmdtaglist.h | 1 + .../regress/expected/session_variables.out| 63 +++ src/test/regress/sql/session_variables.sql| 36 +++ 11 files changed, 158 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/ref/discard.sgml b/doc/src/sgml/ref/discard.sgml index bf44c523cac..61b967f9c9b 100644 --- a/doc/src/sgml/ref/discard.sgml +++ b/doc/src/sgml/ref/discard.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP } +DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP | VARIABLES } @@ -66,6 +66,16 @@ DISCARD { ALL | PLANS | SEQUENCES | TEMPORARY | TEMP } + +VARIABLES + + + Resets the value of all session variables. If a variable + is later reused, it is re-initialized to NULL. 
+ + + + TEMPORARY or TEMP @@ -93,6 +103,7 @@ SELECT pg_advisory_unlock_all(); DISCARD PLANS; DISCARD TEMP; DISCARD SEQUENCES; +DISCARD VARIABLES; diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c index 92d983ac748..d06ebaba6f4 100644 --- a/src/backend/commands/discard.c +++ b/src/backend/commands/discard.c @@ -18,6 +18,7 @@ #include "commands/async.h" #include "commands/discard.h" #include "commands/prepare.h" +#include "commands/session_variable.h" #include "commands/sequence.h" #include "utils/guc.h" #include "utils/portal.h" @@ -48,6 +49,10 @@ DiscardCommand(DiscardStmt *stmt, bool isTopLevel) ResetTempTableNamespace(); break; + case DISCARD_VARIABLES: + ResetSessionVariables(); + break; + default: elog(ERROR, "unrecognized DISCARD target: %d", stmt->target); } @@ -75,4 +80,5 @@ DiscardAll(bool isTopLevel) ResetPlanCache(); ResetTempTableNamespace(); ResetSequenceCaches(); + ResetSessionVariables(); } diff --git a/src/backend/commands/session_variable.c b/src/backend/commands/session_variable.c index 19f772a9fb6..eedce691bc0 100644 --- a/src/backend/commands/session_variable.c +++ b/src/backend/commands/session_variable.c @@ -94,7 +94,13 @@ pg_variable_cache_callback(Datum arg, int cacheid, uint32 hashvalue) elog(DEBUG1, "pg_variable_cache_callback %u %u", cacheid, hashvalue); - Assert(sessionvars); + /* + * There is no guarantee of session variables being initialized, even when + * receiving an invalidation callback, as DISCARD [ ALL | VARIABLES ] + * destroys the hash table entirely. + */ + if (!sessionvars) + return; /* * If the hashvalue is not specified, we have to recheck all currently @@ -658,3 +664,23 @@ pg_session_variables(PG_FUNCTION_ARGS) return (Datum) 0; } + +/* + * Fast drop of the complete content of the session variables hash table, and + * cleanup of any list that wouldn't be relevant anymore. + * This is used by the DISCARD VARIABLES (and DISCARD ALL) command. + */ +void +ResetSessionVariables(void) +{ + /* destroy hash table and reset related memory context */ + if (sessionvars) + { + hash_destroy(sessionvars); + sessionvars = NULL; + } + + /* release memory allocated by session variables */ + if (SVariableMemoryContext != NULL) + MemoryContextReset(SVariableMemoryContext
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Amit Langote wrote on Thu, 24 Oct 2024 at 14:33: > Hi, > > On Thu, Oct 24, 2024 at 1:46 PM Tender Wang wrote: > > Tender Wang wrote on Wed, 23 Oct 2024 at 21:48: > >> > >> Hi all, > >> > >> I find another issue as $SUBJECT when I work on [1]. > > > > When I continue to work on this, I find below issue. But I'm not sure > whether it is a bug. > > > > postgres=# create table part_index(a text primary key) partition by list > ( a collate "POSIX"); > > ERROR: unique constraint on partitioned table must include all > partitioning columns > > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" > which is part of the partition key. > > postgres=# create table part_index(a text) partition by list ( a collate > "POSIX"); > > CREATE TABLE > > postgres=# alter table part_index add primary key (a); > > ERROR: unique constraint on partitioned table must include all > partitioning columns > > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" > which is part of the partition key. > > > > It seems we can't create a primary key if the collation is different > between columnDef and PartitionKey. > > Yeah, you don't want to have the PK index and the partitioning logic > to not be in sync about the collation rules applied to the individual > rows. > > > By the way, I think the error message is misleading to users. > > ostgres=# alter table part_index add primary key (a); > > ERROR: unique constraint on partitioned table must include all > partitioning columns > > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" > which is part of the partition key. > > I think it's kind of similar to the message you get when a GROUP BY > column's collation doesn't match the column appearing in the SELECT > list: > > explain SELECT c collate case_insensitive, count(c) FROM > pagg_tab_case_s GROUP BY c collate "C"; > ERROR: column "pagg_tab_case_s.c" must appear in the GROUP BY clause > or be used in an aggregate function > LINE 1: explain SELECT c collate case_insensitive, count(c) FROM pag... > > Perhaps it would be more helpful for the error message or hint or > detail to mention the actual discrepancy (collation mismatch) that's > causing the error. > > There might be other instances of such an error and I am not sure it > would be worthwhile to find and fix them all. > > Thanks for the explanation. We had better focus on the wrong-result issue. I feel it's hard to use only one struct (for example, X) such that simply calling equal(X, expr) checks both the expression match and the collation match. Maybe we should add another collation match check in match_clause_to_partition_key(), like the partition pruning logic does. Any thoughts? -- Thanks, Tender Wang
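A heavily hedged sketch of the kind of extra check being suggested; the function it would live in, the variable names, and the exact placement are all assumptions for illustration only:

/* after the existing expression match ... */
if (!equal(expr, partexpr))
    continue;

/* ... also insist that the collations agree, as the pruning code does */
if (OidIsValid(partcoll) &&
    partcoll != exprCollation((Node *) expr))
    continue;    /* collation mismatch: do not treat this as a match */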
Re: [BUG FIX] Fix validation of COPY options FORCE_NOT_NULL/FORCE_NULL
On Thu, Oct 24, 2024 at 09:41:03AM +0300, Joel Jacobson wrote: > Therefore closing this patch and marking it as Committed. Thanks. I have managed to miss the CF entry. -- Michael
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Hi, On Thu, Oct 24, 2024 at 1:46 PM Tender Wang wrote: > Tender Wang 于2024年10月23日周三 21:48写道: >> >> Hi all, >> >> I find another issue as $SUBJECT when I work on [1]. > > When I continue to work on this, I find below issue. But I'm not sure whether > it is a bug. > > postgres=# create table part_index(a text primary key) partition by list ( a > collate "POSIX"); > ERROR: unique constraint on partitioned table must include all partitioning > columns > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which > is part of the partition key. > postgres=# create table part_index(a text) partition by list ( a collate > "POSIX"); > CREATE TABLE > postgres=# alter table part_index add primary key (a); > ERROR: unique constraint on partitioned table must include all partitioning > columns > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which > is part of the partition key. > > It seems we can't create a primary key if the collation is different between > columnDef and PartitionKey. Yeah, you don't want to have the PK index and the partitioning logic to not be in sync about the collation rules applied to the individual rows. > By the way, I think the error message is misleading to users. > ostgres=# alter table part_index add primary key (a); > ERROR: unique constraint on partitioned table must include all partitioning > columns > DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which > is part of the partition key. I think it's kind of similar to the message you get when a GROUP BY column's collation doesn't match the column appearing in the SELECT list: explain SELECT c collate case_insensitive, count(c) FROM pagg_tab_case_s GROUP BY c collate "C"; ERROR: column "pagg_tab_case_s.c" must appear in the GROUP BY clause or be used in an aggregate function LINE 1: explain SELECT c collate case_insensitive, count(c) FROM pag... Perhaps it would be more helpful for the error message or hint or detail to mention the actual discrepancy (collation mismatch) that's causing the error. There might be other instances of such an error and I am not sure it would be worthwhile to find and fix them all. -- Thanks, Amit Langote
Re: Pgoutput not capturing the generated columns
On Wed, Oct 23, 2024 at 11:51 AM Amit Kapila wrote: > > Thanks. One more thing that I didn't like about the patch is that it > used column_list to address the "publish_generated_columns = false" > case such that we build column_list without generated columns for the > same. The first problem is that it will add overhead to always probe > column_list during proto.c calls (for example during > logicalrep_write_attrs()), then it makes the column_list code complex > especially the handling in pgoutput_column_list_init(), and finally > this appears to be a misuse of column_list. > > So, I suggest remembering this information in RelationSyncEntry and > then using it at the required places. We discussed above that > contradictory values of "publish_generated_columns" across > publications for the same relations are not accepted, so we can detect > that during get_rel_sync_entry() and give an ERROR for the same. > The changes in tablesync look complicated and I am not sure whether it handles the conflicting publish_generated_columns option correctly. I have few thoughts for the same. * The handling of contradictory options in multiple publications needs to be the same as for column lists. I think it is handled (a) during subscription creation, (b) during copy in fetch_remote_table_info(), and (c) during replication. See Peter's email (https://www.postgresql.org/message-id/CAHut%2BPs985rc95cB2x5yMF56p6Lf192AmCJOpAtK_%2BC5YGUF2A%40mail.gmail.com) to understand why this new option has to be handled in the same way as the column list. * While fetching column list via pg_get_publication_tables(), we should detect contradictory publish_generated_columns options similar to column lists, and then after we get publish_generated_columns as return value, we can even use that while fetching attribute information. A few additional comments: 1. - /* Regular table with no row filter */ - if (lrel.relkind == RELKIND_RELATION && qual == NIL) + /* + * Check if the remote table has any generated columns that should be + * copied. + */ + for (int i = 0; i < relmapentry->remoterel.natts; i++) + { + if (lrel.attremotegen[i]) + { + gencol_copy_needed = true; + break; + } + } Can't we get this information from fetch_remote_table_info() instead of traversing the entire column list again? 2. @@ -1015,7 +1110,6 @@ fetch_remote_table_info(char *nspname, char *relname, ExecDropSingleTupleTableSlot(slot); lrel->natts = natt; - walrcv_clear_result(res); Spurious line removal. 3. Why do we have to specifically exclude generated columns of a subscriber-side table in make_copy_attnamelist()? Can't we rely on fetch_remote_table_info() and logicalrep_rel_open() that the final remote attrlist will contain the generated column only if the subscriber doesn't have a generated column otherwise it would have given an error in logicalrep_rel_open()? -- With Regards, Amit Kapila.
Typo spotted in GetFileBackupMethod comment(?)
Hey Robert, I'm investigating how PostgreSQL 17 does incremental backup and found the comment on GetFileBackupMethod a bit off. relative_block_numbers being an array of at *most* RELSEG_SIZE makes more sense to me. So I made this patch to address it. Note that this is the first time I have made a patch for the community, so kindly bear with me if I did it wrong somehow. Best, Haoran v01-fix_GetFileBackupMethod_comment.patch Description: Binary data
Re: Using read_stream in index vacuum
I've also added GiST vacuum to the patchset. > On 24 Oct 2024, at 01:04, Melanie Plageman wrote: > > On Wed, Oct 23, 2024 at 4:29 PM Andrey M. Borodin > wrote: >> >>> On 23 Oct 2024, at 20:57, Andrey M. Borodin wrote: >>> >>> I'll think how to restructure flow there... >> >> OK, I've understood how it should be structured. PFA v5. Sorry for the noise. > > I think this would be a bit nicer: > > while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL))) > { > block = btvacuumpage(&vstate, buf); > if (info->report_progress) > pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE, block); > } OK. > Maybe change btvacuumpage() to return the block number to avoid the > extra BufferGetBlockNumber() calls (those add up). Makes sense... now we have a non-obvious function prototype, but I think it's worth it. > While looking at this, I started to wonder if it isn't incorrect that > we are not calling pgstat_progress_update_param() for the blocks that > we backtrack and read in btvacuumpage() too (on master as well). > btvacuumpage() may actually have scanned more than one block, so... It's correct; the user wants to know progress, and backtracks do not count towards it. Though it must be a relatively infrequent case. > Unrelated to code review, but btree index vacuum has the same issues > that kept us from committing streaming heap vacuum last release -- > interactions between the buffer access strategy ring buffer size and > the larger reads -- one of which is an increase in the number of WAL > syncs and writes required. Thomas mentions it here [1] and here [2] is > the thread where he proposes adding vectored writeback to address some > of these issues. Do I need to do anything about it? Thank you! Best regards, Andrey Borodin. v6-0001-Prototype-B-tree-vacuum-streamlineing.patch Description: Binary data v6-0002-Use-read_stream-in-GiST-vacuum.patch Description: Binary data
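A sketch of the loop shape discussed above, assuming btvacuumpage() is changed to return the block number it processed; this mirrors the discussion, not necessarily the final patch:

static BlockNumber
btvacuumpage(BTVacState *vstate, Buffer buf)
{
    BlockNumber scanblkno = BufferGetBlockNumber(buf);
    ...
    return scanblkno;
}

/* caller, driven by the read stream */
while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
    block = btvacuumpage(&vstate, buf);

    if (info->report_progress)
        pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE, block);
}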
Re: Enhancing Memory Context Statistics Reporting
Hi Torikoshia, Thank you for reviewing the patch! On Wed, Oct 23, 2024 at 9:28 AM torikoshia wrote: > On 2024-10-22 03:24, Rahila Syed wrote: > > Hi, > > > > PostgreSQL provides following capabilities for reporting memory > > contexts statistics. > > 1. pg_get_backend_memory_contexts(); [1] > > 2. pg_log_backend_memory_contexts(pid); [2] > > > > [1] provides a view of memory context statistics for a local backend, > > while [2] prints the memory context statistics of any backend or > > auxiliary > > process to the PostgreSQL logs. Although [1] offers detailed > > statistics, > > it is limited to the local backend, restricting its use to PostgreSQL > > client backends only. > > On the other hand, [2] provides the statistics for all backends but > > logs them in a file, > > which may not be convenient for quick access. > > > > I propose enhancing memory context statistics reporting by combining > > these > > capabilities and offering a view of memory statistics for all > > PostgreSQL backends > > and auxiliary processes. > > Thanks for working on this! > > I originally tried to develop something like your proposal in [2], but > there were some difficulties and settled down to implement > pg_log_backend_memory_contexts(). > > Yes. I am revisiting this problem :) > > Attached is a patch that implements this functionality. It introduces > > a SQL function > > that takes the PID of a backend as an argument, returning a set of > > records, > > each containing statistics for a single memory context. The underlying > > C function > > sends a signal to the backend and waits for it to publish its memory > > context statistics > > before returning them to the user. The publishing backend copies > > these statistics > > during the next CHECK_FOR_INTERRUPTS call. > > I remember waiting for dumping memory contexts stats could cause trouble > considering some erroneous cases. > > For example, just after the target process finished dumping stats, > pg_get_remote_backend_memory_contexts() caller is terminated before > reading the stats, calling pg_get_remote_backend_memory_contexts() has > no response any more: > > [session1]$ psql > (40699)=# > > $ kill -s SIGSTOP 40699 > > [session2] psql >(40866)=# select * FROM > pg_get_remote_backend_memory_contexts('40699', false); -- waiting > > $ kill -s SIGSTOP 40866 > > $ kill -s SIGCONT 40699 > > [session3] psql > (47656) $ select pg_terminate_backend(40866); > > $ kill -s SIGCONT 40866 -- session2 terminated > > [session3] (47656)=# select * FROM > pg_get_remote_backend_memory_contexts('47656', false); -- no response > > It seems the reason is memCtxState->in_use is now and > memCtxState->proc_id is 40699. > We can continue to use pg_get_remote_backend_memory_contexts() after > specifying 40699, but it'd be hard to understand for users. > > Thanks for testing and reporting. While I am not able to reproduce this problem, I think this may be happening because the requesting backend/caller is terminated before it gets a chance to mark memCtxState->in_use as false. In this case memCtxState->in_use should be marked as 'false' possibly during the processing of ProcDiePending in ProcessInterrupts(). > This approach facilitates on-demand publication of memory statistics > > for a specific backend, rather than collecting them at regular > > intervals. > > Since past memory context statistics may no longer be relevant, > > there is little value in retaining historical data. Any collected > > statistics > > can be discarded once read by the client backend. 
> > > > A fixed-size shared memory block, currently accommodating 30 records, > > is used to store the statistics. This number was chosen arbitrarily, > > as it covers all parent contexts at level 1 (i.e., direct children of > > the top memory context) > > based on my tests. > > Further experiments are needed to determine the optimal number > > for summarizing memory statistics. > > > > Any additional statistics that exceed the shared memory capacity > > are written to a file per backend in the PG_TEMP_FILES_DIR. The client > > backend > > first reads from the shared memory, and if necessary, retrieves the > > remaining data from the file, > > combining everything into a unified view. The files are cleaned up > > automatically > > if a backend crashes or during server restarts. > > > > The statistics are reported in a breadth-first search order of the > > memory context tree, > > with parent contexts reported before their children. This provides a > > cumulative summary > > before diving into the details of each child context's consumption. > > > > The rationale behind the shared memory chunk is to ensure that the > > majority of contexts which are the direct children of > > TopMemoryContext, > > fit into memory > > This allows a client to request a summary of memory statistics, > > which can be served from memory without the overhead of file access, > > unless necessary. > > > > A publi
Re: Can rs_cindex be < 0 for bitmap heap scans?
On Thu, Oct 24, 2024 at 3:45 AM Melanie Plageman wrote: > Hi, > > HeapScanDescData->rs_cindex (the current index into the array of > visible tuples stored in the heap scan descriptor) is used for > multiple scan types, but for bitmap heap scans, AFAICT, it should > never be < 0. > > As such, I find this test in heapam_scan_bitmap_next_tuple() pretty > confusing. > > if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples) > > I know it seems innocuous, but I find it pretty distracting. Am I > missing something? Could it somehow be < 0? > > If not, I propose removing that part of the if statement like in the > attached patch. > > You are right it can never be < 0 in this case at least. In fact you don't need to explicitly set it to 0 in initscan[1], because before calling heapam_scan_bitmap_next_tuple() we must call heapam_scan_bitmap_next_block() and this function is initializing this to 0 (hscan->rs_cindex = 0;). Anyway no object even if you prefer to initialize in initscan(), just the point is that we are already doing it for each block before fetching tuples from the block. [1] @@ -378,6 +378,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) ItemPointerSetInvalid(&scan->rs_ctup.t_self); scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + scan->rs_cindex = 0; -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [BUG FIX] Fix validation of COPY options FORCE_NOT_NULL/FORCE_NULL
On Sat, Oct 19, 2024, at 09:52, Joel Jacobson wrote: > However, since my last email, I've found some other problems in this area, > and think we should do a more ambitious improvement, by rearranging the > incorrect options tests into three categories: > > 1. incorrect COPY {FROM|TO} options > 2. incorrect COPY FROM options > 3. incorrect COPY TO options > > Also, I've found two new problems: > > 1. Incorrect options tests are missing for the QUOTE and ESCAPE options. >This was discovered by Jian He in a different thread. > > 2. One of the ON_ERROR incorrect options tests also depends on the order >of checks in ProcessCopyOptions. > > To explain my current focus on the COPY tests, it's because we're currently > working on the new raw format for COPY, and it would be nice to clean up these > tests as a preparatory step first. I realize these suggested changes are out of scope for $subject, which was about a bug fix. Therefore closing this patch and marking it as Committed. /Joel
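A hedged sketch of the kind of missing incorrect-option tests mentioned for QUOTE and ESCAPE; these options are only accepted in CSV mode, so each of these should fail, with the exact error texts left to the server:

COPY (SELECT 1) TO stdout (FORMAT text, QUOTE '"');    -- should fail
COPY (SELECT 1) TO stdout (FORMAT text, ESCAPE '\');   -- should fail
COPY (SELECT 1) TO stdout (FORMAT binary, QUOTE '"');  -- should fail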
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Tender Wang 于2024年10月23日周三 21:48写道: > Hi all, > > I find another issue as $SUBJECT when I work on [1]. > When I continue to work on this, I find below issue. But I'm not sure whether it is a bug. postgres=# create table part_index(a text primary key) partition by list ( a collate "POSIX"); ERROR: unique constraint on partitioned table must include all partitioning columns DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which is part of the partition key. postgres=# create table part_index(a text) partition by list ( a collate "POSIX"); CREATE TABLE postgres=# alter table part_index add primary key (a); ERROR: unique constraint on partitioned table must include all partitioning columns DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which is part of the partition key. It seems we can't create a primary key if the collation is different between columnDef and PartitionKey. By the way, I think the error message is misleading to users. ostgres=# alter table part_index add primary key (a); ERROR: unique constraint on partitioned table must include all partitioning columns DETAIL: PRIMARY KEY constraint on table "part_index" lacks column "a" which is part of the partition key. -- Thanks, Tender Wang
Re: Using Expanded Objects other than Arrays from plpgsql
Michel Pelletier writes: > On Wed, Oct 23, 2024 at 8:21 AM Tom Lane wrote: >> Another thing that confuses me is why there's a second flatten_matrix >> operation happening here. Shouldn't set_element return its result >> as a R/W expanded object? > That confuses me too, and my default assumption is always that I'm doing it > wrong. set_element does return a R/W object afaict, here is the return: > https://github.com/OneSparse/OneSparse/blob/main/src/matrix.c#L1726 Hmph. That seems right. Can you add errbacktrace() to your logging ereports, in hopes of seeing how we're getting to flatten_matrix? Or break there with gdb for a more complete/reliable stack trace. regards, tom lane
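A minimal sketch of the suggestion, assuming the logging ereports live in the OneSparse matrix code; errbacktrace() attaches a stack trace to the log entry, so the path into flatten_matrix can be seen without attaching a debugger:

ereport(LOG,
        (errmsg("flatten_matrix called"),   /* message text is illustrative */
         errbacktrace()));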