Re: pg_rewind: Skip log directory for file type check like pg_wal
On Tue, 7 Mar 2023 at 08:52, Soumyadeep Chakraborty < soumyadeep2...@gmail.com> wrote: > > > It couldn't be achieved just by introducing a static string "log". The > "log_directory" GUC must be examined on both, source and target. > > Trouble with doing that is if pg_rewind is run in non-libpq (offline) > mode. Then we would have to parse it out of the conf file(s)? > Is there a standard way of doing that? > pg_rewind is already doing something similar for "restore_command": /* * Get value of GUC parameter restore_command from the target cluster. * * This uses a logic based on "postgres -C" to get the value from the * cluster. */ static void getRestoreCommand(const char *argv0) For the running source cluster one could just use "SHOW log_directory" Regards, -- Alexander Kukushkin
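For the online case, reading the setting from the running source cluster is straightforward, for example:

    SHOW log_directory;
    -- or equivalently:
    SELECT setting FROM pg_settings WHERE name = 'log_directory';

For the offline target, the same "postgres -C"-style logic already used for restore_command would presumably have to be reused to resolve log_directory from the configuration files.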
Re: using memoize in in paralel query decreases performance
po 6. 3. 2023 v 22:52 odesílatel David Rowley napsal: > On Mon, 6 Mar 2023 at 21:55, Pavel Stehule > wrote: > > default https://explain.depesz.com/s/fnBe > > It looks like the slowness is coming from the Bitmap Index scan and > Bitmap heap scan rather than Memoize. > >-> Bitmap Heap Scan on public.item i (cost=285.69..41952.12 > rows=29021 width=16) (actual time=20.395..591.606 rows=20471 > loops=784) > Output: i.id, i.item_category_id > Recheck Cond: (i.item_category_id = ictc.sub_category_id) > Heap Blocks: exact=1590348 > Worker 0: actual time=20.128..591.426 rows=20471 loops=112 > Worker 1: actual time=20.243..591.627 rows=20471 loops=112 > Worker 2: actual time=20.318..591.660 rows=20471 loops=112 > Worker 3: actual time=21.180..591.644 rows=20471 loops=112 > Worker 4: actual time=20.226..591.357 rows=20471 loops=112 > Worker 5: actual time=20.597..591.418 rows=20471 loops=112 > -> Bitmap Index Scan on ixfk_ite_itemcategoryid > (cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851 > rows=25362 loops=784) >Index Cond: (i.item_category_id = ictc.sub_category_id) >Worker 0: actual time=14.863..14.863 rows=25362 loops=112 >Worker 1: actual time=14.854..14.854 rows=25362 loops=112 >Worker 2: actual time=14.611..14.611 rows=25362 loops=112 >Worker 3: actual time=15.245..15.245 rows=25362 loops=112 >Worker 4: actual time=14.909..14.909 rows=25362 loops=112 >Worker 5: actual time=14.841..14.841 rows=25362 loops=112 > > > disabled memoize https://explain.depesz.com/s/P2rP > > -> Bitmap Heap Scan on public.item i (cost=285.69..41952.12 > rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784) >Output: i.id, i.item_category_id >Recheck Cond: (i.item_category_id = ictc.sub_category_id) >Heap Blocks: exact=1590349 >Worker 0: actual time=9.422..58.420 rows=20471 loops=112 >Worker 1: actual time=9.449..57.539 rows=20471 loops=112 >Worker 2: actual time=9.751..58.129 rows=20471 loops=112 >Worker 3: actual time=9.620..57.484 rows=20471 loops=112 >Worker 4: actual time=8.940..57.911 rows=20471 loops=112 >Worker 5: actual time=9.454..57.488 rows=20471 loops=112 >-> Bitmap Index Scan on ixfk_ite_itemcategoryid > (cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581 > rows=25363 loops=784) > Index Cond: (i.item_category_id = ictc.sub_category_id) > Worker 0: actual time=4.846..4.846 rows=25363 loops=112 > Worker 1: actual time=4.734..4.734 rows=25363 loops=112 > Worker 2: actual time=4.803..4.803 rows=25363 loops=112 > Worker 3: actual time=4.959..4.959 rows=25363 loops=112 > Worker 4: actual time=4.402..4.402 rows=25363 loops=112 > Worker 5: actual time=4.778..4.778 rows=25363 loops=112 > > I wonder if the additional work_mem required for Memoize is just doing > something like causing kernel page cache evictions and leading to > fewer buffers for ixfk_ite_itemcategoryid and the item table being > cached in the kernel page cache. > > Maybe you could get an idea of that if you SET track_io_timing = on; > and EXPLAIN (ANALYZE, BUFFERS) both queries. > https://explain.depesz.com/s/vhk0 https://explain.depesz.com/s/R5ju Regards Pavel > David >
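For reference, a minimal sketch of the measurement being suggested above; the final query is only a stand-in for the actual query under test:

    SET track_io_timing = on;
    SET enable_memoize = on;          -- repeat the run with this set to off for comparison
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM pg_class;    -- placeholder for the real query

With track_io_timing enabled, the BUFFERS output includes an "I/O Timings" line, which is what the per-read analysis further down the thread relies on.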
Re: Raising the SCRAM iteration count
> On 7 Mar 2023, at 05:53, Michael Paquier wrote: > > On Fri, Mar 03, 2023 at 11:13:36PM +0100, Daniel Gustafsson wrote: >> That would indeed be nice, but is there a way to do this without a >> complicated >> pump TAP expression? I was unable to think of a way but I might be missing >> something? > > A SET command refreshes immediately the cache information of the > connection in pqSaveParameterStatus()@libpq, so a test in password.sql > with \password would be enough to check the computation happens in > pg_fe_scram_build_secret() with the correct iteration number. Say > like: > =# SET scram_iterations = 234; > SET > =# \password > Enter new password for user "postgres": TYPEME > Enter it again: TYPEME > =# select substr(rolpassword, 1, 18) from pg_authid > where oid::regrole::name = current_role; > substr > > SCRAM-SHA-256$234: > (1 row) > > Or perhaps I am missing something? Right, what I meant was: can a pg_regress sql/expected test drive a psql interactive prompt? Your comments suggested using password.sql so I was curious if I was missing a neat trick for doing this. -- Daniel Gustafsson
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> > FWIW, Greenplum has a similar construct (but which also logs the errors > in the > db) where data type errors are skipped as long as the number of errors > don't > exceed a reject limit. If the reject limit is reached then the COPY > fails: > > > > LOG ERRORS [ SEGMENT REJECT LIMIT [ ROWS | PERCENT ]] > > > IIRC the gist of this was to catch then the user copies the wrong input > data or > plain has a broken file. Rather than finding out after copying n rows > which > are likely to be garbage the process can be restarted. > I think this is a matter for discussion. The same question is: "Where to log errors to separate files or to the system logfile?". IMO it's better for users to log short-detailed error message to system logfile and not output errors to the terminal. This version of the patch has a compiler error in the error message: > Yes, corrected it. Changed "ignored_errors" to int64 because "processed" (used for counting copy rows) is int64. I felt just logging "Error: %ld" would make people wonder the meaning of > the %ld. Logging something like ""Error: %ld data type errors were > found" might be clearer. > Thanks. For more clearance change the message to: "Errors were found: %". Regards, Damir Belyalov Postgres Professional diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index c25b52d0cb..706b929947 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -34,6 +34,7 @@ COPY { table_name [ ( format_name FREEZE [ boolean ] +IGNORE_DATATYPE_ERRORS [ boolean ] DELIMITER 'delimiter_character' NULL 'null_string' HEADER [ boolean | MATCH ] @@ -233,6 +234,17 @@ COPY { table_name [ ( + +IGNORE_DATATYPE_ERRORS + + + Drops rows that contain malformed data while copying. These are rows + with columns where the data type's input-function raises an error. + Outputs warnings about rows with incorrect data to system logfile. 
+ + + + DELIMITER diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index e34f583ea7..0334894014 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -410,6 +410,7 @@ ProcessCopyOptions(ParseState *pstate, bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool ignore_datatype_errors_specified= false; ListCell *option; /* Support external use for option sanity checking */ @@ -449,6 +450,13 @@ ProcessCopyOptions(ParseState *pstate, freeze_specified = true; opts_out->freeze = defGetBoolean(defel); } + else if (strcmp(defel->defname, "ignore_datatype_errors") == 0) + { + if (ignore_datatype_errors_specified) +errorConflictingDefElem(defel, pstate); + ignore_datatype_errors_specified = true; + opts_out->ignore_datatype_errors = defGetBoolean(defel); + } else if (strcmp(defel->defname, "delimiter") == 0) { if (opts_out->delim) diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index 29cd1cf4a6..facfc44def 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -949,10 +949,14 @@ CopyFrom(CopyFromState cstate) errcallback.previous = error_context_stack; error_context_stack = &errcallback; + if (cstate->opts.ignore_datatype_errors) + cstate->ignored_errors = 0; + for (;;) { TupleTableSlot *myslot; bool skip_tuple; + ErrorSaveContext escontext = {T_ErrorSaveContext}; CHECK_FOR_INTERRUPTS(); @@ -985,9 +989,26 @@ CopyFrom(CopyFromState cstate) ExecClearTuple(myslot); + if (cstate->opts.ignore_datatype_errors) + { + escontext.details_wanted = true; + cstate->escontext = escontext; + } + /* Directly store the values/nulls array in the slot */ if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull)) + { + if (cstate->opts.ignore_datatype_errors && cstate->ignored_errors > 0) +ereport(WARNING, errmsg("Errors were found: %lld", (long long) cstate->ignored_errors)); break; + } + + /* Soft error occured, skip this tuple */ + if (cstate->escontext.error_occurred) + { + ExecClearTuple(myslot); + continue; + } ExecStoreVirtualTuple(myslot); diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 91b564c2bc..9c36b0dc8b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -70,6 +70,7 @@ #include "libpq/pqformat.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "nodes/miscnodes.h" #include "pgstat.h" #include "port/pg_bswap.h" #include "utils/builtins.h" @@ -938,10 +939,23 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, cstate->cur_attname = NameStr(att->attname); cstate->cur_attval = string; - values[m] = InputFunctionCall(&in_functions[m], - string, - typioparams[m], - att->atttypmod); + + /* If IGNORE_DATATYPE_ERRORS i
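For context, a minimal usage sketch of the option this patch proposes, following the syntax in the copy.sgml hunk above; the table and file names are made up, and the exact WARNING wording is still under discussion downthread:

    CREATE TABLE copy_demo (a int, b date);
    -- Rows whose values make a data type input function fail are skipped
    -- instead of aborting the whole COPY; a WARNING reports how many.
    COPY copy_demo FROM '/tmp/copy_demo.csv'
        WITH (FORMAT csv, IGNORE_DATATYPE_ERRORS true);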
Re: using memoize in in paralel query decreases performance
/On Tue, 7 Mar 2023 at 21:09, Pavel Stehule wrote: > > po 6. 3. 2023 v 22:52 odesílatel David Rowley napsal: >> I wonder if the additional work_mem required for Memoize is just doing >> something like causing kernel page cache evictions and leading to >> fewer buffers for ixfk_ite_itemcategoryid and the item table being >> cached in the kernel page cache. >> >> Maybe you could get an idea of that if you SET track_io_timing = on; >> and EXPLAIN (ANALYZE, BUFFERS) both queries. > > > https://explain.depesz.com/s/vhk0 This is the enable_memoize=on one. The I/O looks like: Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863 I/O Timings: shared/local read=2671836.341 write=1286.869 2671836.341 / 15274264 = ~0.175 ms per read. > https://explain.depesz.com/s/R5ju This is the faster enable_memoize = off one. The I/O looks like: Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625 I/O Timings: shared/local read=1554838.583 write=821.477 1554838.583 / 18541899 = ~0.084 ms per read. That indicates that the enable_memoize=off version is just finding more pages in the kernel's page cache than the slower query. The slower query just appears to be under more memory pressure causing the kernel to have less free memory to cache useful pages. I don't see anything here that indicates any problems with Memoize. Sure the statistics could be better as, ideally, the Memoize wouldn't have happened for the i_2 relation. I don't see anything that indicates any bugs with this, however. It's pretty well known that Memoize puts quite a bit of faith into ndistinct estimates. If it causes too many issues the enable_memoize switch can be turned to off. You might want to consider experimenting with smaller values of work_mem and/or hash_mem_multiplier for this query or just disabling memoize altogether. David
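As an aside, a similar per-read average can be pulled from the cumulative statistics once track_io_timing is on; a rough sketch (blk_read_time is reported in milliseconds):

    SELECT blk_read_time / NULLIF(blks_read, 0) AS avg_ms_per_read
    FROM pg_stat_database
    WHERE datname = current_database();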
Re: [PATCH] Support % wildcard in extension upgrade filenames
On Mon, Mar 06, 2023 at 02:54:35PM -0500, Gregory Stark (as CFM) wrote: > This patch too is conflicting on meson.build. I'm attaching a rebased version to this email. > Maybe it can just be solved with multiple one-line scripts > that call to a master script? Not really, as the problem we are trying to solve is the need to install many files, and the need to foresee any possible future bug-fix release we might want to support upgrades from. PostGIS is already installing zero-line scripts to upgrade from to a virtual "ANY" version which we then use to have a single ANY-- upgrade path, but we are still REQUIRED to install a file for any we want to allow upgrade from, which is what this patch is aimed at fixing. --strk; >From ffefd039f24e3def8d2216b3ac7875d0b6feb8fb Mon Sep 17 00:00:00 2001 From: Sandro Santilli Date: Wed, 14 Sep 2022 11:10:10 +0200 Subject: [PATCH v1] Allow wildcard (%) in extension upgrade paths A wildcard character "%" will be accepted in the "source" side of the upgrade script and be considered usable to upgrade any version to the "target" side. Includes regression test and documentation. --- doc/src/sgml/extend.sgml | 8 src/backend/commands/extension.c | 42 --- src/test/modules/test_extensions/Makefile | 6 ++- .../expected/test_extensions.out | 15 +++ src/test/modules/test_extensions/meson.build | 3 ++ .../test_extensions/sql/test_extensions.sql | 7 .../test_ext_wildcard1--%--2.0.sql| 6 +++ .../test_ext_wildcard1--1.0.sql | 6 +++ .../test_ext_wildcard1.control| 3 ++ 9 files changed, 88 insertions(+), 8 deletions(-) create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index b70cbe83ae..f1f0ae1244 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -1081,6 +1081,14 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr 1.1). + + The literal value % can be used as the + old_version component in an extension + update script for it to match any version. Such wildcard update + scripts will only be used when no explicit path is found from + old to target version. + + Given that a suitable update script is available, the command ALTER EXTENSION UPDATE will update an installed extension diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 02ff4a9a7f..6df0fd403a 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -130,6 +130,7 @@ static void ApplyExtensionUpdates(Oid extensionOid, bool cascade, bool is_create); static char *read_whole_file(const char *filename, int *length); +static bool file_exists(const char *name); /* @@ -893,7 +894,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, if (from_version == NULL) elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version); else + { + if ( ! 
file_exists(filename) ) + { + /* if filename does not exist, try wildcard */ + filename = get_extension_script_filename(control, "%", version); + } elog(DEBUG1, "executing extension script for \"%s\" update from version '%s' to '%s'", control->name, from_version, version); + } /* * If installing a trusted extension on behalf of a non-superuser, become @@ -1217,14 +1225,19 @@ identify_update_path(ExtensionControlFile *control, /* Find shortest path */ result = find_update_path(evi_list, evi_start, evi_target, false, false); + if (result != NIL) + return result; - if (result == NIL) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"", - control->name, oldVersion, newVersion))); + /* Find wildcard path, if no explicit path was found */ + evi_start = get_ext_ver_info("%", &evi_list); + result = find_update_path(evi_list, evi_start, evi_target, false, false); + if (result != NIL) + return result; - return result; + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"", + control->name, oldVersion, newVersion))); } /* @@ -3395,3 +3408,20 @@ read_whole_file(const char *filename, int *length) buf[*length] = '\0'; return buf; } + +static bool +file_exists(const char *name) +{ + struct stat st; + + Assert(name != NULL); + + if (stat(name, &st) == 0) + return !S_ISDIR(st.st_mode); + else if (!(errno == ENOENT || errno == ENOTDIR || errno == EACCES)) + ereport(ERROR, +(errcode_for_file_access(), + errm
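Based on the regression test files the patch adds (test_ext_wildcard1--1.0.sql and test_ext_wildcard1--%--2.0.sql), the behaviour it enables looks roughly like this:

    CREATE EXTENSION test_ext_wildcard1 VERSION '1.0';
    -- No test_ext_wildcard1--1.0--2.0.sql script is installed, so the wildcard
    -- script test_ext_wildcard1--%--2.0.sql is used as the upgrade path.
    ALTER EXTENSION test_ext_wildcard1 UPDATE TO '2.0';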
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > Let me give an example to demonstrate why I thought something is fishy here: > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > Imagine the same rel has a PRIMARY KEY with Oid=. > > --- > > +/* > + * Get replica identity index or if it is not defined a primary key. > + * > + * If neither is defined, returns InvalidOid > + */ > +Oid > +GetRelationIdentityOrPK(Relation rel) > +{ > + Oid idxoid; > + > + idxoid = RelationGetReplicaIndex(rel); > + > + if (!OidIsValid(idxoid)) > + idxoid = RelationGetPrimaryKeyIndex(rel); > + > + return idxoid; > +} > + > +/* > + * Given a relation and OID of an index, returns true if the > + * index is relation's replica identity index or relation's > + * primary key's index. > + * > + * Returns false otherwise. > + */ > +bool > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > +{ > + Assert(OidIsValid(idxoid)); > + > + return GetRelationIdentityOrPK(rel) == idxoid; > +} > + > > > So, according to the above function comment/name, I will expect > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > both return true, right? > > But AFAICT > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > returns (the valid oid of the RI) --> == --> true; > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > returns (the valid oid of the RI) --> == --> false; > > ~ > > Now two people are telling me this is OK, but I've been staring at it > for too long and I just don't see how it can be. (??) > The difference is that you are misunderstanding the intent of this function. GetRelationIdentityOrPK() returns a "replica identity index oid" if the same is defined, else return PK, if that is defined, otherwise, return invalidOid. This is what is expected by its callers. Now, one can argue to have a different function name and that may be a valid argument but as far as I can see the function does what is expected from it. -- With Regards, Amit Kapila.
Re: Add pg_walinspect function with block info columns
At Tue, 7 Mar 2023 16:18:21 +0900, Michael Paquier wrote in > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. # The naming convention looks inconsistent between # pg_get_wal_records_info and pg_get_wal_block_info but it's not an # issue of this patch.. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? I thought that we needed a test for block data when I saw the patch. I don't have a great idea, but a single insert should work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: using memoize in in paralel query decreases performance
út 7. 3. 2023 v 9:58 odesílatel David Rowley napsal: > /On Tue, 7 Mar 2023 at 21:09, Pavel Stehule > wrote: > > > > po 6. 3. 2023 v 22:52 odesílatel David Rowley > napsal: > >> I wonder if the additional work_mem required for Memoize is just doing > >> something like causing kernel page cache evictions and leading to > >> fewer buffers for ixfk_ite_itemcategoryid and the item table being > >> cached in the kernel page cache. > >> > >> Maybe you could get an idea of that if you SET track_io_timing = on; > >> and EXPLAIN (ANALYZE, BUFFERS) both queries. > > > > > > https://explain.depesz.com/s/vhk0 > > This is the enable_memoize=on one. The I/O looks like: > > Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863 > I/O Timings: shared/local read=2671836.341 write=1286.869 > > 2671836.341 / 15274264 = ~0.175 ms per read. > > > https://explain.depesz.com/s/R5ju > > This is the faster enable_memoize = off one. The I/O looks like: > > Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625 > I/O Timings: shared/local read=1554838.583 write=821.477 > > 1554838.583 / 18541899 = ~0.084 ms per read. > > That indicates that the enable_memoize=off version is just finding > more pages in the kernel's page cache than the slower query. The > slower query just appears to be under more memory pressure causing the > kernel to have less free memory to cache useful pages. I don't see > anything here that indicates any problems with Memoize. Sure the > statistics could be better as, ideally, the Memoize wouldn't have > happened for the i_2 relation. I don't see anything that indicates any > bugs with this, however. It's pretty well known that Memoize puts > quite a bit of faith into ndistinct estimates. If it causes too many > issues the enable_memoize switch can be turned to off. > > You might want to consider experimenting with smaller values of > work_mem and/or hash_mem_multiplier for this query or just disabling > memoize altogether. > I can live with it. This is an analytical query and the performance is not too important for us. I was surprised that the performance was about 25% worse, and so the hit ratio was almost zero. I am thinking, but I am not sure if the estimation of the effectiveness of memoization can depend (or should depend) on the number of workers? In this case the number of workers is high. > David >
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> On 7 Mar 2023, at 09:35, Damir Belyalov wrote: > I felt just logging "Error: %ld" would make people wonder the meaning of > the %ld. Logging something like ""Error: %ld data type errors were > found" might be clearer. > > Thanks. For more clearance change the message to: "Errors were found: %". I'm not convinced that this adds enough clarity to assist the user. We also shouldn't use "error" in a WARNING log since the user has explicitly asked to skip rows on error, so it's not an error per se. How about something like: ereport(WARNING, (errmsg("%ld rows were skipped due to data type incompatibility", cstate->ignored_errors), errhint("Skipped rows can be inspected in the database log for reprocessing."))); -- Daniel Gustafsson
Re: pg_rewind: Skip log directory for file type check like pg_wal
> On 7 Mar 2023, at 08:33, Alexander Kukushkin wrote: > The "log_directory" GUC must be examined on both, source and target. Agreed, log_directory must be resolved to the configured values. Teaching pg_rewind about those in case they are stored in $PGDATA sounds like a good idea though. -- Daniel Gustafsson
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 8:01 PM Amit Kapila wrote: > > On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > > > Let me give an example to demonstrate why I thought something is fishy here: > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > --- > > > > +/* > > + * Get replica identity index or if it is not defined a primary key. > > + * > > + * If neither is defined, returns InvalidOid > > + */ > > +Oid > > +GetRelationIdentityOrPK(Relation rel) > > +{ > > + Oid idxoid; > > + > > + idxoid = RelationGetReplicaIndex(rel); > > + > > + if (!OidIsValid(idxoid)) > > + idxoid = RelationGetPrimaryKeyIndex(rel); > > + > > + return idxoid; > > +} > > + > > +/* > > + * Given a relation and OID of an index, returns true if the > > + * index is relation's replica identity index or relation's > > + * primary key's index. > > + * > > + * Returns false otherwise. > > + */ > > +bool > > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > > +{ > > + Assert(OidIsValid(idxoid)); > > + > > + return GetRelationIdentityOrPK(rel) == idxoid; > > +} > > + > > > > > > So, according to the above function comment/name, I will expect > > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > > both return true, right? > > > > But AFAICT > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > returns (the valid oid of the RI) --> == --> true; > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > returns (the valid oid of the RI) --> == --> false; > > > > ~ > > > > Now two people are telling me this is OK, but I've been staring at it > > for too long and I just don't see how it can be. (??) > > > > The difference is that you are misunderstanding the intent of this > function. GetRelationIdentityOrPK() returns a "replica identity index > oid" if the same is defined, else return PK, if that is defined, > otherwise, return invalidOid. This is what is expected by its callers. > Now, one can argue to have a different function name and that may be a > valid argument but as far as I can see the function does what is > expected from it. > Sure, but I am questioning the function IdxIsRelationIdentityOrPK, not GetRelationIdentityOrPK. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: using memoize in in paralel query decreases performance
On Tue, 7 Mar 2023 at 22:09, Pavel Stehule wrote: > I can live with it. This is an analytical query and the performance is not > too important for us. I was surprised that the performance was about 25% > worse, and so the hit ratio was almost zero. I am thinking, but I am not sure > if the estimation of the effectiveness of memoization can depend (or should > depend) on the number of workers? In this case the number of workers is high. The costing for Memoize takes the number of workers into account by way of the change in expected input rows. The number of estimated input rows is effectively just divided by the number of parallel workers, so if we expect 1 million rows from the outer side of the join and 4 workers, then we'll assume the memorize will deal with 250,000 rows per worker. If the n_distinct estimate for the cache key is 500,000, then it's not going to look very attractive to Memoize that. In reality, estimate_num_groups() won't say the number of groups is higher than the input rows, but Memoize, with all the other overheads factored into the costs, it would never look favourable if the planner thought there was never going to be any repeated values. The expected cache hit ratio there would be zero. David
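To see what ndistinct estimate the planner is working from for the cache key in this plan, something like the following can help; the table and column names are taken from the plans upthread:

    SELECT n_distinct
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'item'
      AND attname = 'item_category_id';
    -- Negative values are a fraction of the row count. If the estimate is badly
    -- off, it can be overridden with
    -- ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...).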
Re: using memoize in in paralel query decreases performance
út 7. 3. 2023 v 10:46 odesílatel David Rowley napsal: > On Tue, 7 Mar 2023 at 22:09, Pavel Stehule > wrote: > > I can live with it. This is an analytical query and the performance is > not too important for us. I was surprised that the performance was about > 25% worse, and so the hit ratio was almost zero. I am thinking, but I am > not sure if the estimation of the effectiveness of memoization can depend > (or should depend) on the number of workers? In this case the number of > workers is high. > > The costing for Memoize takes the number of workers into account by > way of the change in expected input rows. The number of estimated > input rows is effectively just divided by the number of parallel > workers, so if we expect 1 million rows from the outer side of the > join and 4 workers, then we'll assume the memorize will deal with > 250,000 rows per worker. If the n_distinct estimate for the cache key > is 500,000, then it's not going to look very attractive to Memoize > that. In reality, estimate_num_groups() won't say the number of > groups is higher than the input rows, but Memoize, with all the other > overheads factored into the costs, it would never look favourable if > the planner thought there was never going to be any repeated values. > The expected cache hit ratio there would be zero. > Thanks for the explanation. Pavel > David >
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On Sun, 5 Mar 2023 at 13:21, Lukas Fittl wrote: > Alternatively (or in addition) we could consider showing the "ndistinct" > value that is calculated in cost_memoize_rescan - since that's the most > significant contributor to the cache hit ratio (and you can influence that > directly by improving the ndistinct statistics). I think the ndistinct estimate plus the est_entries together would be useful. I think showing just the hit ratio number might often just raise too many questions about how that's calculated. To calculate the hit ratio we need to estimate the number of entries that can be kept in the cache at once and also the number of input rows and the number of distinct values. We can see the input rows by looking at the outer side of the join in EXPLAIN, but we've no idea about the ndistinct or how many items the planner thought could be kept in the cache at once. The plan node already has est_entries, so it should just be a matter of storing the ndistinct estimate in the Path and putting it into the Plan node so the executor has access to it during EXPLAIN. David
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On 2023-Mar-07, Julien Rouhaud wrote: > I registered lapwing as a 32b Debian 7 so I thought it would be expected to > keep it as-is rather than upgrading to all newer major Debian versions, > especially since there were newer debian animal registered (no 32b though > AFAICS). I'm not opposed to upgrading it but I think there's still value in > having somewhat old packages versions being tested, especially since there > isn't much 32b coverage of those. I would be happy to register a newer 32b > version, or even sid, if needed but the -m32 part on the CI makes me think > there isn't much value doing that now. I think a pure 32bit animal running contemporary Debian would be better than just ditching the animal completely, as would appear to be the alternative, precisely because we have no other 32bit machine running x86 Linux. Maybe you can have *two* animals on the same machine: one running the old Debian without -Werror, and the one with new Debian and that flag kept. I think CI is not a replacement for the buildfarm. It helps catch some problems earlier, but we shouldn't think that we no longer need some buildfarm animals because CI runs those configs. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add pg_walinspect function with block info columns
On Tue, Mar 7, 2023 at 12:48 PM Michael Paquier wrote: > > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. +1. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? That would be a lot better. Not just the test, but also the documentation can have it. Simple way to generate such a record (both block data and FPI) is to just change the wal_level to logical in walinspect.conf [1], see code around REGBUF_KEEP_DATA and RelationIsLogicallyLogged in heapam.c I had the following comments and fixed them in the attached v2 patch: 1. Still a trace of pg_get_wal_fpi_info in docs, removed it. 2. Used int4 instead of int for fpilen just to be in sync with fpi_length of pg_get_wal_record_info. 3. Changed to be consistent and use just FPI or "F/full page". /* FPI flags */ /* No full page image, so store NULLs for all its fields */ /* Full-page image */ /* Full page exists, so let's save it. */ * and end LSNs. This produces information about the full page images with * to a record. Decompression is applied to the full-page images, if 4. I think we need to free raw_data, raw_page and flags as we loop over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can be a problem if we have many blocks assocated with a single WAL record. flags = (Datum *) palloc0(sizeof(Datum) * bitcnt); Also, we will leak all CStringGetTextDatum memory in the block_id for loop. Another way is to use and reset temp memory context in the for loop over block_ids. I prefer this approach over multiple pfree()s in block_id for loop. 5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so I changed it to the following. Feel free to ignore this if you think it's not required. if (blk->apply_image) flags[cnt++] = CStringGetTextDatum("APPLY"); else flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION"); 6. Did minor wordsmithing. 7. Added test case which shows both block data and fpi in the documentation. 8. Changed wal_level to logical in walinspect.conf to test case with block data. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v2-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patch Description: Binary data
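A sketch of how such a record could be generated for the test or the docs, assuming wal_level = logical as in point 8 (so the INSERT carries block data) and a preceding CHECKPOINT (so it also carries a full-page image); the function is assumed to take the same (start_lsn, end_lsn) arguments as pg_get_wal_records_info:

    CREATE TABLE sample_tbl (col1 int);
    CHECKPOINT;                                    -- the next page change gets an FPI
    SELECT pg_current_wal_lsn() AS wal_lsn1 \gset
    INSERT INTO sample_tbl VALUES (1);             -- heap INSERT record with block data
    SELECT * FROM pg_get_wal_block_info(:'wal_lsn1', pg_current_wal_lsn());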
Re: RFC: logical publication via inheritance root?
Hi Jacob, > I'm going to register this in CF for feedback. Many thanks for the updated patch. Despite the fact that the patch is still work in progress all in all it looks very good to me. So far I only have a couple of nitpicks, mostly regarding the code coverage [1]: ``` +tablename = get_rel_name(tableoid); +if (tablename == NULL) +ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_TABLE), + errmsg("OID %u does not refer to a table", tableoid))); +rootname = get_rel_name(rootoid); +if (rootname == NULL) +ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_TABLE), + errmsg("OID %u does not refer to a table", rootoid))); ``` ``` +res = walrcv_exec(LogRepWorkerWalRcvConn, cmd.data, + lengthof(descRow), descRow); + +if (res->status != WALRCV_OK_TUPLES) +ereport(ERROR, +(errmsg("could not fetch logical descendants for table \"%s.%s\" from publisher: %s", +nspname, relname, res->err))); ``` ``` +res = walrcv_exec(LogRepWorkerWalRcvConn, cmd.data, 0, NULL); +pfree(cmd.data); +if (res->status != WALRCV_OK_COPY_OUT) +ereport(ERROR, +(errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("could not start initial contents copy for table \"%s.%s\" from remote %s: %s", +lrel.nspname, lrel.relname, quoted_name, res->err))); ``` These new ereport() paths are never executed when we run the tests. I'm not 100% sure if they are "should never happen in practice" cases or not. If they are, I suggest adding corresponding comments. Otherwise we have to test these paths. ``` +else +{ +/* For older servers, we only COPY the table itself. */ +char *quoted = quote_qualified_identifier(lrel->nspname, +lrel->relname); +*to_copy = lappend(*to_copy, quoted); +} ``` Also we have to be extra careful with this code path because it is not test-covered too. ``` +Datum +pg_get_publication_rels_to_sync(PG_FUNCTION_ARGS) +{ +#define NUM_SYNC_TABLES_ELEM 1 ``` What is this macro for? ``` +{ oid => '8137', descr => 'get list of tables to copy during initial sync', + proname => 'pg_get_publication_rels_to_sync', prorows => '10', proretset => 't', + provolatile => 's', prorettype => 'regclass', proargtypes => 'regclass text', + proargnames => '{rootid,pubname}', + prosrc => 'pg_get_publication_rels_to_sync' }, ``` Something seems odd here. Is there a chance that it can return different results even within one statement, especially considering the fact that pg_set_logical_root() is VOLATILE? Maybe pg_get_publication_rels_to_sync() should be VOLATILE too [2]. ``` +{ oid => '8136', descr => 'mark a table root for logical replication', + proname => 'pg_set_logical_root', provolatile => 'v', proparallel => 'u', + prorettype => 'void', proargtypes => 'regclass regclass', + prosrc => 'pg_set_logical_root' }, ``` Shouldn't we also have pg_unset(reset?)_logical_root? [1]: https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh [2]: https://www.postgresql.org/docs/current/xfunc-volatility.html -- Best regards, Aleksander Alekseev
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Andres, Amit, all > I think the case in which the patch regresses performance in is irrelevant > in practice. This is similar to what I think in this context. I appreciate the effort from Shi Yu, so that we have a clear understanding of the overhead. But the tests we do in [1], where we observe the regression, are largely synthetic test cases that aim to spot the overhead. > And having an index over thousands > of non-differing values will generally perform badly, not just in this > context. Similarly, maybe there are some eccentric use patterns that might follow this. But even if there are such patterns, I suspect they could hardly be performance sensitive. Thanks, Onder KALACI [1] https://www.postgresql.org/message-id/OSZPR01MB63103A4AFBBA56BAF8AE7FAAFDA39%40OSZPR01MB6310.jpnprd01.prod.outlook.com
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Shi Yu, all > Thanks for updating the patch. Here are some comments on v33-0001 patch. > > 1. > + if (RelationReplicaIdentityFullIndexScanEnabled(localrel) && > + remoterel->replident == REPLICA_IDENTITY_FULL) > > RelationReplicaIdentityFullIndexScanEnabled() is introduced in 0002 patch, > so > the call to it should be moved to 0002 patch I think. > Ah, sure, a rebase issue, fixed in v34 > > 2. > +#include "optimizer/cost.h" > > Do we need this in the latest patch? I tried and it looks it could be > removed > from src/backend/replication/logical/relation.c. > Hmm, probably an artifact of the initial versions of the patch where we needed some costing functionality. > > 3. > +# now, create a unique index and set the replica > +$node_publisher->safe_psql('postgres', > + "CREATE UNIQUE INDEX test_replica_id_full_unique ON > test_replica_id_full(x);"); > +$node_subscriber->safe_psql('postgres', > + "CREATE UNIQUE INDEX test_replica_id_full_unique ON > test_replica_id_full(x);"); > + > > Should the comment be "now, create a unique index and set the replica > identity"? > yes, fixed > > 4. > +$node_publisher->safe_psql('postgres', > + "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX > test_replica_id_full_unique;"); > +$node_subscriber->safe_psql('postgres', > + "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX > test_replica_id_full_unique;"); > + > +# wait for the synchronization to finish > +$node_subscriber->wait_for_subscription_sync; > > There's no new tables to need to be synchronized here, should we remove > the call > to wait_for_subscription_sync? > right, probably a copy & paste typo, thanks for spotting. I'll attach v34 with the next e-mail given the comments here only touch small parts of the patch. Thanks, Onder KALACI
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Amit, all > > Few comments: > = > 1. > +get_usable_indexoid(ApplyExecutionData *edata, ResultRelInfo *relinfo) > { > ... > + if (targetrelkind == RELKIND_PARTITIONED_TABLE) > + { > + /* Target is a partitioned table, so find relmapentry of the partition */ > + TupleConversionMap *map = ExecGetRootToChildMap(relinfo, edata->estate); > + AttrMap*attrmap = map ? map->attrMap : NULL; > + > + relmapentry = > + logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc, > + attrmap); > > > When will we hit this part of the code? As per my understanding, for > partitioned tables, we always follow apply_handle_tuple_routing() > which should call logicalrep_partition_open(), and do the necessary > work for us. > > Looking closer, there is really no need for that. I changed the code such that we pass usableLocalIndexOid. It looks simpler to me. Do you agree? > 2. In logicalrep_partition_open(), it would be better to set > localrelvalid after finding the required index. The entry should be > marked valid after initializing/updating all the required members. I > have changed this in the attached. > > makes sense > 3. > @@ -31,6 +32,7 @@ typedef struct LogicalRepRelMapEntry > Relation localrel; /* relcache entry (NULL when closed) */ > AttrMap*attrmap; /* map of local attributes to remote ones */ > bool updatable; /* Can apply updates/deletes? */ > + Oid usableIndexOid; /* which index to use, or InvalidOid if none */ > > Would it be better to name this new variable as localindexoid to match > it with the existing variable localreloid? Also, the camel case for > this variable appears odd. > yes, both makes sense > > 4. If you agree with the above, then we should probably change the > name of functions get_usable_indexoid() and > FindLogicalRepUsableIndex() accordingly. > I dropped get_usable_indexoid() as noted in (1). Changed FindLogicalRepUsableIndex->FindLogicalRepLocalIndex > 5. > + { > + /* > + * If we had a primary key or relation identity with a unique index, > + * we would have already found and returned that oid. At this point, > + * the remote relation has replica identity full. > > These comments are not required as this just states what the code just > above is doing. > I don't have any strong opinions on adding this comment, applied your suggestion. > > Apart from the above, I have made some modifications in the other > comments. If you are convinced with those, then kindly include them in > the next version. > > Sure, they all look good. I think I have lost (and caused the reviewers to lose) quite a bit of time on the comment reviews. Next time, I'll try to be more prepared for the comments. Attached v34 Thanks, Onder KALACI v34_0001_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data v34_0002_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data
RE: The order of queues in row lock is changed (not FIFO)
From: Tom Lane > I don't see a bug here, or at least I'm not willing to move the goalposts to > where you want them to be. > I believe that we do guarantee arrival-order locking of individual tuple > versions. However, in the > example you show, a single row is being updated over and over. So, initially > we have a single "winner" > transaction that got the tuple lock first and updated the row. When it > commits, each other transaction > serially comes off the wait queue for that tuple lock and discovers that it > now needs a lock on a > different tuple version than it has got. > So it tries to get lock on whichever is the latest tuple version. > That might still appear serial as far as the original 100 sessions go, > because they were all queued on the > same tuple lock to start with. > But when the new sessions come in, they effectively line-jump because they > will initially try to lock > whichever tuple version is committed live at that instant, and thus they get > ahead of whichever remain of > the original 100 sessions for the lock on that tuple version (since those are > all still blocked on some older > tuple version, whose lock is held by whichever session is performing the > next-to-commit update). > I don't see any way to make that more stable that doesn't involve requiring > sessions to take locks on > already-dead-to-them tuples; which sure seems like a nonstarter, not least > because we don't even have a > way to find such tuples. The update chains only link forward not back. Thank you for your reply. When I was doing this test, I confirmed the following two actions. (1) The first 100 sessions are overtaken by the last 10. (2) the order of the preceding 100 sessions changes (1) I was concerned from the user's point of view that the lock order for the same tuple was not preserved. However, as you pointed out, in many cases the order of arrival is guaranteed from the perspective of the tuple. You understand the PostgreSQL architecture and understand that you need to use it. (2) This behavior is rare. Typically, the first session gets AccessExclusiveLock to the tuple and ShareLock to the transaction ID. Subsequent sessions will wait for AccessExclusiveLock to the tuple. However, we ignored AccessExclusiveLock in the tuple from the log and observed multiple sessions waiting for ShareLock to the transaction ID. The log shows that the order of the original 100 sessions has been changed due to the above movement. At first, I thought both (1) and (2) were obstacles. However, I understood from your indication that (1) is not a bug. I would be grateful if you could also give me your opinion on (2). Share the following logs: [Log] 1. ShareLock has one wait, the rest is in AccessExclusiveLock 1-1. Only 1369555 is aligned with ShareLock, the transaction ID obtained by 1369547, and the rest with AccessExclusiveLock, the tuple obtained by 1369555. This is similar to a pattern in which no updates have occurred to the tuple. -- 2022-10-26 01:20:08.881 EDT [1369555:19:0] LOG: process 1369555 still waiting for ShareLock on transaction 2501 after 10.072 ms 2022-10-26 01:20:08.881 EDT [1369555:20:0] DETAIL: Process holding the lock: 1369547. Wait queue: 1369555. 〜 2022-10-26 01:21:58.918 EDT [1369898:17:0] LOG: process 1369898 acquired AccessExclusiveLock on tuple (1, 0) of relation 16546 of database 13779 after 10.321 ms 2022-10-26 01:21:58.918 EDT [1369898:18:0] DETAIL: Process holding the lock: 1369555. Wait queue: 1369558, 1369561, 1369564, 1369567, 1369570, 1369573, 1369576, ... -- 2. 
All processes wait with ShareLock 2-1. With 1369558 holding the t1 (0, 4) lock, the queue head is 1369561. -- 2022-10-26 01:22:27.230 EDT [1369623:46:2525] LOG: process 1369623 still waiting for ShareLock on transaction 2504 after 10.133 msprocess 1369623 still waiting for ShareLock on transaction 2504 after 10.133 ms 2022-10-26 01:22:27.242 EDT [1369877:47:2604] DETAIL: Process holding the lock: 1369558. Wait queue: 1369561, 1369623, 1369626, ... -- 2-2. When 1369558 locks are released, the first 1369561 in the Wait queue was expected to acquire the lock, but the process actually acquired 1369787 -- 2022-10-26 01:22:28.237 EDT [1369623:63:2525] LOG: process 1369623 still waiting for ShareLock on transaction 2577 after 10.028 ms 2022-10-26 01:22:28.237 EDT [1369623:64:2525] DETAIL: Process holding the lock: 1369787. Wait queue: 1369623, 1369610, 1369614, 1369617, 1369620. -- 2-3. Checking that the 1369561 is rearranging. -- 2022-10-26 01:22:28.237 E
some problem explicit_bzero with building PostgreSQL on linux
Hi, we have a problem building PostgreSQL (version 15.1) on Linux. ldd --version returns ldd (Debian GLIBC 2.31-13+deb11u5.tmw1) 2.31. We can build it all right; however, we want to use the binaries on a different glibc version, so we detect usage of glibc > 2.17 and need to prevent the use of symbols (like explicit_bzero) that did not exist in glibc 2.17. What we see is that even if I comment out the line $as_echo "#define HAVE_EXPLICIT_BZERO 1" >>confdefs.h in configure, we still have a problem: the symbol explicit_bzero is leaked into lib/libpq.so.5.15, bin/postgres, and bin/pg_verifybackup. I was able to verify that HAVE_EXPLICIT_BZERO was not defined in all C files that use explicit_bzero:
./src/interfaces/libpq/fe-connect.c
./src/backend/libpq/be-secure-common.c
./src/common/hmac_openssl.c
./src/common/cryptohash.c
./src/common/cryptohash_openssl.c
./src/common/hmac.c
How can we guarantee that, if HAVE_EXPLICIT_BZERO is not defined, the explicit_bzero function implemented in port/explicit_bzero.c will be used (just like on Darwin or Windows)? Thanks in advance, dm
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Tue, 17 Jan 2023 at 16:33, Aleksander Alekseev wrote: > Hi hackers, > > Maxim, perhaps you could share with us what your reasoning was here? > > I'm really sorry for the late response, but better late than never. Yes, we cannot access shared memory without a lock. In this particular case, we use XidGenLock. That is why we use the lock argument: to take it if it was not taken previously. Actually, we may place assertion in this insist. As for the xid compare: we do not compare xids here, we are checking for wraparound, so, AFAICS, this code is correct. On Mon, 6 Mar 2023 at 22:48, Heikki Linnakangas wrote: > > 1. Use 64 bit page numbers in SLRUs (this patch) > > 2. Use the larger segment file names in async.c, to lift the current 8 > GB limit on the max number of pending notifications. > > 3. Extend pg_multixact so that pg_multixact/members is addressed by > 64-bit offsets. > > 4. Extend pg_subtrans to 64-bits. > > 5. Extend pg_xact to 64-bits. > > > 6. (a bonus thing that I noticed while thinking of pg_xact.) Extend > pg_twophase.c, to use FullTransactionIds. > > Currently, the twophase state files in pg_twophase are named according > to the 32 bit Xid of the transaction. Let's switch to FullTransactionId > there. > > ... > > I propose that we try to finish 1 and 2 for v16. And maybe 6. I think > that's doable. It doesn't have any great user-visible benefits yet, but > we need to start somewhere. > > - Heikki > > Yes, this is a great idea! My only concern is that we're going in circles here. You see, patch 1 is what was proposed at the beginning of this thread. Anyway, I will be happy if we are able to push this topic forward. As for making pg_multixact 64 bit, I spent the last couple of days making proper pg_upgrade for pg_multixact's and for pg_xact's with wraparound, and I've understood that it is not a simple task compared to pg_xact's. The problem is, we do not have an epoch for multixacts, so we do not have the ability to "overcome" wraparound. The solution may be adding some kind of epoch for multixacts or making them 64 bit in the "main" 64-xid patch, but from the perspective of this thread, in my view, this should be last in line here. In pg_xact we do not have such a problem, we do have an epoch for transactions, so the conversion should be pretty obvious:
0000 -> 00000000
0001 -> 00000001
...
0FFE -> 00000FFE
0FFF -> 00000FFF
0000 -> 00010000
0001 -> 00010001
So, in my view, the plan should be:
1. Use internal 64 bit page numbers in SLRUs without changing segment naming.
2. Use the larger segment file names in async.c, to lift the current 8 GB limit on the max number of pending notifications.
3. Extend pg_xact to 64-bits.
4. Extend pg_subtrans to 64-bits.
5. Extend pg_multixact so that pg_multixact/members is addressed by 64-bit offsets.
6. Extend pg_twophase.c, to use FullTransactionIds. (a bonus thing)
Thoughts? -- Best regards, Maxim Orlov. On Mon, 6 Mar 2023 at 22:48, Heikki Linnakangas wrote: > On 01/03/2023 12:21, Aleksander Alekseev wrote: > > Hi, > > > >> I'm surprised that these patches extend the page numbering to 64 bits, > >> but never actually uses the high bits. The XID "epoch" is not used, and > >> pg_xact still wraps around and the segment names are still reused. I > >> thought we could stop doing that. > > > > To clarify, the idea is to let CLOG grow indefinitely and simply store > > FullTransactionId -> TransactionStatus (two bits). Correct? > > Correct. 
> > > I didn't investigate this in much detail but it may affect quite some > > amount of code since TransactionIdDidCommit() and > > TransactionIdDidCommit() currently both deal with TransactionId, not > > FullTransactionId. IMO, this would be a nice change however, assuming > > we are ready for it. > > Yep, it's a lot of code churn.. > > > In the previous version of the patch there was an attempt to derive > > FullTransactionId from TransactionId but it was wrong for the reasons > > named above in the thread. Thus is was removed and the patch > > simplified. > > Yeah, it's tricky to get it right. Clearly we need to do it at some > point though. > > All in all, this is a big effort. I spent some more time reviewing this > in the last few days, and thought a lot about what the path forward here > could be. And I haven't looked at the actual 64-bit XIDs patch set yet, > just this patch to use 64-bit addressing in SLRUs. > > This patch is the first step, but we have a bit of a chicken and egg > problem, because this patch on its own isn't very interesting, but on > the other hand, we need it to work on the follow up items. Here's how I > see the development path for this (and again, this is just for the > 64-bit SLRUs work, not the bigger 64-bit-XIDs-in-heapam effort): > > 1. Use 64 bit page numbers in SLRUs (this patch) > > I would like to make one change here: I'd prefer to keep the old 4-digit > segment names, until we actually start to use the wider address space. > Let's allow each SLRU to specify how many dig
Re: Add pg_walinspect function with block info columns
On Tue, 7 Mar 2023 at 01:34, Michael Paquier wrote: > > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > >> me as it outputs most of the columns that are already given by > >> pg_get_wal_records_info.What I think the best way at this point is to > >> make it return the following: > >> lsn pg_lsn > >> block_id int8 > >> spcOid oid > >> dbOid oid > >> relNumber oid > >> forkNames text > >> fpi bytea > >> fpi_info text > > I would add the length of the block data (without the hole and > compressed, as the FPI data should always be presented as > uncompressed), and the block data if any (without the block data > length as one can guess it based on the bytea data length). Note that > a block can have both a FPI and some data assigned to it, as far as I > recall. > > > The basic idea is to create a single entrypoint to all relevant data > > from DecodedXLogRecord in SQL, not multiple. > > While I would agree with this principle on simplicity's ground in > terms of minimizing the SQL interface and the pg_wal/ lookups, I > disagree about it on unsability ground, because we can avoid extra SQL > tweaks with more functions. One recent example I have in mind is > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE > on the catalogs. Correct, but in that case the user would build the same query (or at least with the same complexity) as what we're executing under the hood, right? > There are of course various degrees of complexity, > and perhaps unnest() cannot qualify as one, but having two functions > returning normalized records (one for the record information, and a > second for the block information), is a rather good balance between > usability and interface complexity, in my experience. I would agree, if it weren't for the reasons written below. > If you have two > functions, a JOIN is enough to cross-check the block data and the > record data, Joins are expensive on large datasets; and because WAL is one of the largest datasets in the system, why would we want to force the user to JOIN them if we can produce the data in one pre-baked data structure without a need to join? > while an unnest() heavily bloats the main function output > (aka byteas of FPIs in a single array). I don't see how that would be bad. You can select a subset of columns without much issue, which can allow you to ignore any and all bloat. It is also not easy to imagine that we'd have arguments in the function that determine whether it includes the largest fields (main data, blocks, block data, and block images) or leaves them NULL so that we need to pass less data around if the user doesn't want the data. Matthias van de Meent
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 3:00 PM Peter Smith wrote: > > On Tue, Mar 7, 2023 at 8:01 PM Amit Kapila wrote: > > > > On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > > > > > Let me give an example to demonstrate why I thought something is fishy > > > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > > > --- > > > > > > +/* > > > + * Get replica identity index or if it is not defined a primary key. > > > + * > > > + * If neither is defined, returns InvalidOid > > > + */ > > > +Oid > > > +GetRelationIdentityOrPK(Relation rel) > > > +{ > > > + Oid idxoid; > > > + > > > + idxoid = RelationGetReplicaIndex(rel); > > > + > > > + if (!OidIsValid(idxoid)) > > > + idxoid = RelationGetPrimaryKeyIndex(rel); > > > + > > > + return idxoid; > > > +} > > > + > > > +/* > > > + * Given a relation and OID of an index, returns true if the > > > + * index is relation's replica identity index or relation's > > > + * primary key's index. > > > + * > > > + * Returns false otherwise. > > > + */ > > > +bool > > > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > > > +{ > > > + Assert(OidIsValid(idxoid)); > > > + > > > + return GetRelationIdentityOrPK(rel) == idxoid; > > > +} > > > + > > > > > > > > > So, according to the above function comment/name, I will expect > > > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > > > both return true, right? > > > > > > But AFAICT > > > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > > returns (the valid oid of the RI) --> == --> true; > > > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > > returns (the valid oid of the RI) --> == --> false; > > > > > > ~ > > > > > > Now two people are telling me this is OK, but I've been staring at it > > > for too long and I just don't see how it can be. (??) > > > > > > > The difference is that you are misunderstanding the intent of this > > function. GetRelationIdentityOrPK() returns a "replica identity index > > oid" if the same is defined, else return PK, if that is defined, > > otherwise, return invalidOid. This is what is expected by its callers. > > Now, one can argue to have a different function name and that may be a > > valid argument but as far as I can see the function does what is > > expected from it. > > > > Sure, but I am questioning the function IdxIsRelationIdentityOrPK, not > GetRelationIdentityOrPK. > The intent of IdxIsRelationIdentityOrPK() is as follows: Returns true for the following conditions (a) if the given index OID is the same as replica identity (when the same is defined); else (if replica identity is not defined then (b)) (b) if the given OID is the same as PK. Returns false otherwise. Feel free to propose any better function name or if you think comments can be changed which makes it easier to understand. -- With Regards, Amit Kapila.
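To make the intent concrete, a small sketch of the situation being debated; the names are made up:

    CREATE TABLE rel (a int PRIMARY KEY, b int NOT NULL);
    CREATE UNIQUE INDEX rel_b_idx ON rel (b);
    ALTER TABLE rel REPLICA IDENTITY USING INDEX rel_b_idx;
    -- GetRelationIdentityOrPK(rel) now returns rel_b_idx's OID, because the
    -- replica identity index takes precedence over the primary key. Hence
    -- IdxIsRelationIdentityOrPK(rel, <rel_b_idx OID>) returns true, while
    -- IdxIsRelationIdentityOrPK(rel, <PK index OID>) returns false, which is
    -- the behaviour described as intended above.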
Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32
On Thu, Mar 2, 2023 at 8:01 AM Michael Paquier wrote: > > We had better make sure that this does not break again 10260c7, and > these could not be reproduced with automated tests as they needed a > Windows terminal. Isn't this issue like the other commit, where the > automated testing cannot reproduce any of that because it requires a > terminal? If not, could it be possible to add some tests to have some > coverage? The tests of pg_dump in src/bin/pg_dump/t/ invoke the > custom format in a few scenarios already, and these are tested in the > buildfarm for a couple of years now, without failing, but perhaps we'd > need a small tweak to have a reproducible test case for automation? > I've been able to manually reproduce the problem with: pg_dump --format=custom > custom.dump pg_restore --file=toc.txt --list custom.dump pg_dump --format=custom | pg_restore --file=toc.dump --use-list=toc.txt The error I get is: pg_restore: error: unsupported version (0.7) in file header I'm not really sure how to integrate this in a tap test. > The internal implementation of _pgstat64() is used in quite a few > places, so we'd better update this part first, IMO, and then focus on > the pg_dump part. Anyway, it looks like you are right here: there is > nothing for FILE_TYPE_PIPE or FILE_TYPE_CHAR in this WIN32 > implementation of fstat(). > > I am amazed to hear that both ftello64() and fseek64() actually > succeed if you use a pipe: > https://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html > Could it be something we should try to make more portable by ourselves > with a wrapper for these on WIN32? That would not be the first one to > accomodate our code with POSIX, and who knows what code could be broken > because of that, like external extensions that use fseek64() without > knowing it. > The error is reproducible in versions previous to win32stat.c, so that might work as bug fix. > - if (hFile == INVALID_HANDLE_VALUE || buf == NULL) > + if (hFile == INVALID_HANDLE_VALUE || hFile == (HANDLE)-2 || buf == > NULL) > What's the -2 for? Perhaps this should have a comment? > There's a note on _get_osfhandle() [1] about when -2 is returned, but a comment seems appropriate. + fileType = GetFileType(hFile); > + lastError = GetLastError(); > [...] >if (fileType == FILE_TYPE_UNKNOWN && lastError != NO_ERROR) { > + _dosmaperr(lastError); > + return -1; > } > So, the patched code assumes that all the file types classified as > FILE_TYPE_UNKNOWN when GetFileType() does not fail refer to fileno > being either stdin, stderr or stdout. Perhaps we had better > cross-check that fileno points to one of these three cases in the > switch under FILE_TYPE_UNKNOWN? Could there be other cases where we > have FILE_TYPE_UNKNOWN but GetFileType() does not fail? > I don't think we should set st_mode for FILE_TYPE_UNKNOWN. > Per the documentation of GetFileType, FILE_TYPE_REMOTE is unused: > > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfiletype > Perhaps it would be safer to fail in this case? > +1, we don't know what that might involve. [1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/get-osfhandle?view=msvc-170 Regards, Juan José Santamaría Flecha
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On 07/03/2023 13:38, Maxim Orlov wrote:
> As for making pg_multixact 64-bit, I spent the last couple of days making a
> proper pg_upgrade path for pg_multixacts and pg_xacts with wraparound, and
> I've come to understand that it is not a simple task compared to pg_xact.
> The problem is, we do not have an epoch for multixacts, so we have no way to
> "overcome" wraparound. The solution may be adding some kind of epoch for
> multixacts, or making them 64-bit in the "main" 64-xid patch, but from the
> perspective of this thread, in my view, this should be last in line here.

That is true for pg_multixact/offsets. We will indeed need to add an epoch and introduce the concept of FullMultiXactIds for that. However, we can change pg_multixact/members independently of that. We can extend MultiXactOffset from 32 to 64 bits, and eliminate pg_multixact/members wraparound, while keeping multi-xids 32 bits wide.

- Heikki
Re: Raising the SCRAM iteration count
> On 7 Mar 2023, at 09:26, Daniel Gustafsson wrote:
> Right, what I meant was: can a pg_regress sql/expected test drive a psql
> interactive prompt? Your comments suggested using password.sql so I was
> curious if I was missing a neat trick for doing this.

The attached v7 adds a TAP test verifying that \password uses the changed SCRAM iteration count setting, and dials back the other added test to use fewer iterations than the default setting in order to shave (barely noticeable amounts of) CPU cycles. Running interactive tests against psql adds a fair bit of complexity and isn't all that pleasing to the eye, but it can be cleaned up and refactored when https://commitfest.postgresql.org/42/4228/ is committed.

-- Daniel Gustafsson

v7-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data
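A rough psql sketch of what the new TAP test exercises, assuming the GUC name used in this patch series (scram_iterations) and a placeholder role name:

-- Set a non-default iteration count for this session, then let \password
-- build the SCRAM secret with it.
SET scram_iterations = 1024;
\password alice
-- A superuser can then eyeball the iteration count stored in the verifier:
SELECT left(rolpassword, 30) FROM pg_authid WHERE rolname = 'alice';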
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi,

> Here's how I see the development path for this [...]
> So, in my view, the plan should be [...]
> Thoughts?

The plan looks great! I would also explicitly include this:

> As we start to refactor these things, I also think it would be good to
> have more explicit tracking of the valid range of SLRU pages in each
> SLRU. Take pg_subtrans for example: it's not very clear what pages have
> been initialized, especially during different stages of startup. It
> would be good to have clear start and end page numbers, and throw an
> error if you try to look up anything outside those bounds. Same for all
> other SLRUs.

So the plan becomes:

1. Use internal 64-bit page numbers in SLRUs without changing segment naming.
2. Use the larger segment file names in async.c, to lift the current 8 GB limit on the max number of pending notifications.
3. Extend pg_xact to 64 bits.
4. Extend pg_subtrans to 64 bits.
5. Extend pg_multixact so that pg_multixact/members is addressed by 64-bit offsets.
6. Extend pg_twophase.c to use FullTransactionIds. (a bonus thing)
7. More explicit tracking of the valid range of SLRU pages in each SLRU.

Our main focus for PG16 is going to be 1 and 2, and we may try to deliver 6 and 7 too if time permits. Maxim and I agreed (offlist) that I'm going to submit 1 and 2. Patch 1 is ready, please see the attachment. I'm currently working on 2 and going to submit it in a bit. It seems to be relatively straightforward, but I don't want to rush things and want to make sure I didn't miss something.

> I wonder if we should actually add an artificial limit, as a GUC.

Yes, I think we need some sort of limit. Using a GUC would be the most straightforward approach. Alternatively we could derive the limit from the existing GUCs, similarly to how we derive the default value of wal_buffers from shared_buffers [1]. However, off the top of my head we only have max_wal_size, and it doesn't strike me as a good candidate for deriving something for NOTIFY / LISTEN. So I'm going to add a max_notify_segments GUC with a default value of 65536, as it is now. If the new GUC receives pushback from the community we can always use a hard-coded value instead, or no limit at all.

[1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-BUFFERS

--
Best regards,
Aleksander Alekseev

v56-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Amit, Peter > > > Let me give an example to demonstrate why I thought something is fishy > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > Hmm, alright, this is syntactically possible, but not sure if any user would do this. Still thanks for catching this. And, you are right, if a user has created such a schema, IdxIsRelationIdentityOrPK() would return the wrong result and we'd use sequential scan instead of index scan. This would be a regression. I think we should change the function. Here is the example: DROP TABLE tab1; CREATE TABLE tab1 (a int NOT NULL); CREATE UNIQUE INDEX replica_unique ON tab1(a); ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); I'm attaching v35. Does that make sense to you Amit? v35_0002_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data v35_0001_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data
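For what it's worth, after running the statements above the catalogs show both a replica identity index and a separate primary key index on tab1, which is exactly the shape that trips up IdxIsRelationIdentityOrPK(); a quick way to see it:

-- replica_unique should show indisreplident = true, pkey indisprimary = true.
SELECT indexrelid::regclass AS index_name, indisreplident, indisprimary
FROM pg_index
WHERE indrelid = 'tab1'::regclass;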
Re: some problem explicit_bzero with building PostgreSQL on linux
Dimitry Markman writes: > how we can guaranty that if HAVE_EXPLICIT_BZERO is not defined then > explicit_bzero function implemented in port/explicit_bzero.c will be used > (just like in Darwin or windows) Did you remember to add explicit_bzero.o to LIBOBJS in the configured Makefile.global? If it still doesn't work, then evidently your toolchain is selecting the system's built-in definition of explicit_bzero over the one in src/port/. This is not terribly surprising given that there has to be some amount of compiler magic involved in that function. You may have to resort to actually building Postgres on a platform without explicit_bzero. regards, tom lane
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Tue, 7 Mar 2023 at 15:38, Heikki Linnakangas wrote: > > That is true for pg_multixact/offsets. We will indeed need to add an > epoch and introduce the concept of FullMultiXactIds for that. However, > we can change pg_multixact/members independently of that. We can extend > MultiXactOffset from 32 to 64 bits, and eliminate pg_multixact/members > wraparound, while keeping multi-xids 32 bits wide. > > Yes, you're totally correct. If it will be committable that way, I'm all for that. -- Best regards, Maxim Orlov.
Re: Add a hook to allow modification of the ldapbindpasswd
On 2023-03-06 Mo 15:16, Gregory Stark (as CFM) wrote: The CFBot says this patch is failing but I find it hard to believe this is related to this patch... 2023-03-05 20:56:58.705 UTC [33902][client backend] [pg_regress/btree_index][18/750:0] STATEMENT: ALTER INDEX btree_part_idx ALTER COLUMN id SET (n_distinct=100); 2023-03-05 20:56:58.709 UTC [33902][client backend] [pg_regress/btree_index][:0] LOG: disconnection: session time: 0:00:02.287 user=postgres database=regression host=[local] 2023-03-05 20:56:58.710 UTC [33889][client backend] [pg_regress/join][:0] LOG: disconnection: session time: 0:00:02.289 user=postgres database=regression host=[local] 2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG: server process (PID 33898) was terminated by signal 6: Abort trap 2023-03-05 20:56:58.749 UTC [33045][postmaster] DETAIL: Failed process was running: SELECT * FROM writetest; 2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG: terminating any other active server processes Yeah. It says it's fine now. Neither of the two recent failures look like they have anything to do with this. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: GUC for temporarily disabling event triggers
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I like it now.
* The patch does what it intends to do;
* The implementation is clear;
* All tests pass;
* No additional problems caught - at least by my eyes.

I think it can be marked as Ready for Committer.

N.B. In fact I've encountered a couple of failed tests during installcheck-world, although the same failures are there even on the master branch, so they don't seem to be an issue with this patch.

The new status of this patch is: Ready for Committer
Re: Add shared buffer hits to pg_stat_io
Hi, On 3/6/23 4:38 PM, Melanie Plageman wrote: Thanks for the review! On Tue, Feb 28, 2023 at 7:36 AM Drouvot, Bertrand wrote: BufferDesc * LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, -bool *foundPtr, IOContext *io_context) +bool *foundPtr, IOContext io_context) { BufferTag newTag; /* identity of requested block */ LocalBufferLookupEnt *hresult; @@ -128,14 +128,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, hresult = (LocalBufferLookupEnt *) hash_search(LocalBufHash, &newTag, HASH_FIND, NULL); - /* -* IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set -* io_context here (instead of after a buffer hit would have returned) for -* convenience since we don't have to worry about the overhead of calling -* IOContextForStrategy(). -*/ - *io_context = IOCONTEXT_NORMAL; It looks like that io_context is not used in LocalBufferAlloc() anymore and then can be removed as an argument. Good catch. Updated patchset attached. Thanks for the update! While adding this, I noticed that I had made all of the IOOP columns int8 in the view, and I was wondering if this is sufficient for hits (I imagine you could end up with quite a lot of those). I think that's ok and bigint is what is already used for pg_statio_user_tables.heap_blks_hit for example. Ah, I was silly and didn't understand that the SQL type int8 is eight bytes and not 1. That makes a lot of things make more sense :) Oh, I see ;-) I may give it another review but currently V2 looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
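For readers following along, a sketch of how the new column would typically be consumed once it lands, with column names as in the patch under review:

-- Share of shared-buffer hits versus actual reads per backend type, for
-- relation data in the normal IO context (hits is the column this patch adds).
SELECT backend_type, hits, reads,
       round(hits::numeric / nullif(hits + reads, 0), 2) AS hit_ratio
FROM pg_stat_io
WHERE object = 'relation' AND context = 'normal';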
Re: Track IO times in pg_stat_io
Hi,

On 3/6/23 5:30 PM, Melanie Plageman wrote:
> Thanks for the review!
>
> On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand wrote:
>> On 2/26/23 5:03 PM, Melanie Plageman wrote:
>>> The timings will only be non-zero when track_io_timing is on
>>
>> That could lead to incorrect interpretation if one wants to divide the
>> timing per operation, say:
>> - track_io_timing is set to on while there are already operations
>> - or set to off while it was on (and the number of operations keeps growing)
>> Might be worth warning/highlighting this in the "track_io_timing" doc?
>
> This is a good point. I've added a note to the docs for pg_stat_io.

Thanks!

Now I have a second thought: what do you think about resetting the related number of operations and *_time fields when enabling/disabling track_io_timing? (And mentioning it in the doc.)

That way it would prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) of the benefits of the new *_time columns if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples where the track_io_timing changes occur).

But then, resetting the operations could also lead to bad interpretation about the operations... Not sure which approach I like the most yet, what do you think?

>> That leads to pgstat_count_io_time() being called before
>> pgstat_count_io_op() (for the IOOP_EXTEND case) and after
>> pgstat_count_io_op() (for the IOOP_READ case). What about calling them in
>> the same order, so that pgstat_count_io_time() is called before
>> pgstat_count_io_op()? If so, the ordering would also need to be changed in:
>> - FlushRelationBuffers()
>> - register_dirty_segment()
>
> Yes, good point. I've updated the code to use this suggested ordering in
> attached v3.

Thanks, this looks good to me.

>> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it
>> would be a good idea to also check that if counts are not zero then times
>> are not zero.
>
> Yes, I think adding some validation around the relationship between
> counts and timing should help prevent developers from forgetting to call
> pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> However, I think that we cannot check that if IO counts are non-zero then
> IO times are non-zero, because the user may not have track_io_timing
> enabled.

Yeah, right.

> We can check that if IO times are not zero, IO counts are not zero.
> I've done this in the attached v3.

Thanks, looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
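To illustrate the interpretation hazard being discussed, this is the kind of derived metric a monitoring query would compute; the *_time columns are the ones proposed in this patch series, so the column names are provisional:

-- Average write time per write, per backend type.  write_time only
-- accumulates while track_io_timing is on, so the ratio is misleading if the
-- GUC was toggled after writes had already been counted.
SELECT backend_type, context, writes, write_time,
       write_time / nullif(writes, 0) AS avg_write_time_ms
FROM pg_stat_io
WHERE object = 'relation'
ORDER BY writes DESC;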
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi, Andres and Alexander! On Tue, 7 Mar 2023, 10:10 Alexander Korotkov, wrote: > On Tue, Mar 7, 2023 at 4:50 AM Andres Freund wrote: > > On 2023-03-02 14:28:56 +0400, Pavel Borisov wrote: > > > 2. Heap updates with low tuple concurrency: > > > Prepare with pkeys (pgbench -d postgres -i -I dtGvp -s 300 > --unlogged-tables) > > > Update 3*10^7 rows, 50 conns (pgbench postgres -f > > > ./update-only-account.sql -s 300 -P10 -M prepared -T 600 -j 5 -c 50) > > > > > > Result: Both patches and master are the same within a tolerance of > > > less than 0.7%. > > > > What exactly does that mean? I would definitely not want to accept a 0.7% > > regression of the uncontended case to benefit the contended case here... > > I don't know what exactly Pavel meant, but average overall numbers for > low concurrency are. > master: 420401 (stddev of average 233) > patchset v11: 420111 (stddev of average 199) > The difference is less than 0.1% and that is very safely within the error. > Yes, the only thing that I meant is that for low-concurrency case the results between patch and master are within the difference between repeated series of measurements. So I concluded that the test can not prove any difference between patch and master. I haven't meant or written there is some performance degradation. Alexander, I suppose did an extra step and calculated overall average and stddev, from raw data provided. Thanks! Regards, Pavel. >
Re: Timeline ID hexadecimal format
On 06/03/2023 18:04, Peter Eisentraut wrote: On 03.03.23 16:52, Sébastien Lardière wrote: On 02/03/2023 09:12, Peter Eisentraut wrote: I think here it would be more helpful to show actual examples. Like, here is a possible file name, this is what the different parts mean. So you mean explain the WAL filename and the history filename ? Is it the good place for it ? Well, your patch says, by the way, the timeline ID in the file is hexadecimal. Then one might ask, what file, what is a timeline, what are the other numbers in the file, etc. It seems very specific in this context. I don't know if the format of these file names is actually documented somewhere. Well, in the context of this patch, the usage both filename are explained juste before, so it seems understandable to me Timelines are explained in this place : https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-TIMELINES so the patch explains the format there This applies to all configuration parameters, so it doesn't need to be mentioned explicitly for individual ones. Probably, but is there another parameter with the same consequence ? worth it to document this point globally ? It's ok to mention it again. We do something similar for example at unix_socket_permissions. But maybe with more context, like "If you want to specify a timeline ID hexadecimal (for example, if extracted from a WAL file name), then prefix it with a 0x". Ok, I've improved the message Maybe this could be fixed instead? Indeed, and strtoul is probably a better option than sscanf, don't you think ? Yeah, the use of sscanf() is kind of weird here. We have been moving the option parsing to use option_parse_int(). Maybe hex support could be added there. Or just use strtoul(). I've made the change with strtoul About option_parse_int(), actually, strtoint() is used, do we need a option_parse_ul() fonction ? patch attached, best regards, -- Sébastien diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index be05a33205..fb86a3fec5 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p' you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as -a result of experimentation. +a result of experimentation. The timeline identifier is an integer which +is used in hexadecimal format in both WAL segment file names and history files. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e5c41cc6c6..6c0d63b73d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4110,7 +4110,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows current when the base backup was taken. The value latest recovers to the latest timeline found in the archive, which is useful in -a standby server. latest is the default. +a standby server. A numerical value expressed in hexadecimal, by exemple +from WAL filename or history file, must be prefixed with 0x, +for example 0x11 if the WAL filename +is 001100A1004F. +latest is the default. diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml index 343f0482a9..d92948c68a 100644 --- a/doc/src/sgml/ref/pg_waldump.sgml +++ b/doc/src/sgml/ref/pg_waldump.sgml @@ -215,7 +215,9 @@ PostgreSQL documentation Timeline from which to read WAL records. 
The default is to use the value in startseg, if that is specified; otherwise, the -default is 1. +default is 1. The value can be expressed in decimal or hexadecimal format. +The hexadecimal format, given by WAL filename, must be preceded by 0x, +by example 0x11. diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 44b5c8726e..b29d5223ce 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -770,6 +770,7 @@ usage(void) printf(_(" -R, --relation=T/D/R only show records that modify blocks in relation T/D/R\n")); printf(_(" -s, --start=RECPTR start reading at WAL location RECPTR\n")); printf(_(" -t, --timeline=TLI timeline from which to read WAL records\n" + " hexadecimal value, from WAL filename, must be preceded by 0x\n" " (default: 1 or the value used in STARTSEG)\n")); printf(_(" -V, --version output version information, then exit\n")); printf(_(" -w, --fullpage only show records with a full page write\n")); @@ -1007,7 +1008,7 @@ main(int argc, char **argv) private.startptr = (uint64) xlogid << 32 | xrecoff; break; c
Re: some problem explicit_bzero with building PostgreSQL on linux
Hi Tom, thanks a lot Adding explicit_bzero.o did the job Thanks a lot dm From: Tom Lane Date: Tuesday, March 7, 2023 at 9:14 AM To: Dimitry Markman Cc: pgsql-hackers@lists.postgresql.org , Bhavya Dabas Subject: Re: some problem explicit_bzero with building PostgreSQL on linux Dimitry Markman writes: > how we can guaranty that if HAVE_EXPLICIT_BZERO is not defined then > explicit_bzero function implemented in port/explicit_bzero.c will be used > (just like in Darwin or windows) Did you remember to add explicit_bzero.o to LIBOBJS in the configured Makefile.global? If it still doesn't work, then evidently your toolchain is selecting the system's built-in definition of explicit_bzero over the one in src/port/. This is not terribly surprising given that there has to be some amount of compiler magic involved in that function. You may have to resort to actually building Postgres on a platform without explicit_bzero. regards, tom lane
Re: add PROCESS_MAIN to VACUUM
On Mon, Mar 6, 2023 at 10:45 PM Michael Paquier wrote: > > On Mon, Mar 06, 2023 at 04:59:49PM -0800, Nathan Bossart wrote: > > That did cross my mind, but I was worried that trying to explain all that > > here could cause confusion. > > > > If PROCESS_MAIN is set (the default), it's time to vacuum the main > > relation. Otherwise, we can skip this part. If processing the TOAST > > table is required (e.g., PROCESS_TOAST is set), we'll force > > PROCESS_MAIN to be set when we recurse to the TOAST table so that it > > gets processed here. > > > > How does that sound? > > Sounds clear to me, thanks! Melanie, what do you think? Yes, sounds clear to me also! - Melanie
Re: [PATCH] Support % wildcard in extension upgrade filenames
On Tue, Jan 10, 2023 at 6:50 PM Tom Lane wrote: > The script-file-per-upgrade-path aspect solves a problem that you > have, whether you admit it or not; I think you simply aren't realizing > that because you have not had to deal with the consequences of > your proposed feature. Namely that you won't have any control > over what the backend will try to do in terms of upgrade paths. > > As an example, suppose that a database has foo 4.0 installed, and > the DBA decides to try to downgrade to 3.0. With the system as it > stands, if you've provided foo--4.0--3.0.sql then the conversion > will go through, and presumably it will work because you tested that > that script does what it is intended to. If you haven't provided > any such downgrade script, then ALTER EXTENSION UPDATE will say > "Sorry Dave, I'm afraid I can't do that" and no harm is done. I mean, there is nothing to prevent the extension author from writing a script which ensures that the extension membership after the downgrade is exactly what it should be for version 3.0, regardless of what it was before. SQL is Turing-complete. The thing that confuses me here is why the PostGIS folks are ending up with so many files. We certainly don't have that problem with the extension that are being maintained in contrib, and I guess there is some difference in versioning practice that is making it an issue for them but not for us. I wish I understood what was going on there. But that to one side, there is clearly a problem here, and I think PostgreSQL ought to provide some infrastructure to help solve it, even if the ultimate cause of that problem is that the PostGIS folks managed their extension versions in some less-than-ideal way. I can't shake the feeling that you're just holding your breath here and hoping the problem goes away by itself, and judging by the responses, that doesn't seem like it's going to happen. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote: > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart > wrote: >> Is it possible to memcpy more than a page at a time? > > It would complicate things a lot there; the logic to figure out the > last page bytes that may or may not fit in the whole page gets > complicated. Also, the logic to verify each page's header gets > complicated. We might lose out if we memcpy all the pages at once and > start verifying each page's header in another loop. Doesn't the complicated logic you describe already exist to some extent in the patch? You are copying a page at a time, which involves calculating various addresses and byte counts. >> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for >> given LSN %X/%X, Timeline ID %u", >> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli); >> >> I definitely don't think we should put an elog() in this code path. >> Perhaps this should be guarded behind WAL_DEBUG. > > Placing it behind WAL_DEBUG doesn't help users/developers. My > intention was to let users know that the WAL read hit the buffers, > it'll help them report if any issue occurs and also help developers to > debug that issue. I still think an elog() is mighty expensive for this code path, even when it doesn't actually produce any messages. And when it does, I think it has the potential to be incredibly noisy. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: HOT chain validation in verify_heapam()
On Mon, Mar 6, 2023 at 12:36 PM Robert Haas wrote: > So it seems that we still don't have a patch where the > value of a variable called lp_valid corresponds to whether or not the > L.P. is valid. Here's a worked-over version of this patch. Changes: - I got rid of the code that sets lp_valid in funny places and instead arranged to have check_tuple_visibility() pass up the information on the XID status. That's important, because we can't casually apply operations like TransactionIdIsCommitted() to XIDs that, for all we know, might not even be in the range covered by CLOG. In such cases, we should not perform any HOT chain validation because we can't do it sensibly; the new code accomplishes this, and also reduces the number of CLOG lookups as compared with your version. - I moved most of the HOT chain checks from the loop over the predecessor[] array to the loop over the successor[] array. It didn't seem to have any value to put them in the third loop; it forces us to expend extra code to distinguish between redirects and tuples, information that we already had in the second loop. The only check that seems to make sense to do in that last loop is the one for a HOT chain that starts with a HOT tuple, which can't be done any earlier. - I realized that your patch had a guard against setting the predecessor[] when it was set already only for tuples, not for redirects. That means if a redirect pointed into the middle of a HOT chain we might not report corruption appropriately. I fixed this and reworded the associated messages a bit. - Assorted cosmetic and comment changes. I think this is easier to follow and more nearly correct, but what do you (and others) think? -- Robert Haas EDB: http://www.enterprisedb.com 0001-Implement-HOT-chain-validation-in-verify_heapam.patch Description: Binary data
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > Good point. Attached is what you suggested. I committed the transaction > before the drop table so that the statistics would be visible when we > queried pg_stat_io. Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. Greetings, Andres Freund
Re: HOT chain validation in verify_heapam()
> On Mar 7, 2023, at 10:16 AM, Robert Haas wrote: > > On Mon, Mar 6, 2023 at 12:36 PM Robert Haas wrote: >> So it seems that we still don't have a patch where the >> value of a variable called lp_valid corresponds to whether or not the >> L.P. is valid. > > Here's a worked-over version of this patch. Changes: > > - I got rid of the code that sets lp_valid in funny places and instead > arranged to have check_tuple_visibility() pass up the information on > the XID status. That's important, because we can't casually apply > operations like TransactionIdIsCommitted() to XIDs that, for all we > know, might not even be in the range covered by CLOG. In such cases, > we should not perform any HOT chain validation because we can't do it > sensibly; the new code accomplishes this, and also reduces the number > of CLOG lookups as compared with your version. > > - I moved most of the HOT chain checks from the loop over the > predecessor[] array to the loop over the successor[] array. It didn't > seem to have any value to put them in the third loop; it forces us to > expend extra code to distinguish between redirects and tuples, > information that we already had in the second loop. The only check > that seems to make sense to do in that last loop is the one for a HOT > chain that starts with a HOT tuple, which can't be done any earlier. > > - I realized that your patch had a guard against setting the > predecessor[] when it was set already only for tuples, not for > redirects. That means if a redirect pointed into the middle of a HOT > chain we might not report corruption appropriately. I fixed this and > reworded the associated messages a bit. > > - Assorted cosmetic and comment changes. > > I think this is easier to follow and more nearly correct, but what do > you (and others) think? Thanks, Robert. Quickly skimming over this patch, it looks like something reviewable. Your changes to t/004_verify_heapam.pl appear to be consistent with how that test was intended to function. Note that I have not tried any of this yet. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > On the other hand, we already have a similar problem with > recovery_min_apply_delay combined with hot_standby_feedback [1]. > So, that probably is an acceptable trade-off for the pgsql-hackers. > If you use this feature, you should be even more careful. Yes, but it's possible to turn off hot_standby_feedback so that you don't incur bloat on the primary. And you don't need to store hours or days of WAL on the primary. I'm very late to this thread, but IIUC you cannot avoid blocking VACUUM with the proposed feature. IMO the current set of trade-offs (e.g., unavoidable bloat and WAL buildup) would make this feature virtually unusable for a lot of workloads, so it's probably worth exploring an alternative approach. In any case, we probably shouldn't rush this into v16 in its current form. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
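For context, the proposal is roughly the logical-replication analogue of recovery_min_apply_delay; a sketch of the intended usage, assuming the subscription parameter keeps the name used in the patch series (min_apply_delay) and with placeholder connection/publication names:

-- Hypothetical usage per the proposed patch: hold back applying changes on
-- the subscriber by two hours.  While changes are held back, the publisher
-- must keep the WAL and xmin horizon the slot depends on, which is the
-- bloat/WAL-buildup concern raised above.
CREATE SUBSCRIPTION delayed_sub
    CONNECTION 'host=primary dbname=app'
    PUBLICATION app_pub
    WITH (min_apply_delay = '2h');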
Re: Track IO times in pg_stat_io
Hi, On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote: > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that > > would be a good idea > > to also check that if counts are not Zero then times are not Zero. > > Yes, I think adding some validation around the relationship between > counts and timing should help prevent developers from forgetting to call > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant). > > However, I think that we cannot check that if IO counts are non-zero > that IO times are non-zero, because the user may not have > track_io_timing enabled. We can check that if IO times are not zero, IO > counts are not zero. I've done this in the attached v3. And even if track_io_timing is enabled, the timer granularity might be so low that we *still* get zeroes. I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime, > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char > relpersistence, ForkNumber forkNum, > > if (isExtend) > { > + instr_time io_start, > + io_time; > + > /* new buffers are zero-filled */ > MemSet((char *) bufBlock, 0, BLCKSZ); > + > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + I wonder if there's an argument for tracking this in the existing IO stats as well. But I guess we've lived with this for a long time... > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, > IOObject io_object, >* When a strategy is not in use, the write can only be a "regular" > write >* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE). >*/ > - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE); > - > if (track_io_timing) > { > INSTR_TIME_SET_CURRENT(io_time); > INSTR_TIME_SUBTRACT(io_time, io_start); > > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); > INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); > + pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, > io_time); > } I think this needs a bit of cleanup - pgstat_count_buffer_write_time(), pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We might not be able to reduce the whole duplication at this point, but at least it should be a bit more centralized. > + pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE); > pgBufferUsage.shared_blks_written++; > > /* > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel) > > if (RelationUsesLocalBuffers(rel)) > { > + instr_time io_start, > + io_time; > + > for (i = 0; i < NLocBuffer; i++) > { > uint32 buf_state; > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel) > > PageSetChecksumInplace(localpage, > bufHdr->tag.blockNum); > > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + > smgrwrite(RelationGetSmgr(rel), > > BufTagGetForkNum(&bufHdr->tag), > bufHdr->tag.blockNum, I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to silence the compiler warnings you can do it one level up. > @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, > BlockNumber blockNum, > > PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); > > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + > /* And write... 
*/ > smgrwrite(oreln, > BufTagGetForkNum(&bufHdr->tag), > @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, > BlockNumber blockNum, > buf_state &= ~BM_DIRTY; > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); > > + if (track_io_timing) > + { > + INSTR_TIME_SET_CURRENT(io_time); > + INSTR_TIME_SUBTRACT(io_time, io_start); > + pgstat_count_io_time(IOOBJECT_TEMP_RELATION, > IOCONTEXT_NORMAL, IOOP_WRITE, io_time); > + } > + > pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, > IOOP_WRITE); > pgBufferUsage.local_blks_written++; > } Perhaps we can inste
Re: Track IO times in pg_stat_io
Thanks for taking another look! On Tue, Mar 7, 2023 at 10:52 AM Drouvot, Bertrand wrote: > On 3/6/23 5:30 PM, Melanie Plageman wrote: > > Thanks for the review! > > > > On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand > > wrote: > >> On 2/26/23 5:03 PM, Melanie Plageman wrote: > >>> The timings will only be non-zero when track_io_timing is on > >> > >> That could lead to incorrect interpretation if one wants to divide the > >> timing per operations, say: > >> > >> - track_io_timing is set to on while there is already operations > >> - or set to off while it was on (and the number of operations keeps > >> growing) > >> > >> Might be worth to warn/highlight in the "track_io_timing" doc? > > > > This is a good point. I've added a note to the docs for pg_stat_io. > > Thanks! > > Now I've a second thought: what do you think about resetting the related > number > of operations and *_time fields when enabling/disabling track_io_timing? (And > mention it in the doc). > > That way it'd prevent bad interpretation (at least as far the time per > operation metrics are concerned). > > Thinking that way as we'd loose some (most?) benefits of the new *_time > columns > if one can't "trust" their related operations and/or one is not sampling > pg_stat_io frequently enough (to discard the samples > where the track_io_timing changes occur). > > But well, resetting the operations could also lead to bad interpretation > about the operations... > > Not sure about which approach I like the most yet, what do you think? Oh, this is an interesting idea. I think you are right about the synchronization issues making the statistics untrustworthy and, thus, unuseable. Building on your idea, what if we had the times be NULL instead of zero when track_io_timing is disabled? Then as you suggested, when you enable track_io_timing, it resets the IOOp counts and starts the times off at zero. However, disabling track_io_timing would only NULL out the times and not zero out the counts. We could also, as you say, log these events. - Melanie
Re: Track IO times in pg_stat_io
On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
> > Now I have a second thought: what do you think about resetting the related
> > number of operations and *_time fields when enabling/disabling
> > track_io_timing? (And mentioning it in the doc.)
> >
> > That way it would prevent bad interpretation (at least as far as the time
> > per operation metrics are concerned).
> >
> > I'm thinking that way because we'd lose some (most?) of the benefits of
> > the new *_time columns if one can't "trust" their related operations
> > and/or one is not sampling pg_stat_io frequently enough (to discard the
> > samples where the track_io_timing changes occur).
> >
> > But then, resetting the operations could also lead to bad interpretation
> > about the operations...
> >
> > Not sure which approach I like the most yet, what do you think?
>
> Oh, this is an interesting idea. I think you are right about the
> synchronization issues making the statistics untrustworthy and, thus,
> unusable.

No, I don't think we can do that. It can be enabled on a per-session basis. I think we simply shouldn't do anything here. This is a pre-existing issue. I also think that losing stats when turning track_io_timing on/off would not be helpful.

Greetings,

Andres Freund
Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security
On Thu, Feb 9, 2023 at 4:46 PM Jacob Champion wrote: > On 2/6/23 08:22, Robert Haas wrote: > > I don't think that's quite the right concept. It seems to me that the > > client is responsible for informing the server of what the situation > > is, and the server is responsible for deciding whether to allow the > > connection. In your scenario, the client is not only communicating > > information ("here's the password I have got") but also making demands > > on the server ("DO NOT authenticate using anything else"). I like the > > first part fine, but not the second part. > > For what it's worth, making a negative demand during authentication is > pretty standard: if you visit example.com and it tells you "I need your > OS login password and Social Security Number to authenticate you," you > have the option of saying "no thanks" and closing the tab. No, that's the opposite, and exactly the point I'm trying to make. In that case, the SERVER says what it's willing to accept, and the CLIENT decides whether or not to provide that. In your proposal, the client tells the server which authentication methods to accept. > In a hypothetical world where the server presented the client with a > list of authentication options before allowing any access, this would > maybe be a little less convoluted to solve. For example, a proxy seeing > a SASL list of > > - ANONYMOUS > - EXTERNAL > > could understand that both methods allow the client to assume the > authority of the proxy itself. So if its client isn't allowed to do > that, the proxy realizes something is wrong (either it, or its target > server, has been misconfigured or is under attack), and it can close the > connection *before* the server runs login triggers. Yep, that totally makes sense to me, but I don't think it's what you proposed. > This sounds like a reasonable separation of responsibilities on the > surface, but I think it's subtly off. The entire confused-deputy problem > space revolves around the proxy being unable to correctly decide which > connections to allow unless it also knows why the connections are being > authorized. I agree. > You've constructed an example where that's not a concern: everything's > symmetrical, all proxies operate with the same authority, and internal > users are identical to external users. But the CVE that led to the > password requirement, as far as I can tell, dealt with asymmetry. The > proxy had the authority to connect locally to a user, and the clients > had the authority to connect to other machines' users, but those users > weren't the same and were not mutually trusting. Yeah, agreed. So, I think the point here is that the proxy configuration (and pg_hba.conf) need to be sufficiently powerful that each user can permit the things that make sense in their environment and block the things that don't. I don't think we're really very far apart here, but for some reason the terminology seems to be giving us some trouble. Of course, there's also the small problem of actually finding the time to do some meaningful work on this stuff, rather than just talking -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Support % wildcard in extension upgrade filenames
Robert Haas writes: > On Tue, Jan 10, 2023 at 6:50 PM Tom Lane wrote: >> As an example, suppose that a database has foo 4.0 installed, and >> the DBA decides to try to downgrade to 3.0. With the system as it >> stands, if you've provided foo--4.0--3.0.sql then the conversion >> will go through, and presumably it will work because you tested that >> that script does what it is intended to. If you haven't provided >> any such downgrade script, then ALTER EXTENSION UPDATE will say >> "Sorry Dave, I'm afraid I can't do that" and no harm is done. > I mean, there is nothing to prevent the extension author from writing > a script which ensures that the extension membership after the > downgrade is exactly what it should be for version 3.0, regardless of > what it was before. SQL is Turing-complete. It may be Turing-complete, but I seriously doubt that that's sufficient for the problem I'm thinking of, which is to downgrade from an extension version that you never heard of at the time the script was written. In the worst case, that version might even contain objects of types you never heard of and don't know how to drop. You can imagine various approaches to deal with that; for instance, maybe we could provide some kind of command to deal with dropping an object identified by classid and objid, which the upgrade script could scrape out of pg_depend. After writing even more code to issue those drops in dependency-aware order, you could get on with modifying the objects you do know about ... but maybe those now have properties you never heard of and don't know how to adjust. Whether this is all theoretically possible is sort of moot. What I am maintaining is that no extension author is actually going to write such a script, indeed they probably won't trouble to write any downgrade-like actions at all. Which makes the proposed design mostly a foot-gun. > But that to one side, there is clearly a problem here, and I think > PostgreSQL ought to provide some infrastructure to help solve it, even > if the ultimate cause of that problem is that the PostGIS folks > managed their extension versions in some less-than-ideal way. I can't > shake the feeling that you're just holding your breath here and hoping > the problem goes away by itself, and judging by the responses, that > doesn't seem like it's going to happen. I'm not unsympathetic to the idea of trying to support multiple upgrade paths in one script. I just don't like this particular design for that, because it requires the extension author to make promises that nobody is actually going to deliver on. regards, tom lane
RE: [PATCH] Support % wildcard in extension upgrade filenames
> The thing that confuses me here is why the PostGIS folks are ending up with
> so many files. We certainly don't have that problem with the extensions that
> are being maintained in contrib, and I guess there is some difference in
> versioning practice that is making it an issue for them but not for us. I
> wish I understood what was going on there.

The contrib files are minor versioned. PostGIS is micro versioned. So we have, for example, postgis--3.3.0--3.3.1.sql.

Also, we have 5 extensions for which we ship all micro versions, so multiply that issue by 5:

postgis
postgis_raster
postgis_topology
postgis_tiger_geocoder
postgis_sfcgal

The other wrinkle we have that I don't think postgresql's contrib has is that we support each version across multiple PostgreSQL versions. So for example our 3.3 series supports PostgreSQL 12-15 (with plans to also support 16).
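To spell out why micro versioning multiplies into so many files: the server satisfies ALTER EXTENSION ... UPDATE by chaining whatever foo--X--Y.sql scripts exist on disk, so every hop between consecutive micro versions needs its own script, for each of the five extensions (version numbers below are only illustrative):

-- Requires a chain such as postgis--3.3.0--3.3.1.sql and
-- postgis--3.3.1--3.3.2.sql to be present; the same applies to
-- postgis_raster, postgis_topology, postgis_tiger_geocoder and
-- postgis_sfcgal, hence the multiplier described above.
ALTER EXTENSION postgis UPDATE TO '3.3.2';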
Re: buildfarm + meson
Hi, On 2023-03-01 13:32:58 -0800, Andres Freund wrote: > On 2023-03-01 16:21:32 -0500, Andrew Dunstan wrote: > > Perhaps the latest version will be more to your taste. > > I'll check it out. A simple conversion from an existing config failed with: Can't use an undefined value as an ARRAY reference at /home/bf/src/pgbuildfarm-client-meson/PGBuild/Modules/TestICU.pm line 37. I disabled TestICU and was able to progress past that. ... piculet-meson:HEAD [19:12:48] setting up db cluster (C)... piculet-meson:HEAD [19:12:48] starting db (C)... piculet-meson:HEAD [19:12:48] running installcheck (C)... piculet-meson:HEAD [19:12:57] restarting db (C)... piculet-meson:HEAD [19:12:59] running meson misc installchecks (C) ... Branch: HEAD Stage delay_executionInstallCheck-C failed with status 1 The failures are like this: +ERROR: extension "dummy_index_am" is not available +DETAIL: Could not open extension control file "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": No such file or directory. +HINT: The extension must first be installed on the system where PostgreSQL is running. I assume this is in an interaction with b6a0d469cae. I think we need a install-test-modules or such that installs into the normal directory. Greetings, Andres Freund
RE: [PATCH] Support % wildcard in extension upgrade filenames
> I'm not unsympathetic to the idea of trying to support multiple upgrade paths
> in one script. I just don't like this particular design for that, because it
> requires the extension author to make promises that nobody is actually going
> to deliver on.
>
> regards, tom lane

How about the idea I mentioned, of revising the patch to read versioned upgrade paths from the control file rather than relying on said file to exist?

https://www.postgresql.org/message-id/000201d92572%247dcd8260%2479688720%24%40pcorp.us

Even better, we could have an additional control file, something like postgis--paths.control, that has separate lines to itemize those paths. It would be nice if we could allow wild-cards in that, but I could live without that if we can stop shipping 300 empty files.

Thanks,
Regina
Re: Add shared buffer hits to pg_stat_io
Hi, LGTM. The only comment I have is that a small test wouldn't hurt... Compared to the other things it should be fairly easy... Greetings, Andres Freund
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi, On 2023-03-07 08:22:45 +0530, Amit Kapila wrote: > On Tue, Mar 7, 2023 at 1:34 AM Andres Freund wrote: > > I think even as-is it's reasonable to just use it. The sequential scan > > approach is O(N^2), which, uh, is not good. And having an index over > > thousands > > of non-differing values will generally perform badly, not just in this > > context. > > > Yes, it is true that generally also index scan with a lot of > duplicates may not perform well but during the scan, we do costing to > ensure such cases and may prefer other index or sequence scan. Then we > have "enable_indexscan" GUC that the user can use if required. So, I > feel it is better to have a knob to disallow usage of such indexes and > the default would be to use an index, if available. It just feels like we're optimizing for an irrelevant case here. If we add GUCs for irrelevant things like this we'll explode the number of GUCs even faster than we already are. Greetings, Andres Freund
Re: Add support for unit "B" to pg_size_pretty()
On 06.03.23 09:27, David Rowley wrote: On Mon, 6 Mar 2023 at 21:13, Peter Eisentraut wrote: On 02.03.23 20:58, David Rowley wrote: I think I'd prefer to see the size_bytes_unit_alias struct have an index into size_pretty_units[] array. i.e: Ok, done that way. (I had thought about that, but I was worried that that would be too error-prone to maintain. But I suppose the tables don't change that often, and test cases would easily catch mistakes.) Patch looks pretty good. I just see a small spelling mistake in: +/* Additional unit aliases acceted by pg_size_bytes */ I also updated the documentation a bit more. I see I must have forgotten to add PB to the docs when pg_size_pretty had that unit added. I guess you added the "etc" to fix that? I'm wondering if that's the right choice. You modified the comment above size_pretty_units[] to remind us to update the docs when adding units, but the docs now say "etc", so do we need to? I'd likely have gone with just adding "PB" to the docs, that way it's pretty clear that new units need to be mentioned in the docs. Ok, I have fixed the original documentation to that effect and backpatched it. The remaining patch has been updated accordingly and committed also.
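A minimal illustration of the alias handling discussed above; per the quoted hunk, "B" becomes an accepted input alias for pg_size_bytes():

-- Both spellings should now parse to the same value.
SELECT pg_size_bytes('100 B');
SELECT pg_size_bytes('100 bytes');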
Re: buildfarm + meson
On 2023-03-07 Tu 14:37, Andres Freund wrote: Hi, On 2023-03-01 13:32:58 -0800, Andres Freund wrote: On 2023-03-01 16:21:32 -0500, Andrew Dunstan wrote: Perhaps the latest version will be more to your taste. I'll check it out. A simple conversion from an existing config failed with: Can't use an undefined value as an ARRAY reference at /home/bf/src/pgbuildfarm-client-meson/PGBuild/Modules/TestICU.pm line 37. I disabled TestICU and was able to progress past that. Pushed a fix for that. ... piculet-meson:HEAD [19:12:48] setting up db cluster (C)... piculet-meson:HEAD [19:12:48] starting db (C)... piculet-meson:HEAD [19:12:48] running installcheck (C)... piculet-meson:HEAD [19:12:57] restarting db (C)... piculet-meson:HEAD [19:12:59] running meson misc installchecks (C) ... Branch: HEAD Stage delay_executionInstallCheck-C failed with status 1 The failures are like this: +ERROR: extension "dummy_index_am" is not available +DETAIL: Could not open extension control file "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": No such file or directory. +HINT: The extension must first be installed on the system where PostgreSQL is running. I assume this is in an interaction with b6a0d469cae. I think we need a install-test-modules or such that installs into the normal directory. Exactly. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: add PROCESS_MAIN to VACUUM
On Tue, Mar 07, 2023 at 12:39:29PM -0500, Melanie Plageman wrote: > Yes, sounds clear to me also! Here is an updated patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 580f966499..0acc42af2b 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2058,25 +2058,34 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs) save_nestlevel = NewGUCNestLevel(); /* - * Do the actual work --- either FULL or "lazy" vacuum + * If PROCESS_MAIN is set (the default), it's time to vacuum the main + * relation. Otherwise, we can skip this part. If processing the TOAST + * table is required (e.g., PROCESS_TOAST is set), we'll force PROCESS_MAIN + * to be set when we recurse to the TOAST table so that it gets processed + * here. */ - if ((params->options & VACOPT_FULL) && - (params->options & VACOPT_PROCESS_MAIN)) + if (params->options & VACOPT_PROCESS_MAIN) { - ClusterParams cluster_params = {0}; + /* + * Do the actual work --- either FULL or "lazy" vacuum + */ + if (params->options & VACOPT_FULL) + { + ClusterParams cluster_params = {0}; - /* close relation before vacuuming, but hold lock until commit */ - relation_close(rel, NoLock); - rel = NULL; + /* close relation before vacuuming, but hold lock until commit */ + relation_close(rel, NoLock); + rel = NULL; - if ((params->options & VACOPT_VERBOSE) != 0) - cluster_params.options |= CLUOPT_VERBOSE; + if ((params->options & VACOPT_VERBOSE) != 0) +cluster_params.options |= CLUOPT_VERBOSE; - /* VACUUM FULL is now a variant of CLUSTER; see cluster.c */ - cluster_rel(relid, InvalidOid, &cluster_params); + /* VACUUM FULL is now a variant of CLUSTER; see cluster.c */ + cluster_rel(relid, InvalidOid, &cluster_params); + } + else + table_relation_vacuum(rel, params, vac_strategy); } - else if (params->options & VACOPT_PROCESS_MAIN) - table_relation_vacuum(rel, params, vac_strategy); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel);
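As a usage-level summary of what the reworked block implements (mytable is a placeholder): turning PROCESS_MAIN off skips the main relation while the TOAST table can still be vacuumed, because PROCESS_MAIN is forced back on when recursing to TOAST.

-- Vacuum only the TOAST table of mytable.
VACUUM (PROCESS_MAIN FALSE) mytable;

-- Vacuum only the main relation, skipping its TOAST table.
VACUUM (PROCESS_TOAST FALSE) mytable;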
Re: psql: Add role's membership options to the \du+ command
On Mon, Mar 6, 2023 at 12:43 AM Pavel Luzanov wrote: > Indeed, adding ADMIN to pg_has_role looks logical. The function will show > whether one role can manage another directly or indirectly (via SET ROLE). > FWIW I've finally gotten to publishing my beta version of the Role Graph for PostgreSQL pseudo-extension I'd been talking about: https://github.com/polobo/RoleGraphForPostgreSQL The administration column basically determines all this via a recursive CTE. I'm pondering how to incorporate some of this core material into it now for both cross-validation purposes and ease-of-use. I handle EMPTY explicitly in the Role Graph but as I noted somewhere in my comments, it really shouldn't be possible to leave the database in that state. Do we need to bug Robert on this directly or do you plan to have a go at it? Adding ADMIN will lead to the question of naming other values. It is more > reasonable to have INHERIT instead of USAGE. > And it is not very clear whether (except for backward compatibility) a > separate MEMBER value is needed at all. > There is an appeal to breaking things thoroughly here and removing both MEMBER and USAGE terms while adding ADMIN, SET, INHERIT. However, I am against that. Most everyday usage of this would only likely care about SET and INHERIT even going forward - administration of roles is distinct from how those roles generally behave at runtime. Breaking the later because we improved upon the former doesn't seem defensible. Thus, while adding ADMIN makes sense, keeping MEMBER (a.k.a., SET) and USAGE (a.k.a., INHERIT) is my suggested way forward. I'll be looking over your v3 patch sometime this week, if not today. David J.
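To anchor the naming discussion, this is the existing surface being extended; MEMBER and USAGE are what pg_has_role() accepts today, while ADMIN is the proposed addition (shown commented out because it does not exist yet). Role names are placeholders.

-- Existing membership/inheritance checks:
SELECT pg_has_role('alice', 'staff', 'MEMBER') AS member,
       pg_has_role('alice', 'staff', 'USAGE')  AS usage;

-- Proposed addition discussed above (not available today):
-- SELECT pg_has_role('alice', 'staff', 'ADMIN');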
Re: Add support for unit "B" to pg_size_pretty()
On Wed, 8 Mar 2023 at 09:22, Peter Eisentraut wrote: > Ok, I have fixed the original documentation to that effect and > backpatched it. Thanks for fixing that. David
pipe_read_line for reading arbitrary strings
When skimming through pg_rewind during a small review I noticed the use of pipe_read_line for reading arbitrary data from a pipe, the mechanics of which seemed odd. Commit 5b2f4afffe6 refactored find_other_exec() and broke out pipe_read_line() as a static convenience routine for reading a single line of output to catch a version number. Many years later, commit a7e8ece41 exposed it externally in order to read a GUC from postgresql.conf using "postgres -C ..". f06b1c598 also makes use of it for reading a version string much like find_other_exec(). Funnily enough, while now used for arbitrary string reading, the variable is still "pgver". Since the function requires passing a buffer/size, and at most size - 1 bytes will be read via fgets(), there is a truncation risk when using this for reading GUCs (like how pg_rewind does, though the risk there is slim to none). If we are going to continue using this for reading $stuff from pipes, maybe we should think about presenting a nicer API which removes that risk? Returning an allocated buffer which contains all the output, along the lines of the recent pg_get_line work, seems a lot nicer and safer IMO. The attached POC diff replaces fgets() with pg_get_line(), which may not be an OK way to cross the streams (it's clearly not a great fit), but as a POC it provided a neater interface for reading one-off lines from a pipe IMO. Does anyone else think this is worth fixing before too many callsites use it, or is this another case of my fear of silent subtle truncation bugs? =) -- Daniel Gustafsson pipe_read_line.diff Description: Binary data
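To make the proposed interface shape concrete, here is a rough standalone sketch of returning an allocated buffer for one line of pipe output. It is written against plain POSIX popen()/getline() rather than the actual libpq/common routines, and the function name and error handling are made up purely for illustration:

#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical sketch: read one line of output from a command and return it
 * in a buffer allocated to whatever size is needed, so no caller-supplied
 * size can silently truncate the result.
 */
static char *
pipe_read_line_alloc(const char *cmd)
{
	FILE	   *pipe;
	char	   *line = NULL;
	size_t		cap = 0;
	ssize_t		len;

	if ((pipe = popen(cmd, "r")) == NULL)
		return NULL;

	len = getline(&line, &cap, pipe);	/* getline() grows the buffer */
	pclose(pipe);

	if (len < 0)
	{
		free(line);				/* getline() may have allocated anyway */
		return NULL;
	}
	if (len > 0 && line[len - 1] == '\n')
		line[len - 1] = '\0';	/* strip the trailing newline, as today */
	return line;
}

int
main(void)
{
	char	   *str = pipe_read_line_alloc("postgres --version");

	if (str != NULL)
	{
		printf("read: %s\n", str);
		free(str);
	}
	return 0;
}

Because getline() grows the buffer itself, there is no caller-supplied size to get wrong, which is essentially the property the pg_get_line-based POC is after.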
PGDOCS - Replica Identity quotes
Hi, Here are some trivial quote changes to a paragraph describing REPLICA IDENTITY. These changes were previously made in another ongoing R.I. patch v28-0001 [1], but it was decided that since they are not strictly related to that patch they should be done separately. == logical-replication.sgml Section 31.1 Publication A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one. Another unique index (with certain additional requirements) can also be set to be the replica identity. If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible. If a replica identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to set the replica identity. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error on the publisher. INSERT operations can proceed regardless of any replica identity. ~~ Suggested changes: 1. The quoted "replica identity" should not be quoted -- this is the first time this term is used on this page, so I think it should be using the firstterm SGML tag, just the same as how publication looks at the top of this section. 2. The quoted "full" should also not be quoted. Replica identities are not specified using the text string "full" - they are specified as FULL (see [2]), so IMO these instances should be changed to FULL to eliminate that ambiguity. ~~~ PSA patch v1 which implements the above changes. -- [1] https://www.postgresql.org/message-id/CAA4eK1J8R-qS97cu27sF2%3DqzjhuQNkv%2BZvgaTzFv7rs-LA4c2w%40mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY Kind Regards, Peter Smith. Fujitsu Australia v1-0001-PGDOCS-replica-identity-quotes.patch Description: Binary data
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Mon, Mar 6, 2023 at 7:40 PM Amit Kapila wrote: > > On Mon, Mar 6, 2023 at 1:40 PM Peter Smith wrote: > > ... > > > > Anyhow, if you feel those firstterm and FULL changes ought to be kept > > separate from this RI patch, please let me know and I will propose > > those changes in a new thread, > > > > Personally, I would prefer to keep those separate. So, feel free to > propose them in a new thread. > Done. Those suggested pg docs changes now have their own new thread [1]. -- [1] RI quotes - https://www.postgresql.org/message-id/CAHut%2BPst11ac2hcmePt1%3DoTmBwTT%3DDAssRR1nsdoy4BT%2B68%3DMg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
On Tue, Mar 7, 2023 at 7:26 PM Pavel Borisov wrote: > On Tue, 7 Mar 2023, 10:10 Alexander Korotkov, wrote: >> I don't know what exactly Pavel meant, but average overall numbers for >> low concurrency are. >> master: 420401 (stddev of average 233) >> patchset v11: 420111 (stddev of average 199) >> The difference is less than 0.1% and that is very safely within the error. > > > Yes, the only thing that I meant is that for low-concurrency case the results > between patch and master are within the difference between repeated series of > measurements. So I concluded that the test can not prove any difference > between patch and master. > > I haven't meant or written there is some performance degradation. > > Alexander, I suppose did an extra step and calculated overall average and > stddev, from raw data provided. Thanks! Pavel, thank you for verifying this. Could you, please, rerun performance benchmarks for the v13? It introduces LazyTupleTableSlot, which shouldn't do any measurable impact on performance. But still. -- Regards, Alexander Korotkov
optimize several list functions with SIMD intrinsics
I noticed that several of the List functions do simple linear searches that can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in 37a6e5d). The following table shows the time spent iterating over a list of n elements (via list_member_int) one billion times on my x86 laptop. n | head (ms) | patched (ms) --+---+-- 2 | 3884 | 3001 4 | 5506 | 4092 8 | 6209 | 3026 16 | 8797 | 4458 32 | 25051 | 7032 64 | 37611 |12763 128 | 61886 |22770 256 |70 |59885 512 |209612 | 103378 1024 |407462 | 189484 I've attached a work-in-progress patch that implements these optimizations for both x86 and arm, and I will register this in the July commitfest. I'm posting this a little early in order to gauge interest. Presumably you shouldn't be using a List if you have many elements that must be routinely searched, but it might be nice to speed up these functions, anyway. This was mostly a fun weekend project, and I don't presently have any concrete examples of workloads where this might help. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 4e04f84766d98f9ba6bb6fdd03bcb431c8aad1d3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 4 Mar 2023 23:16:07 -0800 Subject: [PATCH v1 1/1] speed up several list functions with SIMD --- src/backend/nodes/list.c| 254 +++- src/include/port/pg_lfind.h | 189 +++ src/include/port/simd.h | 103 +++ 3 files changed, 513 insertions(+), 33 deletions(-) diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index a709d23ef1..acc56dddb7 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -19,6 +19,7 @@ #include "nodes/pg_list.h" #include "port/pg_bitutils.h" +#include "port/pg_lfind.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -680,11 +681,25 @@ list_member(const List *list, const void *datum) bool list_member_ptr(const List *list, const void *datum) { +#ifdef USE_NO_SIMD const ListCell *cell; +#endif Assert(IsPointerList(list)); check_list_invariants(list); +#ifndef USE_NO_SIMD + + Assert(sizeof(ListCell) == 8); + Assert(sizeof(void *) == 8); + + if (list == NIL) + return false; + + return pg_lfind64((uint64) datum, (uint64 *) list->elements, list->length); + +#else + foreach(cell, list) { if (lfirst(cell) == datum) @@ -692,46 +707,121 @@ list_member_ptr(const List *list, const void *datum) } return false; + +#endif } /* - * Return true iff the integer 'datum' is a member of the list. + * Optimized linear search routine (using SIMD intrinsics where available) for + * lists with inline 32-bit data. */ -bool -list_member_int(const List *list, int datum) +static inline bool +list_member_inline_internal(const List *list, uint32 datum) { + uint32 i = 0; const ListCell *cell; - Assert(IsIntegerList(list)); - check_list_invariants(list); +#ifndef USE_NO_SIMD - foreach(cell, list) + /* + * For better instruction-level parallelism, each loop iteration operates + * on a block of four registers. 
+ */ + const Vector32 keys = vector32_broadcast(datum); /* load copies of key */ + const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32); + const uint32 nelem_per_iteration = 4 * nelem_per_vector; + const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0x)); + const uint32 *elements = (const uint32 *) list->elements; + + /* round down to multiple of elements per iteration */ + const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1); + + /* + * The SIMD-optimized portion of this routine is written with the + * expectation that the 32-bit datum we are searching for only takes up + * half of a ListCell. If that changes, this routine must change, too. + */ + Assert(sizeof(ListCell) == 8); + + for (i = 0; i < tail_idx; i += nelem_per_iteration) + { + Vector32 vals1, + vals2, + vals3, + vals4, + result1, + result2, + result3, + result4, + tmp1, + tmp2, + result, + masked; + + /* load the next block into 4 registers */ + vector32_load(&vals1, &elements[i]); + vector32_load(&vals2, &elements[i + nelem_per_vector]); + vector32_load(&vals3, &elements[i + nelem_per_vector * 2]); + vector32_load(&vals4, &elements[i + nelem_per_vector * 3]); + + /* compare each value to the key */ + result1 = vector32_eq(keys, vals1); + result2 = vector32_eq(keys, vals2); + result3 = vector32_eq(keys, vals3); + result4 = vector32_eq(keys, vals4); + + /* combine the results into a single variable */ + tmp1 = vector32_or(result1, result2); + tmp2 = vector32_or(result3, result4); + result = vector32_or(tmp1, tmp2); + + /* filter out matches in space between data */ + masked = vector32_and(re
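As a rough illustration of the blocked-compare idea described above — and not the patch itself, which goes through simd.h and additionally has to mask off the unused half of each ListCell — a plain SSE2 search over a simple uint32 array might look like the following sketch (x86-64 assumed):

/*
 * Illustration only: blocked SIMD linear search over a plain uint32 array,
 * written directly against SSE2 intrinsics.  Four 4-element registers are
 * compared per iteration and the results OR'd together, mirroring the
 * structure of the patch without its ListCell masking.
 */
#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool
contains_u32(const uint32_t *vals, size_t n, uint32_t key)
{
	const __m128i keys = _mm_set1_epi32((int) key); /* broadcast the key */
	size_t		i = 0;

	/* process 4 x 4 = 16 elements per iteration */
	for (; i + 16 <= n; i += 16)
	{
		__m128i		v1 = _mm_loadu_si128((const __m128i *) &vals[i]);
		__m128i		v2 = _mm_loadu_si128((const __m128i *) &vals[i + 4]);
		__m128i		v3 = _mm_loadu_si128((const __m128i *) &vals[i + 8]);
		__m128i		v4 = _mm_loadu_si128((const __m128i *) &vals[i + 12]);
		__m128i		r1 = _mm_cmpeq_epi32(v1, keys);
		__m128i		r2 = _mm_cmpeq_epi32(v2, keys);
		__m128i		r3 = _mm_cmpeq_epi32(v3, keys);
		__m128i		r4 = _mm_cmpeq_epi32(v4, keys);
		__m128i		any = _mm_or_si128(_mm_or_si128(r1, r2),
									   _mm_or_si128(r3, r4));

		if (_mm_movemask_epi8(any) != 0)
			return true;		/* at least one lane compared equal */
	}

	/* scalar tail for whatever is left over */
	for (; i < n; i++)
	{
		if (vals[i] == key)
			return true;
	}
	return false;
}

int
main(void)
{
	uint32_t	a[100];

	for (uint32_t j = 0; j < 100; j++)
		a[j] = j * 3;
	return contains_u32(a, 100, 63) ? 0 : 1;	/* 63 = 21 * 3, so found */
}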
Re: optimize several list functions with SIMD intrinsics
On Wed, 8 Mar 2023 at 13:25, Nathan Bossart wrote: > I've attached a work-in-progress patch that implements these optimizations > for both x86 and arm, and I will register this in the July commitfest. I'm > posting this a little early in order to gauge interest. Interesting and quite impressive performance numbers. From having a quick glance at the patch, it looks like you'll need to take some extra time to make it work on 32-bit builds. David
Re: add PROCESS_MAIN to VACUUM
On Tue, Mar 07, 2023 at 12:55:08PM -0800, Nathan Bossart wrote: > On Tue, Mar 07, 2023 at 12:39:29PM -0500, Melanie Plageman wrote: >> Yes, sounds clear to me also! > > Here is an updated patch. Fine by me, so done. (I have cut a few words from the comment, without changing its meaning.) -- Michael signature.asc Description: PGP signature
Re: shoud be get_extension_schema visible?
On Mon, Mar 06, 2023 at 04:44:59PM +0900, Michael Paquier wrote: > I can see why you'd want that, so OK from here to provide this routine > for external consumption. Let's first wait a bit and see if others > have any kind of objections or comments. Done this one as of e20b1ea. -- Michael signature.asc Description: PGP signature
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi, On 2023-03-07 04:45:32 +0300, Alexander Korotkov wrote: > The second patch now implements a concept of LazyTupleTableSlot, a slot > which gets allocated only when needed. Also, there is more minor > refactoring and more comments. This patch already is pretty big for what it actually improves. Introducing even infrastructure to get a not that big win, in a not particularly interesting, extreme, workload... What is motivating this? Greetings, Andres Freund
Re: buildfarm + meson
Hi, On 2023-03-07 15:47:54 -0500, Andrew Dunstan wrote: > On 2023-03-07 Tu 14:37, Andres Freund wrote: > > The failures are like this: > > > > +ERROR: extension "dummy_index_am" is not available > > +DETAIL: Could not open extension control file > > "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": > > No such file or directory. > > +HINT: The extension must first be installed on the system where > > PostgreSQL is running. > > > > I assume this is in an interaction with b6a0d469cae. > > > > > > I think we need a install-test-modules or such that installs into the normal > > directory. > > > > Exactly. Here's a prototype for that. It adds an install-test-files target, Because we want to install into a normal directory, I removed the necessary munging of the target paths from meson.build and moved it into install-test-files. I also added DESTDIR support, so that installing can redirect the directory if desired. That's used for the tmp_install/ installation now. I didn't like the number of arguments necessary for install_test_files, so I changed it to use --install target list of files which makes it easier to use for further directories, if/when we need them. Greetings, Andres Freund >From fe6c77903696ccf03b618fb6f91119a12bea4a2f Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 7 Mar 2023 16:14:18 -0800 Subject: [PATCH v1] meson: Add target for installing test files & improve install_test_files --- meson.build | 32 ++-- src/tools/install_test_files | 25 +++-- 2 files changed, 33 insertions(+), 24 deletions(-) diff --git a/meson.build b/meson.build index 87cb974ad7c..6ce24207fb5 100644 --- a/meson.build +++ b/meson.build @@ -2830,6 +2830,22 @@ generated_sources_ac += {'': ['GNUmakefile']} testprep_targets += test_install_libs +# command to install files used for tests, which aren't installed by default +install_test_files = files('src/tools/install_test_files') +install_test_files_args = [ + install_test_files, + '--prefix', dir_prefix, + '--install', contrib_data_dir, test_install_data, + '--install', dir_lib_pkg, test_install_libs, +] + +# Target installing files required for installcheck of various modules +run_target('install-test-files', + command: [python] + install_test_files_args, + depends: testprep_targets, +) + + # If there are any files in the source directory that we also generate in the # build directory, they might get preferred over the newly generated files, # e.g. 
because of a #include "file", which always will search in the current @@ -2922,21 +2938,9 @@ test('tmp_install', is_parallel: false, suite: ['setup']) -# get full paths of test_install_libs to copy them -test_install_libs_fp = [] -foreach lib: test_install_libs - test_install_libs_fp += lib.full_path() -endforeach - -install_test_files = files('src/tools/install_test_files') test('install_test_files', -python, args: [ - install_test_files, - '--datadir', test_install_location / contrib_data_args['install_dir'], - '--libdir', test_install_location / dir_lib_pkg, - '--install-data', test_install_data, - '--install-libs', test_install_libs_fp, -], +python, +args: install_test_files_args + ['--destdir', test_install_destdir], priority: setup_tests_priority, is_parallel: false, suite: ['setup']) diff --git a/src/tools/install_test_files b/src/tools/install_test_files index e6ecdae10f8..8e0b36a74d1 100644 --- a/src/tools/install_test_files +++ b/src/tools/install_test_files @@ -6,23 +6,28 @@ import argparse import shutil import os +from pathlib import PurePath parser = argparse.ArgumentParser() -parser.add_argument('--datadir', type=str) -parser.add_argument('--libdir', type=str) -parser.add_argument('--install-data', type=str, nargs='*') -parser.add_argument('--install-libs', type=str, nargs='*') +parser.add_argument('--destdir', type=str, default=os.environ.get('DESTDIR', None)) +parser.add_argument('--prefix', type=str) +parser.add_argument('--install', type=str, nargs='+', action='append') args = parser.parse_args() +def copy_files(prefix: str, destdir: str, targetdir: str, src_list: list): +if not os.path.isabs(targetdir): +targetdir = os.path.join(prefix, targetdir) -def copy_files(src_list: list, dest: str): -os.makedirs(dest, exist_ok=True) +if destdir is not None: +# copy of meson's logic for joining destdir and install paths +targetdir = str(PurePath(destdir, *PurePath(targetdir).parts[1:])) + +os.makedirs(targetdir, exist_ok=True) for src in src_list: -shutil.copy2(src, dest) +shutil.copy2(src, targetdir) - -copy_files(args.install_data, args.datadir) -copy_files(args.install_libs, args.libdir) +for installs in args.install: +copy_files(args.prefix, args.destdir, installs[0], installs[1:]) -- 2.38.0
Re: Add LZ4 compression in pg_dump
On Wed, Mar 01, 2023 at 04:52:49PM +0100, Tomas Vondra wrote: > Thanks. That seems correct to me, but I find it somewhat confusing, > because we now have > > DeflateCompressorInit vs. InitCompressorGzip > > DeflateCompressorEnd vs. EndCompressorGzip > > DeflateCompressorData - The name doesn't really say what it does (would > be better to have a verb in there, I think). > > I wonder if we can make this somehow clearer? To move things along, I updated Georgios' patch: Rename DeflateCompressorData() to DeflateCompressorCommon(); Rearrange functions to their original order allowing a cleaner diff to the prior code; Change pg_fatal() to an assertion+comment; Update the commit message and fix a few typos; > Also, InitCompressorGzip says this: > >/* > * If the caller has defined a write function, prepare the necessary > * state. Avoid initializing during the first write call, because End > * may be called without ever writing any data. > */ > if (cs->writeF) > DeflateCompressorInit(cs); > > Does it actually make sense to not have writeF defined in some cases? InitCompressor is being called for either reading or writing, either of which could be null: src/bin/pg_dump/pg_backup_custom.c: ctx->cs = AllocateCompressor(AH->compression_spec, src/bin/pg_dump/pg_backup_custom.c- NULL, src/bin/pg_dump/pg_backup_custom.c- _CustomWriteFunc); -- src/bin/pg_dump/pg_backup_custom.c: cs = AllocateCompressor(AH->compression_spec, src/bin/pg_dump/pg_backup_custom.c- _CustomReadFunc, NULL); It's confusing that the comment says "Avoid initializing...". What it really means is "Initialize eagerly...". But that makes more sense in the context of the commit message for this bugfix than in a comment. So I changed that too. + /* If deflation was initialized, finalize it */ + if (cs->private_data) + DeflateCompressorEnd(AH, cs); Maybe it'd be more clear if this used "if (cs->writeF)", like in the init function ? -- Justin >From 5c027aa86e8591db140093c48a58aafba7a6c28c Mon Sep 17 00:00:00 2001 From: Georgios Kokolatos Date: Wed, 1 Mar 2023 12:42:32 + Subject: [PATCH] pg_dump: fix gzip compression of empty data When creating dumps with the Compressor API, it is possible to only call the Allocate and End compressor functions without ever writing any data. Since e9960732a, the gzip implementation wrongly assumed that the write function would always be called and deferred the initialization of the internal compression system until the first write call. The problem with that approach is that the End call would not finalize the internal compression system if it hadn't been initialized. This commit initializes the internal compression system during the Allocate call, whenever a write function was provided by the caller. Given that decompression does not need to keep track of any state, the compressor's private_data member is now populated only during compression. In passing, rearrange the functions to their original order, to allow usefully comparing with the previous code, like: git diff --diff-algorithm=minimal -w e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c Also replace an unreachable pg_fatal() with an assert+comment. I (Justin) argued that the new fatal shouldn't have been introduced in a refactoring commit, so this is a compromise. Report and initial patch by Justin Pryzby, test case by Georgios Kokolatos. 
https://www.postgresql.org/message-id/20230228235834.GC30529%40telsasoft.com --- src/bin/pg_dump/compress_gzip.c | 137 ++- src/bin/pg_dump/t/002_pg_dump.pl | 23 ++ 2 files changed, 101 insertions(+), 59 deletions(-) diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c index 0af65afeb4e..3c9ac55c266 100644 --- a/src/bin/pg_dump/compress_gzip.c +++ b/src/bin/pg_dump/compress_gzip.c @@ -32,9 +32,75 @@ typedef struct GzipCompressorState size_t outsize; } GzipCompressorState; + /* Private routines that support gzip compressed data I/O */ +static void DeflateCompressorInit(CompressorState *cs); +static void DeflateCompressorEnd(ArchiveHandle *AH, CompressorState *cs); +static void DeflateCompressorCommon(ArchiveHandle *AH, CompressorState *cs, + bool flush); +static void EndCompressorGzip(ArchiveHandle *AH, CompressorState *cs); +static void WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs, + const void *data, size_t dLen);
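For readers not steeped in zlib, the shape of the bug is easiest to see in a standalone sketch: if the deflate stream is initialized eagerly, finishing it with zero bytes of input still yields a valid (empty) gzip stream, whereas an End call without a prior Init has nothing to finalize. The calls below are plain zlib, not pg_dump's wrappers:

/*
 * Standalone zlib sketch: initialize the compressor up front, feed it no
 * input at all, and finish.  The result is still a valid, empty gzip
 * stream.  If deflateInit2() were deferred until the first write -- which
 * never happens here -- the finalization step would have nothing to work on.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
	z_stream	zs;
	unsigned char out[64];

	memset(&zs, 0, sizeof(zs));
	/* windowBits 15 + 16 asks zlib to emit a gzip header and trailer */
	if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
					 15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
		return 1;

	zs.next_in = NULL;			/* no input at all */
	zs.avail_in = 0;
	zs.next_out = out;
	zs.avail_out = sizeof(out);
	if (deflate(&zs, Z_FINISH) != Z_STREAM_END)
		return 1;

	printf("empty gzip stream: %lu bytes\n",
		   (unsigned long) (sizeof(out) - zs.avail_out));
	return (deflateEnd(&zs) == Z_OK) ? 0 : 1;
}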
Re: buildfarm + meson
Hi, On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote: > Yeah. For touch I think we can probably just get rid of this line in the > root meson.build: > > touch = find_program('touch', native: true) Yep. > For cp there doesn't seem to be a formal requirement, but there is a recipe > in src/common/unicode/meson.build that uses it, maybe that's what caused the > failure. On Windows/msvc we could just use copy instead, I think. I don't know about using copy, it's very easy to get into trouble due to interpreting forward slashes as options etc. I propose that for now we just don't support update-unicode if cp isn't available - just as already not available when wget isn't available. Planning to apply something like the attached soon, unless somebody opposes that plan. Other unix tools we have a hard requirement on right now: - sed - would be pretty easy to replace with something else - tar, gzip - just for tests I'm not sure it's worth working on not requiring those. There's also flex, bison, perl, but those will stay a hard requirement for a while longer... :) Greetings, Andres Freund >From 564092bfb4c108c387cac3562a7dbad8c3126fea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 7 Mar 2023 18:24:18 -0800 Subject: [PATCH] meson: don't require 'touch' binary, make use of 'cp' optional --- meson.build| 2 +- src/common/unicode/meson.build | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/meson.build b/meson.build index 87cb974ad7c..f1ce4cb8e03 100644 --- a/meson.build +++ b/meson.build @@ -333,11 +333,11 @@ prove = find_program(get_option('PROVE'), native: true, required: false) tar = find_program(get_option('TAR'), native: true) gzip = find_program(get_option('GZIP'), native: true) program_lz4 = find_program(get_option('LZ4'), native: true, required: false) -touch = find_program('touch', native: true) openssl = find_program(get_option('OPENSSL'), native: true, required: false) program_zstd = find_program(get_option('ZSTD'), native: true, required: false) dtrace = find_program(get_option('DTRACE'), native: true, required: get_option('dtrace')) missing = find_program('config/missing', native: true) +cp = find_program('cp', required: false, native: true) bison_flags = [] if bison.found() diff --git a/src/common/unicode/meson.build b/src/common/unicode/meson.build index 63c81664de3..1ffece1550a 100644 --- a/src/common/unicode/meson.build +++ b/src/common/unicode/meson.build @@ -5,7 +5,7 @@ UNICODE_VERSION = '15.0.0' unicode_data = {} unicode_baseurl = 'https://www.unicode.org/Public/@0@/ucd/@1@' -if not wget.found() +if not wget.found() or not cp.found() subdir_done() endif @@ -100,7 +100,7 @@ update_unicode = custom_target('update-unicode', depends: update_unicode_dep, output: ['dont-exist'], input: update_unicode_targets, - command: ['cp', '@INPUT@', '@SOURCE_ROOT@/src/include/common/'], + command: [cp, '@INPUT@', '@SOURCE_ROOT@/src/include/common/'], build_by_default: false, build_always_stale: true, ) -- 2.38.0
Re: psql: Add role's membership options to the \du+ command
On Tue, Mar 7, 2023 at 2:02 PM David G. Johnston wrote: > > I'll be looking over your v3 patch sometime this week, if not today. > > Moving the goal posts for this meta-command to >= 9.5 seems like it should be done as a separate patch and thread. The documentation presently states we are targeting 9.2 and newer. My suggestion for the docs is below. I find saying "additional information is shown...currently this adds the comment". Repeating that "+" means (show more) everywhere seems excessive, just state what those "more" things are. I consider \dFp and \dl to be good examples in this regard. I also think that "Wall of text" doesn't serve us well. See \dP for permission to use paragraphs. I didn't modify \du to match; keeping those in sync (as opposed to having \du just say "see \dg") seems acceptable. You had the direction of membership wrong in your copy: "For each membership in the role" describes the reverse of "Member of" which is what the column is. The actual format template is constructed properly. --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1727,15 +1727,18 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g S modifier to include system roles. If pattern is specified, only those roles whose names match the pattern are listed. -For each membership in the role, the membership options and -the role that granted the membership are displayed. -Оne-letter abbreviations are used for membership options: -a — admin option, i — inherit option, -s — set option and empty if no one is set. -See GRANT command for their meaning. -If the form \dg+ is used, additional information -is shown about each role; currently this adds the comment for each -role. + + +Shown within each row, in newline-separated format, are the memberships granted to +the role. The presentation includes both the name of the grantor +as well as the membership permissions (in an abbreviated format: +a for admin option, i for inherit option, +s for set option.) The word empty is printed in +the case that none of those permissions are granted. +See the GRANT command for their meaning. + + +If the form \dg+ is used the comment attached to the role is shown. I would suggest tweaking the test output to include regress_du_admin and also to make regress_du_admin a CREATEROLE role with LOGIN. I'll need to update the Role Graph View to add the spaces and swap the order of the "s" and "i" symbols. David J.
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 9:47 PM Önder Kalacı wrote: > > I'm attaching v35. > I noticed that if the index column only exists on the subscriber side, this index can also be chosen. This seems a bit odd because the index column isn't sent from the publisher. e.g. -- pub CREATE TABLE test_replica_id_full (x int, y int); ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL; CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full; -- sub CREATE TABLE test_replica_id_full (x int, y int, z int); CREATE INDEX test_replica_id_full_idx ON test_replica_id_full(z); CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres port=5432' PUBLICATION tap_pub_rep_full; I didn't see the behavior change in any of the cases I tried after applying the patch, which looks good. Besides, I tested the performance for such a case. Steps: 1. create tables, index, publication, and subscription -- pub create table tbl (a int); alter table tbl replica identity full; create publication pub for table tbl; -- sub create table tbl (a int, b int); create index idx_b on tbl(b); create subscription sub connection 'dbname=postgres port=5432' publication pub; 2. setup synchronous replication 3. execute SQL: truncate tbl; insert into tbl select i from generate_series(0,1)i; update tbl set a=a+1; The time of UPDATE (take the average of 10 runs): master: 1356.06 ms patched: 3968.14 ms For the case where all values of the extra columns on the subscriber are NULL, an index scan can't do better than a sequential scan. This is not a realistic scenario, and I think it only degrades when there are many NULL values in the index column, so this is probably not a case to worry about. I am just sharing this case so that we can discuss whether we should pick an index which only contains the extra columns on the subscriber. Regards, Shi Yu
RE: Rework LogicalOutputPluginWriterUpdateProgress
On Tue, Mar 7, 2023 15:55 PM Kuroda, Hayato/黒田 隼人 wrote: > Dear Wang, > > Thank you for updating the patch! Followings are my comments. Thanks for your comments. > --- > 01. missing comments > You might miss the comment from Peter[1]. Or could you pin the related one? Since I think the functions WalSndPrepareWrite and WalSndWriteData have similar parameters and the HEAD has no related comments, I'm not sure whether we should add them in this patch, or in a separate patch to comment atop these callback functions or where they are called. > --- > 02. LogicalDecodingProcessRecord() > > Don't we have to call UpdateDecodingProgressAndKeepalive() when there is no > decoding function? Assuming that the timeout parameter does not have enough > time > period and there are so many sequential operations in the transaction. At that > time > there may be a possibility that timeout is occurred while calling > ReorderBufferProcessXid() > several times. It may be a bad example, but I meant to say that we may have > to > consider the case that decoding function has not implemented yet. I think it's ok in this function. If the decoding function has not been implemented for a record, I think we quickly return to the loop in the function WalSndLoop, where it will try to send the keepalive message. BTW, in the previous discussion [1], we decided to ignore some paths, because the gain from modifying them may not be so great. > --- > 03. stream_*_cb_wrapper > > Only stream_*_cb_wrapper have comments "don't call update progress, we > didn't really make any", but > there are more functions that does not send updates. Do you have any reasons > why only they have? Added this comment to more functions. I think the following six functions don't call the function UpdateProgressAndKeepalive in v5 patch: - begin_cb_wrapper - begin_prepare_cb_wrapper - startup_cb_wrapper - shutdown_cb_wrapper - filter_prepare_cb_wrapper - filter_by_origin_cb_wrapper I think the comment you mentioned means that no new progress needs to be updated in this *_cb_wrapper. Also, I think we don't need to update the progress at the beginning of a transaction, just like in HEAD. So, I added the same comment only in the 4 functions below: - startup_cb_wrapper - shutdown_cb_wrapper - filter_prepare_cb_wrapper - filter_by_origin_cb_wrapper Attach the new patch. [1] - https://www.postgresql.org/message-id/20230213180302.u5sqosteflr3zkiz%40awork3.anarazel.de Regards, Wang wei v6-0001-Rework-LogicalOutputPluginWriterUpdateProgress.patch Description: v6-0001-Rework-LogicalOutputPluginWriterUpdateProgress.patch
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 7:17 PM Önder Kalacı wrote: > >> >> > > Let me give an example to demonstrate why I thought something is fishy >> > > here: >> > > >> > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. >> > > Imagine the same rel has a PRIMARY KEY with Oid=. >> > > > > > Hmm, alright, this is syntactically possible, but not sure if any user > would do this. Still thanks for catching this. > > And, you are right, if a user has created such a schema, > IdxIsRelationIdentityOrPK() > would return the wrong result and we'd use sequential scan instead of index > scan. > This would be a regression. I think we should change the function. > > > Here is the example: > DROP TABLE tab1; > CREATE TABLE tab1 (a int NOT NULL); > CREATE UNIQUE INDEX replica_unique ON tab1(a); > ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; > ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); > You have not given complete steps to reproduce the problem where instead of the index scan, a sequential scan would be picked. I have tried to reproduce by extending your steps but didn't see the problem. Let me know if I am missing something. Publisher postgres=# CREATE TABLE tab1 (a int NOT NULL); CREATE TABLE postgres=# Alter Table tab1 replica identity full; ALTER TABLE postgres=# create publication pub2 for table tab1; CREATE PUBLICATION postgres=# insert into tab1 values(1); INSERT 0 1 postgres=# update tab1 set a=2; Subscriber - postgres=# CREATE TABLE tab1 (a int NOT NULL); CREATE TABLE postgres=# CREATE UNIQUE INDEX replica_unique ON tab1(a); CREATE INDEX postgres=# ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; ALTER TABLE postgres=# ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); ALTER TABLE postgres=# create subscription sub2 connection 'dbname=postgres' publication pub2; NOTICE: created replication slot "sub2" on publisher CREATE SUBSCRIPTION postgres=# select * from tab1; a --- 2 (1 row) I have debugged the above example and it uses an index scan during apply without your latest change which is what I expected. AFAICS, the use of IdxIsRelationIdentityOrPK() is to decide whether we will do tuples_equal() or not during the index scan and I see it gives the correct results with the example you provided. -- With Regards, Amit Kapila.
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Here are some review comments for v35-0001 == General 1. Saying the index "should" or "should not" do this or that sounds like it is still OK but just not recommended. TO remove this ambigity IMO most of the "should" ought to be changed to "must" because IIUC this patch will simply not consider indexes which do not obey all your rules. This comment applies to a few places (search for "should") e.g.1. - Commit Message e.g.2. - /* There should always be at least one attribute for the index scan. */ e.g.3. - The function comment for FindUsableIndexForReplicaIdentityFull(Relation localrel) == doc/src/sgml/logical-replication.sgml 2. A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one. Another unique index (with certain additional requirements) can also be set to be the replica identity. If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. When replica identity “full” is specified, indexes can be used on the subscriber side for searching the rows. Candidate indexes must be btree, non-partial, and have at least one column reference (i.e. cannot consist of only expressions). These restrictions on the non-unique index properties adheres some of the restrictions that are enforced for primary keys. Internally, we follow a similar approach for supporting index scans within logical replication scope. If there are no such suitable indexes, the search on the subscriber side can be very inefficient, therefore replica identity “full” should only be used as a fallback if no other solution is possible. If a replica identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to set the replica identity. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error on the publisher. INSERT operations can proceed regardless of any replica identity. ~ "adheres some of" --> "adhere to some of" ? == src/backend/executor/execReplication.c 3. build_replindex_scan_key { Oid operator; Oid opfamily; RegProcedure regop; - int pkattno = attoff + 1; - int mainattno = indkey->values[attoff]; - Oid optype = get_opclass_input_type(opclass->values[attoff]); + int table_attno = indkey->values[index_attoff]; + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); These variable declarations might look tidier if you kept all the Oids together. == src/backend/replication/logical/relation.c 4. logicalrep_partition_open + /* + * Finding a usable index is an infrequent task. It occurs when an + * operation is first performed on the relation, or after invalidation of + * the relation cache entry (such as ANALYZE or CREATE/DROP index on the + * relation). + * + * We also prefer to run this code on the oldctx such that we do not + * leak anything in the LogicalRepPartMapContext (hence + * CacheMemoryContext). + */ + entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel) "such that" --> "so that" ? ~~~ 5. 
IsIndexUsableForReplicaIdentityFull +bool +IsIndexUsableForReplicaIdentityFull(IndexInfo *indexInfo) +{ + bool is_btree = (indexInfo->ii_Am == BTREE_AM_OID); + bool is_partial = (indexInfo->ii_Predicate != NIL); + bool is_only_on_expression = IsIndexOnlyOnExpression(indexInfo); + + if (is_btree && !is_partial && !is_only_on_expression) + { + return true; + } + + return false; +} SUGGESTION (no need for 2 returns here) return is_btree && !is_partial && !is_only_on_expression; == src/backend/replication/logical/worker.c 6. FindReplTupleInLocalRel static bool FindReplTupleInLocalRel(EState *estate, Relation localrel, LogicalRepRelation *remoterel, Oid localidxoid, TupleTableSlot *remoteslot, TupleTableSlot **localslot) { bool found; /* * Regardless of the top-level operation, we're performing a read here, so * check for SELECT privileges. */ TargetPrivilegesCheck(localrel, ACL_SELECT); *localslot = table_slot_create(localrel, &estate->es_tupleTable); Assert(OidIsValid(localidxoid) || (remoterel->replident == REPLICA_IDENTITY_FULL)); if (OidIsValid(localidxoid)) found = RelationFindReplTupleByIndex(localrel, localidxoid, LockTupleExclusive, remoteslot, *localslot); else found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, remoteslot, *localslot); return found; } ~ Since that 'found' variable is not used, you might as well remove the if/else and simplify the code. SUGGESTION static bool FindReplTupleInLocalRel(EState *estate, Relation localrel, LogicalRepRelation *remoterel, Oid localidxoid, TupleTableSlot *remoteslot, TupleTableSl
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Hi, Thomas! On 04.03.2023 00:39, Thomas Munro wrote: It seems a good topic for a separate thread patch. Would you provide a link to the thread you mentioned please? https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com Thanks! The important words there: But I fail to see how full_page_writes prevents this since it only act on writes It ensures the damage is later repaired during WAL replay. Which can only happen if the WAL contains the necessary information to do so - the full page writes. Together with the docs about restoring corrupted pg_control in the pg_resetwal utility description these words made me think about whether to save the contents of pg_control at the beginning and end of checkpoint into WAL and teach pg_resetwal to read them? It would be like a periodic backup of the pg_control to a safe place. This thought has nothing to do with this patch and this thread, and, in general, i'm not sure if it has any practical meaning, and whether, on the contrary, it may lead to some difficulties. If it seems that there is a sense, then it will be possible to think further about it. For example, see subscription/011_generated which failed like this: 2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not read file "global/pg_control": Permission denied That was fixed after I changed it to also do locking in xlog.c ReadControlFile(), in the version you tested. There must be something I don't understand going on, because that cluster wasn't even running before: it had just been created by initdb. But, anyway, I have a new idea that makes that whole problem go away (though I'd still like to understand what happened there): Seems to be it's a race between the first reading of the pg_control in PostmasterMain() in LocalProcessControlFile(false) and the second one in SubPostmasterMain() here: /* * (re-)read control file, as it contains config. The postmaster will * already have read this, but this process doesn't know about that. */ LocalProcessControlFile(false); which crashes according to the crash log: crashlog-postgres.exe_19a0_2023-02-16_06-57-26-675.txt With the "pseudo-advisory lock for Windows" idea from the last email, we don't need to bother with file system level locking in many places. Just in controldata_utils.c, for FE/BE interlocking (that is, we don't need to use that for interlocking of concurrent reads and writes that are entirely in the backend, because we already have an LWLock that we could use more consistently). Changes: 1. xlog.c mostly uses no locking 2. basebackup.c now acquires ControlFileLock 3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking 4. lock past the end (pseudo-advisory locking for Windows) Although the changes in 1. contributes to the problem described above again, but 4. fixes this. And i did not find any other places where ReadControlFile() can be called in different processes.That's all ok. Thanks to 4., now it is not necessary to use VSS to copy the pg_control file, it can be copied in a common way even during the torture test. This is very good. I really like the idea with LWLock where possible. In general, i think that these changes make the patch more lightweight and fast. Also i ran tests for more than a day in stress mode with fsync=off under windows and Linux and found no problems. Patch-tester also passes without errors. 
I would like to move this patch to RFC, since I don’t see any problems either in the code or in the tests, but the pg_backup_start() subproblem confuses me. Maybe move it to a separate patch in a distinct thread? There are a number of suggestions and questions to discuss there, such as: Note that when we recover from a basebackup or pg_backup_start() backup, we use the backup label to find a redo start location in the WAL (see read_backup_label()), BUT we still read the copied pg_control file (one that might be too "new", so we don't use its redo pointer). So it had better not be torn, or the recovery will fail. So, in this version I protected that sendFile() with ControlFileLock. But... Isn't that a bit strange? To go down this path we would also need to document the need to copy the control file with the file locked to avoid a rare failure, in the pg_backup_start() documentation. That's annoying (I don't even know how to do it with easy shell commands; maybe we should provide a program that locks and cats the file...?). The variant with a separate utility looks good, with a recommendation in the docs to use it for copying pg_control. It also seems possible to make a function available in psql, such as export_pg_control('dst_path') with the destination path as an argument, and to call it before pg_backup_stop(). Or pass the pg_control destination path to pg_backup_stop() as an extra argument. Or save pg_control to a predetermined location during pg_backup_stop() and specify in the do
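The "program that locks and cats the file" idea could, on POSIX systems at least, be roughly as small as the following sketch. It is purely illustrative: the real helper would have to follow whatever locking protocol the backend settles on, and Windows would need LockFileEx rather than fcntl():

/*
 * Sketch of a "lock and cat" helper for pg_control, POSIX-only: take a
 * shared fcntl() lock on the whole file before copying it to stdout, so a
 * writer holding the corresponding exclusive lock cannot hand us a torn
 * read.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int			fd;
	struct flock lk = {0};		/* l_start = 0, l_len = 0: whole file */
	char		buf[8192];
	ssize_t		n;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <path to global/pg_control>\n", argv[0]);
		return 1;
	}
	if ((fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	lk.l_type = F_RDLCK;		/* shared (read) lock */
	lk.l_whence = SEEK_SET;
	if (fcntl(fd, F_SETLKW, &lk) < 0)	/* block until any writer is done */
		return 1;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
	{
		if (write(STDOUT_FILENO, buf, n) != n)
			return 1;
	}
	close(fd);					/* closing the descriptor drops the lock */
	return (n < 0) ? 1 : 0;
}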
Re: [Proposal] Add foreign-server health checks infrastructure
On Tue, 7 Mar 2023 at 09:53, Hayato Kuroda (Fujitsu) wrote: > > Dear Katsuragi-san, > > Thank you for reviewing! PSA new version patches. > > > I think we can update the status to ready for committer after > > this fix, if there is no objection. > > That's a very good news for me! How about other people? > > > >> 7. the document of postgres_fdw_verify_connection_states_all > > >> NULL > > >> + is returned if the local session does not have connection > > >> caches, > > >> or this > > >> + function is not available on this platform. > > >> > > >> I think there is a case where a connection cache exists but valid > > >> connections do not exist and NULL is returned (disconnection case). > > >> Almost the same document as the postgres_fdw_verify_connection_states > > >> case (comment no.5) seems better. > > > > > > Yes, but completely same statement cannot be used because these is not > > > specified foreign server. How about: > > > NULL is returned if there are no established connections or this > > > function ... > > > > Yes, to align with the postgres_fdw_verify_connection_states() > > case, how about writing the connection is not valid. Like the > > following? > > 'NULL is returned if no valid connections are established or > > this function...' > > Prefer yours, fixed. > > > This is my comment for v35. Please check. > > 0002: > > 1. the comment of verify_cached_connections (I missed one minor point.) > > + * This function emits warnings if a disconnection is found. This > > returns true > > + * if disconnections cannot be found, otherwise returns false. > > > > I think false is returned only if disconnections are found and > > true is returned in all other cases. So, modifying the description > > like the following seems better. > > 'This returns false if disconnections are found, otherwise > > returns true.' > > Fixed. Few comments: 1) There is no handling of forConnCheck in #else HAVE_POLL, if this is intentional we could add some comments for the same: static int -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) +pqSocketPoll(int sock, int forRead, +int forWrite, int forConnCheck, time_t end_time) { /* We use poll(2) if available, otherwise select(2) */ #ifdef HAVE_POLL @@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) int timeout_ms; if (!forRead && !forWrite) - return 0; 2) Can this condition be added to the parent if condition: if (!forRead && !forWrite) - return 0; + { + /* Connection check can be available on some limted platforms */ + if (!(forConnCheck && PQconnCheckable())) + return 0; + } 3) Can we add a comment for PQconnCheckable: +/* Check whether the postgres server is still alive or not */ +extern int PQconnCheck(PGconn *conn); +extern int PQconnCheckable(void); + 4) "Note that if 0 is returned and forConnCheck is requested" doesn't sound right, it can be changed to "Note that if 0 is returned when forConnCheck is requested" + * If neither forRead, forWrite nor forConnCheck are set, immediately return a + * timeout condition (without waiting). Return >0 if condition is met, 0 if a + * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is + * returned and forConnCheck is requested, it means that the socket has not + * matched POLLRDHUP event and the connection is not closed by the socket peer. Regards, Vignesh
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart wrote: > > On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > > On the other hand, we already have a similar problem with > > recovery_min_apply_delay combined with hot_standby_feedback [1]. > > So, that probably is an acceptable trade-off for the pgsql-hackers. > > If you use this feature, you should be even more careful. > > Yes, but it's possible to turn off hot_standby_feedback so that you don't > incur bloat on the primary. And you don't need to store hours or days of > WAL on the primary. Right. This side effect belongs to the combination of recovery_min_apply_delay and hot_standby_feedback/replication slot. recovery_min_apply_delay itself can be used even without this side effect if we accept other trade-offs. When it comes to this time-delayed logical replication feature, there is no choice to avoid the side effect for users who want to use this feature. > I'm very late to this thread, but IIUC you cannot > avoid blocking VACUUM with the proposed feature. Right. > IMO the current set of > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > feature virtually unusable for a lot of workloads, so it's probably worth > exploring an alternative approach. It might require more engineering effort for alternative approaches such as one I proposed but the feature could become better from the user perspective. I also think it would be worth exploring it if we've not yet. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: buildfarm + meson
On 2023-03-07 18:26:21 -0800, Andres Freund wrote: > On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote: > > Yeah. For touch I think we can probably just get rid of this line in the > > root meson.build: > > > > touch = find_program('touch', native: true) > > Yep. > > > For cp there doesn't seem to be a formal requirement, but there is a recipe > > in src/common/unicode/meson.build that uses it, maybe that's what caused the > > failure. On Windows/msvc we could just use copy instead, I think. > > I don't know about using copy, it's very easy to get into trouble due to > interpreting forward slashes as options etc. I propose that for now we just > don't support update-unicode if cp isn't available - just as already not > available when wget isn't available. > > Planning to apply something like the attached soon, unless somebody opposes > that plan. Done.
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Wed, Mar 8, 2023 at 9:09 AM Peter Smith wrote: > > > == > src/backend/executor/execReplication.c > > 3. build_replindex_scan_key > > { > Oid operator; > Oid opfamily; > RegProcedure regop; > - int pkattno = attoff + 1; > - int mainattno = indkey->values[attoff]; > - Oid optype = get_opclass_input_type(opclass->values[attoff]); > + int table_attno = indkey->values[index_attoff]; > + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); > > These variable declarations might look tidier if you kept all the Oids > together. > I am not sure how much that would be an improvement over the current but that will lead to an additional churn in the patch. > == > src/backend/replication/logical/worker.c > > 6. FindReplTupleInLocalRel > > static bool > FindReplTupleInLocalRel(EState *estate, Relation localrel, > LogicalRepRelation *remoterel, > Oid localidxoid, > TupleTableSlot *remoteslot, > TupleTableSlot **localslot) > { > bool found; > > /* > * Regardless of the top-level operation, we're performing a read here, so > * check for SELECT privileges. > */ > TargetPrivilegesCheck(localrel, ACL_SELECT); > > *localslot = table_slot_create(localrel, &estate->es_tupleTable); > > Assert(OidIsValid(localidxoid) || >(remoterel->replident == REPLICA_IDENTITY_FULL)); > > if (OidIsValid(localidxoid)) > found = RelationFindReplTupleByIndex(localrel, localidxoid, > LockTupleExclusive, > remoteslot, *localslot); > else > found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, > remoteslot, *localslot); > > return found; > } > > ~ > > Since that 'found' variable is not used, you might as well remove the > if/else and simplify the code. > Hmm, but that is an existing style/code, and this patch has done nothing which requires that change. Personally, I find the current style better for the readability purpose. -- With Regards, Amit Kapila.
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov wrote: > On 04.03.2023 00:39, Thomas Munro wrote: > > Could we make better use of the safe copy that we have in the log? > > Then the pg_backup_start() subproblem would disappear. Conceptually, > > that'd be just like the way we use FPI for data pages copied during a > > backup. I'm not sure about any of that, though, it's just an idea, > > not tested. > > Sorry, i didn't understand the question about log. Would you explain me > please what kind of log did you mention and where can i look this > safe copy creation in the code? Sorry, I was confused; please ignore that part. We don't have a copy of the control file anywhere else. (Perhaps we should, but that could be a separate topic.)
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Wed, Mar 8, 2023 at 3:03 PM Amit Kapila wrote: > > On Wed, Mar 8, 2023 at 9:09 AM Peter Smith wrote: > > > > > > == > > src/backend/executor/execReplication.c > > > > 3. build_replindex_scan_key > > > > { > > Oid operator; > > Oid opfamily; > > RegProcedure regop; > > - int pkattno = attoff + 1; > > - int mainattno = indkey->values[attoff]; > > - Oid optype = get_opclass_input_type(opclass->values[attoff]); > > + int table_attno = indkey->values[index_attoff]; > > + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); > > > > These variable declarations might look tidier if you kept all the Oids > > together. > > > > I am not sure how much that would be an improvement over the current > but that will lead to an additional churn in the patch. That suggestion was because IMO the 'optype' and 'opfamily' belong together. TBH, really think the assignment of 'opttype' should happen later with the 'opfamilly' assignment too because then it will be *after* the (!AttributeNumberIsValid(table_attno)) check. > > > == > > src/backend/replication/logical/worker.c > > > > 6. FindReplTupleInLocalRel > > > > static bool > > FindReplTupleInLocalRel(EState *estate, Relation localrel, > > LogicalRepRelation *remoterel, > > Oid localidxoid, > > TupleTableSlot *remoteslot, > > TupleTableSlot **localslot) > > { > > bool found; > > > > /* > > * Regardless of the top-level operation, we're performing a read here, so > > * check for SELECT privileges. > > */ > > TargetPrivilegesCheck(localrel, ACL_SELECT); > > > > *localslot = table_slot_create(localrel, &estate->es_tupleTable); > > > > Assert(OidIsValid(localidxoid) || > >(remoterel->replident == REPLICA_IDENTITY_FULL)); > > > > if (OidIsValid(localidxoid)) > > found = RelationFindReplTupleByIndex(localrel, localidxoid, > > LockTupleExclusive, > > remoteslot, *localslot); > > else > > found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, > > remoteslot, *localslot); > > > > return found; > > } > > > > ~ > > > > Since that 'found' variable is not used, you might as well remove the > > if/else and simplify the code. > > > > Hmm, but that is an existing style/code, and this patch has done > nothing which requires that change. Personally, I find the current > style better for the readability purpose. > OK. I failed to notice that was same as the existing code. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada wrote: > > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead. > > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further. > > I prefer option (1) as it's straight forward. I mentioned a similar > idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is > defined. It might be worth checking if there is visible overhead of > tracking chunk memory space. IIRC we've not evaluated it yet. Ok, let's try this -- I can test and profile later this week. -- John Naylor EDB: http://www.enterprisedb.com
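For what it's worth, the "template parameter" option presumably comes down to the usual conditional-compilation pattern, roughly as in this self-contained sketch; all names here are invented (not the actual radix tree template), and a real implementation would also subtract on free:

/*
 * Illustration of making memory-usage tracking optional: the accounting
 * field, the bookkeeping, and the usage-reporting function exist only when
 * RT_TRACK_MEMORY is defined by the template's user.
 */
#include <stdio.h>
#include <stdlib.h>

#define RT_TRACK_MEMORY			/* comment out to compile the tracking away */

typedef struct rt_tree
{
	int			dummy;
#ifdef RT_TRACK_MEMORY
	size_t		total_chunk_space;
#endif
} rt_tree;

static void *
rt_alloc_chunk(rt_tree *tree, size_t size)
{
#ifdef RT_TRACK_MEMORY
	tree->total_chunk_space += size;
#else
	(void) tree;
#endif
	return malloc(size);
}

#ifdef RT_TRACK_MEMORY
static size_t
rt_memory_usage(const rt_tree *tree)
{
	return tree->total_chunk_space;
}
#endif

int
main(void)
{
	rt_tree		t = {0};

	free(rt_alloc_chunk(&t, 64));
	free(rt_alloc_chunk(&t, 128));
#ifdef RT_TRACK_MEMORY
	printf("tracked chunk space: %zu bytes\n", rt_memory_usage(&t));
#endif
	return 0;
}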
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Vignesh, Thank you for reviewing! PSA new version. > > Few comments: > 1) There is no handling of forConnCheck in #else HAVE_POLL, if this is > intentional we could add some comments for the same: > static int > -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) > +pqSocketPoll(int sock, int forRead, > +int forWrite, int forConnCheck, time_t end_time) > { > /* We use poll(2) if available, otherwise select(2) */ > #ifdef HAVE_POLL > @@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int > forWrite, time_t end_time) > int timeout_ms; > > if (!forRead && !forWrite) > - return 0; Comments were added. This missing is intentional, because with the patch present we do not implement checking feature for kqueue(). If you want to check the detailed implementation of that, please see 0004 patch attached on [1]. > 2) Can this condition be added to the parent if condition: > if (!forRead && !forWrite) > - return 0; > + { > + /* Connection check can be available on some limted platforms > */ > + if (!(forConnCheck && PQconnCheckable())) > + return 0; > + } Changed, and added comments atop the if-statement. > 3) Can we add a comment for PQconnCheckable: > +/* Check whether the postgres server is still alive or not */ > +extern int PQconnCheck(PGconn *conn); > +extern int PQconnCheckable(void); > + Added. And I found the tab should be used to divide data type and name, so fixed. > 4) "Note that if 0 is returned and forConnCheck is requested" doesn't > sound right, it can be changed to "Note that if 0 is returned when > forConnCheck is requested" > + * If neither forRead, forWrite nor forConnCheck are set, immediately return > a > + * timeout condition (without waiting). Return >0 if condition is met, 0 if a > + * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is > + * returned and forConnCheck is requested, it means that the socket has not > + * matched POLLRDHUP event and the connection is not closed by the socket > peer. Fixed. [1]: https://www.postgresql.org/message-id/TYAPR01MB58669EAAC02493BFF9F39B06F5D99%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED v37-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch Description: v37-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch v37-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch Description: v37-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch v37-0003-add-test.patch Description: v37-0003-add-test.patch
Re: Testing autovacuum wraparound (including failsafe)
On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas wrote: > > On 16/11/2022 06:38, Ian Lawrence Barwick wrote: > > Thanks for the patch. While reviewing the patch backlog, we have determined > > that > > the latest version of this patch was submitted before meson support was > > implemented, so it should have a "meson.build" file added for consideration > > for > > inclusion in PostgreSQL 16. > > I wanted to do some XID wraparound testing again, to test the 64-bit > SLRUs patches [1], and revived this. Thank you for reviving this thread! > > I took a different approach to consuming the XIDs. Instead of setting > nextXID directly, bypassing GetNewTransactionId(), this patch introduces > a helper function to call GetNewTransactionId() repeatedly. But because > that's slow, it does include a shortcut to skip over "uninteresting" > XIDs. Whenever nextXid is close to an SLRU page boundary or XID > wraparound, it calls GetNewTransactionId(), and otherwise it bumps up > nextXid close to the next "interesting" value. That's still a lot slower > than just setting nextXid, but exercises the code more realistically. > > I've written some variant of this helper function many times over the > years, for ad hoc testing. I'd love to have it permanently in the git tree. These functions seem to be better than mine. > In addition to Masahiko's test for emergency vacuum, this includes two > other tests. 002_limits.pl tests the "warn limit" and "stop limit" in > GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion > transactions in total, exercising XID wraparound in general. > Unfortunately these tests are pretty slow; the tests run for about 4 > minutes on my laptop in total, and use about 20 GB of disk space. So > perhaps these need to be put in a special test suite that's not run as > part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, > that's the slowest of the tests. But I'd love to have these in the git > tree in some form. cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing from the patch. Perhaps the patch intends to add only TAP tests, as the Makefile doesn't define REGRESS? Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to the disk being full. The size of the xid_wraparound test directory is 105MB out of 262MB: % du -sh testrun 262M testrun % du -sh testrun/xid_wraparound/ 105M testrun/xid_wraparound/ % du -sh testrun/xid_wraparound/* 460K testrun/xid_wraparound/001_emergency_vacuum 93M testrun/xid_wraparound/002_limits 12M testrun/xid_wraparound/003_wraparounds % ls -lh testrun/xid_wraparound/002_limits/log* total 93M -rw-------. 1 masahiko masahiko 93M Mar 7 17:34 002_limits_wraparound.log -rw-rw-r--. 1 masahiko masahiko 20K Mar 7 17:34 regress_log_002_limits The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it. The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does. I'll try these ideas. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
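To make the per-table idea above concrete, log_autovacuum_min_duration is an ordinary storage parameter, so a sketch of scoping the extra logging to the test tables (the table name here is invented for illustration) could look like:

    -- log every autovacuum run on this table only, leaving the global
    -- log_autovacuum_min_duration setting untouched
    ALTER TABLE wraparound_tbl SET (log_autovacuum_min_duration = 0);
    -- and to drop the extra logging again afterwards:
    ALTER TABLE wraparound_tbl RESET (log_autovacuum_min_duration);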
Re: optimize several list functions with SIMD intrinsics
On Wed, Mar 08, 2023 at 01:54:15PM +1300, David Rowley wrote: > Interesting and quite impressive performance numbers. Thanks for taking a look. > From having a quick glance at the patch, it looks like you'll need to > take some extra time to make it work on 32-bit builds. At the moment, the support for SIMD intrinsics in Postgres is limited to 64-bit (simd.h has the details). But yes, if we want to make this work for 32-bit builds, additional work is required. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: shoud be get_extension_schema visible?
On Wed, Mar 8, 2023 at 2:04 AM Michael Paquier wrote: > On Mon, Mar 06, 2023 at 04:44:59PM +0900, Michael Paquier wrote: > > I can see why you'd want that, so OK from here to provide this routine > > for external consumption. Let's first wait a bit and see if others > > have any kind of objections or comments. > > Done this one as of e20b1ea. > Thank you very much Pavel > -- > Michael >
Re: Testing autovacuum wraparound (including failsafe)
On Fri, Mar 3, 2023 at 3:34 AM Heikki Linnakangas wrote: > I took a different approach to consuming the XIDs. Instead of setting > nextXID directly, bypassing GetNewTransactionId(), this patch introduces > a helper function to call GetNewTransactionId() repeatedly. But because > that's slow, it does include a shortcut to skip over "uninteresting" > XIDs. Whenever nextXid is close to an SLRU page boundary or XID > wraparound, it calls GetNewTransactionId(), and otherwise it bumps up > nextXid close to the next "interesting" value. That's still a lot slower > than just setting nextXid, but exercises the code more realistically. Surely your TAP test should be using single user mode? Perhaps you missed the obnoxious HINT that's part of the WARNING the test parses? ;-) This is a very useful patch. I certainly don't want to make life harder by (say) connecting it to the single user mode problem. But... the single user mode thing really needs to go away. It's just terrible advice, and actively harms users. -- Peter Geoghegan
RE: Allow logical replication to copy tables in binary format
Dear Melih, >> I think we should add a description to the docs that the initial copy is more likely to fail, and that the user should enable binary format after synchronization if tables have original datatypes. > > I tried to explain when binary copy can cause failures in the doc. What > exactly > do you think is missing? I assumed here that "copy_format" and "binary" were combined into one option. Currently the binary option has this description: ``` Even when this option is enabled, only data types having binary send and receive functions will be transferred in binary ``` But this is not suitable for initial data sync, as we have seen. I meant to say that it must be updated if the options are combined. Note that the following is not necessary for PG16; it is just an improvement for a newer version. Is it possible to automatically switch the binary option from 'true' to 'false' when data transfer fails? As we found, synchronizing the initial data in binary format may lead to another error, which can be solved by changing the option. When DBAs check the log after synchronization and find output like "binary option was changed and worker will restart..." or something similar, they can turn "binary" on again. Best Regards, Hayato Kuroda FUJITSU LIMITED
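As a rough sketch of the manual fallback described above (the subscription name is invented for illustration, and the quoted log message is hypothetical), the DBA-side switch would be something like:

    -- fall back to text-format copy if the binary initial sync keeps failing
    ALTER SUBSCRIPTION mysub SET (binary = false);
    -- once the initial table synchronization has finished, re-enable it
    ALTER SUBSCRIPTION mysub SET (binary = true);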
Re: Date-time extraneous fields with reserved keywords
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Thank you for the response and new patch. The scope of impact is limited to 'epoch' and 'infinity'. Also, it is unlikely that these reserved words will be used in combination with time/date, so this patch is appropriate. The new status of this patch is: Ready for Committer
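For context, the reserved-word inputs in question are accepted on their own; the patch concerns inputs that append extra fields to them (the commented line is only a sketch of the kind of input affected, not the patch's exact test case):

    SELECT timestamp 'epoch';      -- 1970-01-01 00:00:00
    SELECT timestamp 'infinity';   -- infinity
    -- inputs along the lines of timestamp 'epoch 01:02:03', which combine a
    -- reserved keyword with extraneous date/time fields, are what the patch
    -- rejects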
Re: Move defaults toward ICU in 16?
On Fri, 2023-03-03 at 21:45 -0800, Jeff Davis wrote: > > > 0002: update template0 in new cluster (as described above) I think 0002 is about ready and I plan to commit it soon unless someone has more comments. I'm holding off on 0001 for now, because you objected. But I still think 0001 is a good idea so I'd like to hear more before I withdraw it. > > > 0003: default initdb to use ICU This is also about ready, and I plan to commit this soon after 0002. > A different issue: unaccent is calling t_isspace(), which is just not > properly handling locales. t_isspace() always passes NULL as the last > argument to char2wchar. There are TODO comments throughout that file. I posted a bug report and patch for this issue: https://www.postgresql.org/message-id/79e4354d9eccfdb00483146a6b9f6295202e7890.ca...@j-davis.com Regards, Jeff Davis
Re: Allow tailoring of ICU locales with custom rules
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: > You can mess with people by setting up your databases like this: > > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' > > ;-) Would we be the first major database to support custom collation rules? This sounds useful for testing, experimentation, hacking, etc. What are some of the use cases? Is it helpful to comply with unusual or outdated standards or formats? Maybe there are people using special delimiters/terminators and they need them to be treated a certain way during comparisons? Regards, Jeff Davis
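To make the experimentation use case concrete, the same tailoring can be tried per collation rather than cluster-wide; a sketch (collation name invented, and assuming the patch's rules option for CREATE COLLATION) could be:

    CREATE COLLATION custom_rules (provider = icu, locale = 'und',
                                   rules = '&a < c < b');
    SELECT x
    FROM (VALUES ('a'), ('b'), ('c')) AS t(x)
    ORDER BY x COLLATE custom_rules;   -- sorts as a, c, b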
Re: Normalization of utility queries in pg_stat_statements
On Fri, Mar 03, 2023 at 09:37:27AM +0900, Michael Paquier wrote: > Thanks for double-checking, applied 0001 to finish this part of the > work. I am attaching the remaining bits as of the attached, combined > into a single patch. Doing so as a single patch was not feeling right as this actually fixes issues with the location calculations for the Const node, so I have split that into three commits and finally applied the whole. As a bonus, please see attached a patch to apply the normalization to CALL statements using the new automated infrastructure. OUT parameters can be passed to a procedure, hence I guess that these had better be silenced as well. This is not aimed at being integrated, just for reference. -- Michael From 7fff41b050b8da4b7a006071b51ae5fa43bc7c0b Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 8 Mar 2023 15:14:15 +0900 Subject: [PATCH] Apply normalization to CALL statements --- src/include/nodes/parsenodes.h| 6 ++-- src/backend/nodes/queryjumblefuncs.c | 12 +++ .../pg_stat_statements/expected/utility.out | 33 --- contrib/pg_stat_statements/sql/utility.sql| 9 + 4 files changed, 53 insertions(+), 7 deletions(-) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 259e814253..32e5f535c1 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -3227,12 +3227,14 @@ typedef struct InlineCodeBlock */ typedef struct CallStmt { + pg_node_attr(custom_query_jumble) + NodeTag type; FuncCall *funccall; /* from the parser */ /* transformed call, with only input args */ - FuncExpr *funcexpr pg_node_attr(query_jumble_ignore); + FuncExpr *funcexpr; /* transformed output-argument expressions */ - List *outargs pg_node_attr(query_jumble_ignore); + List *outargs; } CallStmt; typedef struct CallContext diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index d7fd72d70f..709f91ab0e 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -52,6 +52,7 @@ static void _jumbleNode(JumbleState *jstate, Node *node); static void _jumbleA_Const(JumbleState *jstate, Node *node); static void _jumbleList(JumbleState *jstate, Node *node); static void _jumbleRangeTblEntry(JumbleState *jstate, Node *node); +static void _jumbleCallStmt(JumbleState *jstate, Node *node); /* * Given a possibly multi-statement source string, confine our attention to the @@ -395,3 +396,14 @@ _jumbleRangeTblEntry(JumbleState *jstate, Node *node) break; } } + +static void +_jumbleCallStmt(JumbleState *jstate, Node *node) +{ + CallStmt *expr = (CallStmt *) node; + FuncExpr *func = expr->funcexpr; + + JUMBLE_FIELD_SINGLE(func->funcid); + _jumbleNode(jstate, (Node *) func->args); + _jumbleNode(jstate, (Node *) expr->outargs); +} diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out index 0047aba5d1..b5a8c6937c 100644 --- a/contrib/pg_stat_statements/expected/utility.out +++ b/contrib/pg_stat_statements/expected/utility.out @@ -240,6 +240,12 @@ DECLARE BEGIN SELECT (i + i)::int INTO r; END; $$ LANGUAGE plpgsql; +CREATE OR REPLACE PROCEDURE sum_out(IN i int, IN j int, OUT k int, OUT l int) AS $$ +DECLARE + r int; +BEGIN + SELECT (i + i)::int INTO r; +END; $$ LANGUAGE plpgsql; CREATE OR REPLACE PROCEDURE sum_two(i int, j int) AS $$ DECLARE r int; @@ -256,15 +262,32 @@ CALL sum_one(3); CALL sum_one(199); CALL sum_two(1,1); CALL sum_two(1,2); +CALL sum_out(1,1,1,1); + k | l +---+--- + | +(1 row) + +CALL sum_out(2,2,1,1); + k | l +---+--- + | 
+(1 row) + +CALL sum_out(2,3,4,5); + k | l +---+--- + | +(1 row) + SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query ---+--+--- - 1 |0 | CALL sum_one(199) - 1 |0 | CALL sum_one(3) - 1 |0 | CALL sum_two(1,1) - 1 |0 | CALL sum_two(1,2) + 2 |0 | CALL sum_one($1) + 3 |0 | CALL sum_out($1,$2,$3,$4) + 2 |0 | CALL sum_two($1,$2) 1 |1 | SELECT pg_stat_statements_reset() -(5 rows) +(4 rows) -- COPY CREATE TABLE copy_stats (a int, b int); diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql index 225d30a62a..dbf38c31bc 100644 --- a/contrib/pg_stat_statements/sql/utility.sql +++ b/contrib/pg_stat_statements/sql/utility.sql @@ -131,6 +131,12 @@ DECLARE BEGIN SELECT (i + i)::int INTO r; END; $$ LANGUAGE plpgsql; +CREATE OR REPLACE PROCEDURE sum_out(IN i int, IN j int, OUT k int, OUT l int) AS $$ +DECLARE + r int; +BEGIN + SELECT (i + i)::int INTO r; +END; $$ LANGUAGE plpgsql; CREATE OR REPLACE PROCEDURE sum_two(i int, j int) AS $$ DECLARE r int; @@ -142,6 +148,9 @@ CALL sum_one(3); CALL sum_one(199); CALL sum_two(1,1); CALL sum_two(1,2); +CALL
Re: Add standard collation UNICODE
On 04.03.23 19:29, Jeff Davis wrote: It looks like the way you've handled this is by inserting the collation with collprovider=icu even if built without ICU support. I think that's a new case, so we need to make sure it throws reasonable user-facing errors. It would look like this: => select * from t1 order by b collate unicode; ERROR: 0A000: ICU is not supported in this build I do like your approach though because, if someone is using a standard collation, I think "not built with ICU" (feature not supported) is a better error than "collation doesn't exist". It also effectively reserves the name "unicode". right
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tuesday, March 7, 2023 9:47 PM Önder Kalacı wrote: Hi, > > > Let me give an example to demonstrate why I thought something is fishy > > > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > > Hmm, alright, this is syntactically possible, but not sure if any user > would do this. Still thanks for catching this. > > And, you are right, if a user has created such a schema, > IdxIsRelationIdentityOrPK() > would return the wrong result and we'd use sequential scan instead of index > scan. > This would be a regression. I think we should change the function. I am looking at the latest patch and have a question about the following code. /* Try to find the tuple */ - if (index_getnext_slot(scan, ForwardScanDirection, outslot)) + while (index_getnext_slot(scan, ForwardScanDirection, outslot)) { - found = true; + /* +* Avoid expensive equality check if the index is primary key or +* replica identity index. +*/ + if (!idxIsRelationIdentityOrPK) + { + if (eq == NULL) + { +#ifdef USE_ASSERT_CHECKING + /* apply assertions only once for the input idxoid */ + IndexInfo *indexInfo = BuildIndexInfo(idxrel); + Assert(IsIndexUsableForReplicaIdentityFull(indexInfo)); +#endif + + /* +* We only need to allocate once. This is allocated within per +* tuple context -- ApplyMessageContext -- hence no need to +* explicitly pfree(). +*/ + eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts); + } + + if (!tuples_equal(outslot, searchslot, eq)) + continue; + } IIRC, it invokes tuples_equal for all cases unless we are using a replica identity key or primary key for the scan. But there seem to be some other cases where the tuples_equal call looks unnecessary. For example, if the table on the subscriber doesn't have a PK or RI key but has a not-null, non-deferrable, unique key. If the apply worker chooses this index to do the scan, it seems we can skip tuples_equal as well. --Example pub: CREATE TABLE test_replica_id_full (a int, b int not null); ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL; CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full; sub: CREATE TABLE test_replica_id_full (a int, b int not null); CREATE UNIQUE INDEX test_replica_id_full_idx ON test_replica_id_full(b); -- I am not 100% sure if it's worth optimizing this by complicating the check in idxIsRelationIdentityOrPK. What do you think? Best Regards, Hou zj
Re: pg_upgrade and logical replication
On Sat, 4 Mar 2023, 14:13 Amit Kapila, wrote: > > > For the publisher nodes, that may be something nice to support (I'm > assuming it > > could be useful for more complex replication setups) but I'm not > interested in > > that at the moment as my goal is to reduce downtime for major upgrade of > > physical replica, thus *not* doing pg_upgrade of the primary node, > whether > > physical or logical. I don't see why it couldn't be done later on, > if/when > > someone has a use case for it. > > > > I thought there is value if we provide a way to upgrade both publisher > and subscriber. It's still unclear to me whether it's actually achievable on the publisher side, as running pg_upgrade leaves a "hole" in the WAL stream and resets the timeline, among other possible difficulties. Now I don't know much about logical replication internals, so I'm clearly not the best person to answer those questions. Now, you came up with a use case linking it to a > physical replica where allowing an upgrade of only subscriber nodes is > useful. It is possible that users find your steps easy to perform and > didn't find them error-prone but it may be better to get some > authentication of the same. I haven't yet analyzed all the steps in > detail but let's see what others think. > It's been quite some time since then, and no one has chimed in or objected. IMO doing a major version upgrade with limited downtime (something faster than stopping postgres and running pg_upgrade) has always been difficult, and that never prevented anyone from doing it, so I don't think it should be a blocker for what I'm suggesting here, especially since the current behavior of pg_upgrade on a subscriber node is IMHO broken. Is there something that can be done for pg16? I was thinking that having a fix for the normal and easy case could be acceptable: only allowing pg_upgrade to optionally, and not by default, preserve the subscription relations IFF all subscriptions only have tables in ready state. States other than ready should be transient, and it's easy to check as a user beforehand and also easy to check during pg_upgrade, so it seems like an acceptable limitation (which I personally see as a good sanity check, but YMMV). It could be lifted in later releases if wanted anyway. It's unclear to me whether this limited scope would also require preserving the replication origins, but having looked at the code I don't think it would be much of a problem, as the local LSN doesn't have to be preserved. In both cases I would prefer a single option (e.g. --preserve-logical-subscription-state or something like that) to avoid too many complications. Similarly, I still don't see any sensible use case for allowing such an option in a normal pg_dump, so I'd rather not expose that.
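For what it's worth, the "all relations in ready state" precondition is cheap to verify from SQL on the subscriber before attempting the upgrade; a sketch of such a check could be:

    -- any rows returned mean some subscribed tables are still in a transient
    -- state (initialize / data copy / sync) rather than ready ('r')
    SELECT s.subname, sr.srrelid::regclass, sr.srsubstate
    FROM pg_subscription_rel sr
    JOIN pg_subscription s ON s.oid = sr.srsubid
    WHERE sr.srsubstate <> 'r';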