Add index OID macro argument to DECLARE_INDEX
This patch changes places like this

    DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650, on pg_aggregate using btree(aggfnoid oid_ops));
    #define AggregateFnoidIndexId 2650

to this

    DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650, AggregateFnoidIndexId, on pg_aggregate using btree(aggfnoid oid_ops));

and makes genbki.pl generate the #define's.

This makes the handling of catalog index OIDs consistent with the handling of catalog tables.  Compare with:

    CATALOG(pg_aggregate,2600,AggregateRelationId)

From 43ffb9d952e3701b636d49dfc2884d78c0627cc7 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 21 Jun 2021 09:10:12 +0200
Subject: [PATCH] Add index OID macro argument to DECLARE_INDEX

Instead of defining symbols such as AmOidIndexId explicitly, include
them as an argument of DECLARE_INDEX() and have genbki.pl generate them
the same way as the table OID symbols from the CATALOG() declaration.
---
 src/backend/catalog/Catalog.pm | 5 +++--
 src/backend/catalog/genbki.pl | 8
 src/backend/utils/cache/syscache.c | 4 ++--
 src/include/catalog/genbki.h | 6 +++---
 src/include/catalog/pg_aggregate.h | 3 +--
 src/include/catalog/pg_am.h | 6 ++
 src/include/catalog/pg_amop.h | 9 +++--
 src/include/catalog/pg_amproc.h | 6 ++
 src/include/catalog/pg_attrdef.h | 6 ++
 src/include/catalog/pg_attribute.h | 6 ++
 src/include/catalog/pg_auth_members.h | 6 ++
 src/include/catalog/pg_authid.h | 6 ++
 src/include/catalog/pg_cast.h | 6 ++
 src/include/catalog/pg_class.h | 9 +++--
 src/include/catalog/pg_collation.h | 6 ++
 src/include/catalog/pg_constraint.h | 15 +--
 src/include/catalog/pg_conversion.h | 9 +++--
 src/include/catalog/pg_database.h | 6 ++
 src/include/catalog/pg_db_role_setting.h | 3 +--
 src/include/catalog/pg_default_acl.h | 6 ++
 src/include/catalog/pg_depend.h | 6 ++
 src/include/catalog/pg_description.h | 3 +--
 src/include/catalog/pg_enum.h | 9 +++--
 src/include/catalog/pg_event_trigger.h | 6 ++
 src/include/catalog/pg_extension.h | 6 ++
 src/include/catalog/pg_foreign_data_wrapper.h | 6 ++
 src/include/catalog/pg_foreign_server.h | 6 ++
 src/include/catalog/pg_foreign_table.h | 3 +--
 src/include/catalog/pg_index.h | 6 ++
 src/include/catalog/pg_inherits.h | 6 ++
 src/include/catalog/pg_init_privs.h | 3 +--
 src/include/catalog/pg_language.h | 6 ++
 src/include/catalog/pg_largeobject.h | 3 +--
 src/include/catalog/pg_largeobject_metadata.h | 3 +--
 src/include/catalog/pg_namespace.h | 6 ++
 src/include/catalog/pg_opclass.h | 6 ++
 src/include/catalog/pg_operator.h | 6 ++
 src/include/catalog/pg_opfamily.h | 6 ++
 src/include/catalog/pg_partitioned_table.h | 3 +--
 src/include/catalog/pg_policy.h | 6 ++
 src/include/catalog/pg_proc.h | 6 ++
 src/include/catalog/pg_publication.h | 6 ++
 src/include/catalog/pg_publication_rel.h | 6 ++
 src/include/catalog/pg_range.h | 8 ++--
 src/include/catalog/pg_replication_origin.h | 6 ++
 src/include/catalog/pg_rewrite.h | 6 ++
 src/include/catalog/pg_seclabel.h | 3 +--
 src/include/catalog/pg_sequence.h | 3 +--
 src/include/catalog/pg_shdepend.h | 6 ++
 src/include/catalog/pg_shdescription.h | 3 +--
 src/include/catalog/pg_shseclabel.h | 3 +--
 src/include/catalog/pg_statistic.h | 3 +--
 src/include/catalog/pg_statistic_ext.h | 9 +++--
 src/include/catalog/pg_statistic_ext_data.h | 3 +--
 src/include/catalog/pg_subscription.h | 6 ++
 src/include/catalog/pg_subscription_rel.h | 3 +--
 src/include/catalog/pg_tablespace.h | 6 ++
 src/include/catalog/pg_transform.h | 6 ++
 src/include/catalog/pg_trigger.h | 9 +++--
 src/include/catalog/pg_ts_config.h | 6 ++
 src/include/catalog/pg_ts_config_map.h | 3 +--
 src/include/catalog/pg_ts_dict.h | 6 ++
 src/include/catalog/pg_ts_parser.h | 6 ++
 src/include/catalog/pg_ts_template.h | 6 ++
 src/include/catalog/pg_type.h | 6 ++
 src/include/catalog/pg_user_mapping.h | 6 ++
 66 files changed, 133 insertions(+), 243 deletions(-)

diff --git a/src/backend/catalog/Catalog.pm
index a5e9869378..6eef0bc680 100644
--- a/src/backend/catalog
Re: Fix for segfault in logical replication on master
On Sat, 19 Jun 2021 at 17:18, Amit Kapila wrote: > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila wrote: >> >> > I thought it was cheap enough to check that the relation we open is an >> > index, because if it is not, we'll segfault when accessing fields of the >> > relation->rd_index struct. I wouldn't necessarily advocate doing any >> > really expensive checks here, but a quick sanity check seemed worth the >> > effort. >> > >> >> I am not telling you anything about the cost of these sanity checks. I >> suggest you raise elog rather than return NULL because if this happens >> there is definitely some problem and continuing won't be a good idea. >> > > Pushed, after making the above change. Additionally, I have moved the > test case to the existing file 001_rep_changes instead of creating a > new one as the test seems to fit there and I was not sure if the test > for just this case deserves a new file.

Hi, Amit

Sorry for the late reply. When we find that the relation has no replica identity index, I think we should free the memory of the indexoidlist, since we free the memory owned by indexoidlist at the end of RelationGetIdentityKeyBitmap():

    if (!OidIsValid(relation->rd_replidindex))
    {
        list_free(indexoidlist);
        return NULL;
    }

Or we can free the memory owned by indexoidlist after checking whether it is NIL, because we do not use it later. If we do not free the memory, there might be a memory leak when relation->rd_replidindex is invalid. Am I right?

-- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
[Patch] Rename PQtraceSetFlags to PQsetTraceFlags for bookindex.html
Hi all PQtraceSetFlags has been renamed PQsetTraceFlags, but the index entry has not been modified, so PQtraceSetFlags is still displayed in bookindex.html. - - PQtraceSetFlagsPQtraceSetFlags + + PQsetTraceFlagsPQtraceSetFlags https://github.com/postgres/postgres/commit/d0e750c0acaf31f60667b1635311bcef5ab38bbe Here is a patch. Best Regards! PQsetTraceFlags.patch Description: PQsetTraceFlags.patch
Autovacuum (analyze) on partitioned tables for ATTACH/DETACH/DROP commands
Hello,

While discussing auto analyze on partitioned tables, we recognized that auto analyze should run on partitioned tables when ATTACH, DETACH and DROP commands are executed [1]. Partitioned tables are checked to see whether they need auto analyze according to their changes_since_analyze (total number of inserts/updates/deletes on partitions), but the above DDL operations are not counted for now.

To support ATTACH, DETACH and DROP commands, I propose the idea as follows:

* I made new configuration parameters, autovacuum_analyze_attach_partition, autovacuum_analyze_detach_partition and autovacuum_analyze_drop_partition to enable/disable this feature.
* When a partition is attached/detached/dropped, pgstat_report_anl_ancestors() is called and checks the above configurations. If true, the number of livetuples of the partition is counted in its ancestor's changed tuples in pgstat_recv_anl_ancestors.

Attached is the v1 patch. What do you think?

[1] https://www.postgresql.org/message-id/ce5c3f04-fc17-7139-fffc-037f2c981bec%40enterprisedb.com

-- Best regards, Yuzuko Hosoya NTT Open Source Software Center v1_autovacuum_for_attach_detach_drop_commands.patch Description: Binary data
Re: Fix for segfault in logical replication on master
On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: > > On Sat, 19 Jun 2021 at 17:18, Amit Kapila wrote: > > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila wrote: > > Or we can free the memory owned by indexoidlist after check whether it is NIL, > because we do not use it in the later. > Valid point. But I am thinking do we really need to fetch and check indexoidlist here? -- With Regards, Amit Kapila.
Re: Fix for segfault in logical replication on master
On Mon, 21 Jun 2021 at 16:22, Amit Kapila wrote: > On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila wrote: >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila >> > wrote: >> >> Or we can free the memory owned by indexoidlist after check whether it is >> NIL, >> because we do not use it in the later. >> > > Valid point. But I am thinking do we really need to fetch and check > indexoidlist here?

IMO, we should not fetch and check the indexoidlist here, since we do not use it. However, we should use RelationGetIndexList() to update relation->rd_replidindex, so we should fetch the indexoidlist; maybe we can use the following code:

    indexoidlist = RelationGetIndexList(relation);
    list_free(indexoidlist);

Or is there any function that only updates relation->rd_replidindex or related fields, but does not fetch the indexoidlist?

-- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Pipeline mode and PQpipelineSync()
Alvaro Herrera writes: > I think I should rephrase this to say that PQpipelineSync() is needed > where the user needs the server to start executing commands; and > separately indicate that it is possible (but not promised) that the > server would start executing commands ahead of time because $reasons. I think always requiring PQpipelineSync() is fine since it also serves as an error recovery boundary. But the fact that the server waits until the sync message to start executing the pipeline is surprising. To me this seems to go contrary to the idea of a "pipeline". In fact, I see the following ways the server could behave: 1. The server starts executing queries and sending their results before receiving the sync message. 2. The server starts executing queries before receiving the sync message but buffers the results until it receives the sync message. 3. The server buffers the queries and only starts executing them and sending the results after receiving the sync message. My observations suggest that the server behaves as (3) but it could also be (2). While it can be tempting to say that this is an implementation detail, this affects the way one writes a client. For example, I currently have the following comment in my code: // Send queries until we get blocked. This feels like a better // overall strategy to keep the server busy compared to sending one // query at a time and then re-checking if there is anything to read // because the results of INSERT/UPDATE/DELETE are presumably small // and quite a few of them can get buffered before the server gets // blocked. This would be a good strategy for behavior (1) but not (3) (where it would make more sense to queue the queries on the client side). So I think it would be useful to clarify the server behavior and specify it in the documentation. > Do I have it right that other than this documentation problem, you've > been able to use pipeline mode successfully? So far I've only tried it in a simple prototype (single INSERT statement). But I am busy plugging it into ODB's bulk operation support (that we already have for Oracle and MSSQL) and once that's done I should be able to exercise things in more meaningful ways.
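To make the question concrete, the pattern in question looks roughly like this (a minimal sketch only: it assumes an already-open PGconn *conn and a table t(n int), and it ignores error recovery and the non-blocking send handling a real client needs):

#include <libpq-fe.h>
#include <stdio.h>

static int
insert_batch(PGconn *conn)
{
    char        buf[32];
    const char *params[1] = {buf};

    if (PQenterPipelineMode(conn) != 1)
        return -1;

    /*
     * Queue a batch of statements; whether the server starts executing
     * them before the Sync arrives is exactly the open question above.
     */
    for (int i = 0; i < 100; i++)
    {
        snprintf(buf, sizeof(buf), "%d", i);
        if (PQsendQueryParams(conn, "INSERT INTO t (n) VALUES ($1)",
                              1, NULL, params, NULL, NULL, 0) != 1)
            return -1;
    }

    /* Marks the end of the pipeline and an error-recovery boundary. */
    if (PQpipelineSync(conn) != 1)
        return -1;

    /* Drain one result per queued statement, then the sync marker. */
    for (;;)
    {
        PGresult   *res = PQgetResult(conn);

        if (res == NULL)
            continue;           /* separator between statements' results */

        ExecStatusType status = PQresultStatus(res);

        PQclear(res);
        if (status == PGRES_PIPELINE_SYNC)
            break;
        if (status == PGRES_FATAL_ERROR || status == PGRES_PIPELINE_ABORTED)
            fprintf(stderr, "statement failed: %s\n", PQerrorMessage(conn));
    }

    return PQexitPipelineMode(conn) == 1 ? 0 : -1;
}

With behavior (3) as described above, all of the server-side work happens only after the PQpipelineSync() call; under behavior (1) or (2) the INSERTs could already be executing while the client is still queueing, which is what makes the "send until blocked" strategy attractive.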
Re: Doc patch for Logical Replication Message Formats (PG14)
On Mon, Jun 21, 2021 at 12:26 PM Brar Piening wrote: > > Hello Hackers, > while amending Npgsql to account for the Logical Streaming Replication > Protocol changes in PostgreSQL 14 I stumbled upon two documentation > inaccuracies in the Logical Replication Message Formats documentation > (https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html) > that have been introduced (or rather omitted) with the recent changes to > allow pgoutput to send logical decoding messages > (https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec) > and to allow logical replication to transfer data in binary format > (https://github.com/postgres/postgres/commit/9de77b5453130242654ff0b30a551c9c862ed661). > > > 1. The content of the logical decoding message in the 'Message' message > is prefixed with a length field (Int32) which isn't documented yet. > See > > https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L388 > 2. The TupleData may now contain the byte 'b' as indicator for binary > data which isn't documented yet. See > > https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/include/replication/logicalproto.h#L83 > and > > https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L558. > > The attached documentation patch fixes both. > Yeah, I think these should be fixed and your patch looks good to me in that regard. -- With Regards, Amit Kapila.
Re: [Patch] Rename PQtraceSetFlags to PQsetTraceFlags for bookindex.html
On Mon, Jun 21, 2021 at 02:36:19AM +, zhangj...@fujitsu.com wrote: > Here is a patch. Pushed.
Re: Fix for segfault in logical replication on master
On Mon, Jun 21, 2021 at 2:06 PM Japin Li wrote: > > On Mon, 21 Jun 2021 at 16:22, Amit Kapila wrote: > > On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: > >> > >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila wrote: > >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila > >> > wrote: > >> > >> Or we can free the memory owned by indexoidlist after check whether it is > >> NIL, > >> because we do not use it in the later. > >> > > > > Valid point. But I am thinking do we really need to fetch and check > > indexoidlist here? > > IMO, we shold not fetch and check the indexoidlist here, since we do not > use it. However, we should use RelationGetIndexList() to update the > reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we > can use the following code: > > indexoidlist = RelationGetIndexList(relation); > list_free(indexoidlist); > > Or does there any function that only update the relation->rd_replidindex > or related fields, but do not fetch the indexoidlist? > How about RelationGetReplicaIndex? It fetches the indexlist only when required and frees it immediately. But otherwise, currently, there shouldn't be any memory leak because we allocate this in "logical replication output context" which is reset after processing each change message, see pgoutput_change. -- With Regards, Amit Kapila.
Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values
Hi, Sequence MINVALUE/MAXVALUE values are read into "int64" variables and then range-checked according to the sequence data-type. However, for a BIGINT sequence, checking whether these are < PG_INT64_MIN or > PG_INT64_MAX always evaluates to false, as an int64 can't hold such values. I've attached a patch to remove those useless checks. The MINVALUE/MAXVALUE values are anyway int64 range-checked prior to this, as part of conversion to int64. Regards, Greg Nancarrow Fujitsu Australia v1-0001-Remove-useless-int64-range-checks-on-BIGINT-sequence.patch Description: Binary data
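For illustration, the shape of the dead check is the following (a standalone sketch, not the actual sequence.c code; the names here are made up):

#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the macros in src/include/c.h. */
#define PG_INT64_MIN  INT64_MIN
#define PG_INT64_MAX  INT64_MAX

/*
 * By the time a MINVALUE/MAXVALUE setting has been converted to int64,
 * both comparisons below are tautologically false: an int64 cannot hold
 * a value outside [PG_INT64_MIN, PG_INT64_MAX].  Some compilers can even
 * warn that the expressions can never be true.
 */
static bool
bigint_bound_out_of_range(int64_t bound)
{
    return bound < PG_INT64_MIN || bound > PG_INT64_MAX;   /* always false */
}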
Re: seawasp failing, maybe in glibc allocator
Hi, On 2021-06-20 19:56:56 -0400, Tom Lane wrote: > Thomas Munro writes: > > Looking at their release schedule on https://llvm.org/, I see we have > > a gamble to make. They currently plan to cut RC1 at the end of July, > > and to release in late September (every second LLVM major release > > coincides approximately with a PG major release). Option 1: wait > > until we branch for 14, and then push this to master so that at least > > seawasp can get back to looking for new problems, and then back-patch > > only after they release (presumably in time for our November > > releases). If their API change sticks, PostgreSQL crashes and gives > > weird results with the initial release of LLVM 13 until our fix comes > > out. Option 2: get ahead of their release and get this into 14 + > > August back branch releases based on their current/RC behaviour. If > > they decide to revert the change before the final release, we'll leak > > symbol names because we hold an extra reference, until we can fix > > that. I think I'd vote for 2 or 2+ (backpatch immediately). > If that's an accurate characterization of the tradeoff, I have little > difficulty in voting for #2. A crash is strictly worse than a memory > leak. Besides which, I've heard little indication that they might > revert. We might be able to get them to revert and put in a different API, but I don't think it'd clearly be an improvement at this point. Greetings, Andres Freund
Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values
On Mon, 21 Jun 2021 at 22:10, Greg Nancarrow wrote: > Sequence MINVALUE/MAXVALUE values are read into "int64" variables and > then range-checked according to the sequence data-type. > However, for a BIGINT sequence, checking whether these are < > PG_INT64_MIN or > PG_INT64_MAX always evaluates to false, as an int64 > can't hold such values. It might be worth putting in a comment to mention that the check is not needed. Just in case someone looks again one day and thinks the checks are missing. Probably best to put this in the July commitfest so it does not get missed. David
Re: Fix for segfault in logical replication on master
On Mon, 21 Jun 2021 at 17:54, Amit Kapila wrote: > On Mon, Jun 21, 2021 at 2:06 PM Japin Li wrote: >> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila wrote: >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: >> >> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila wrote: >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila >> >> > wrote: >> >> >> >> Or we can free the memory owned by indexoidlist after check whether it is >> >> NIL, >> >> because we do not use it in the later. >> >> >> > >> > Valid point. But I am thinking do we really need to fetch and check >> > indexoidlist here? >> >> IMO, we shold not fetch and check the indexoidlist here, since we do not >> use it. However, we should use RelationGetIndexList() to update the >> reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we >> can use the following code: >> >> indexoidlist = RelationGetIndexList(relation); >> list_free(indexoidlist); >> >> Or does there any function that only update the relation->rd_replidindex >> or related fields, but do not fetch the indexoidlist? >> > > How about RelationGetReplicaIndex? It fetches the indexlist only when > required and frees it immediately. But otherwise, currently, there > shouldn't be any memory leak because we allocate this in "logical > replication output context" which is reset after processing each > change message, see pgoutput_change.

Thanks for your explanation. It might not be a memory leak, however it's a little confusing when we free the memory of the indexoidlist in one place, but not in another.

I attached a patch to fix this. Any thoughts?

-- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d55ae016d0..94fbf1aa19 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5244,9 +5244,9 @@ Bitmapset *
 RelationGetIdentityKeyBitmap(Relation relation)
 {
 	Bitmapset  *idindexattrs = NULL;	/* columns in the replica identity */
-	List	   *indexoidlist;
 	Relation	indexDesc;
 	int			i;
+	Oid			replidindex;
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result */
@@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	/* Historic snapshot must be set. */
 	Assert(HistoricSnapshotActive());
 
-	indexoidlist = RelationGetIndexList(relation);
-
-	/* Fall out if no indexes (but relhasindex was set) */
-	if (indexoidlist == NIL)
-		return NULL;
+	replidindex = RelationGetReplicaIndex(relation);
 
 	/* Fall out if there is no replica identity index */
-	if (!OidIsValid(relation->rd_replidindex))
+	if (!OidIsValid(replidindex))
 		return NULL;
 
 	/* Look up the description for the replica identity index */
-	indexDesc = RelationIdGetRelation(relation->rd_replidindex);
+	indexDesc = RelationIdGetRelation(replidindex);
 
 	if (!RelationIsValid(indexDesc))
 		elog(ERROR, "could not open relation with OID %u",
@@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation)
 	}
 
 	RelationClose(indexDesc);
-	list_free(indexoidlist);
 
 	/* Don't leak the old values of these bitmaps, if any */
 	bms_free(relation->rd_idattr);
Re: disfavoring unparameterized nested loops
On Wed, 16 Jun 2021 at 15:08, Peter Geoghegan wrote: > It seems important to distinguish between risk and uncertainty -- > they're rather different things. The short version is that you can > model risk but you cannot model uncertainty. This seems like a problem > of uncertainty to me.

You might be right there. "Uncertainty" or "Certainty" seems more like a value that clauselist_selectivity() would be able to calculate itself. It would just be up to the planner to determine what to do with it.

One thing I thought about is that if the costing model was able to separate out a cost of additional (unexpected) rows then it would be easier for add_path() to take into account how bad things might go if we underestimate. For example, an unparameterized Nested Loop that estimates the outer Path to have 1 row will incur an entire additional inner scan cost if there are 2 rows. With Hash Join the cost is just an additional hashtable lookup, which is dead cheap. I don't know exactly how add_path() would weigh all that up, but it seems to me that I wouldn't take the risk unless I was 100% certain that the Nested Loop's outer Path would only return 1 row exactly; if there was any chance at all it could return more, I'd be picking some other join method.

David
Re: Doc patch for Logical Replication Message Formats (PG14)
Amit Kapila wrote: On Mon, Jun 21, 2021 at 12:26 PM Brar Piening wrote: Hello Hackers, while amending Npgsql to account for the Logical Streaming Replication Protocol changes in PostgreSQL 14 I stumbled upon two documentation inaccuracies in the Logical Replication Message Formats documentation (https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html) that have been introduced (or rather omitted) with the recent changes to allow pgoutput to send logical decoding messages (https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec) and to allow logical replication to transfer data in binary format (https://github.com/postgres/postgres/commit/9de77b5453130242654ff0b30a551c9c862ed661). 1. The content of the logical decoding message in the 'Message' message is prefixed with a length field (Int32) which isn't documented yet. See https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L388 2. The TupleData may now contain the byte 'b' as indicator for binary data which isn't documented yet. See https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/include/replication/logicalproto.h#L83 and https://github.com/postgres/postgres/blob/69a58bfe4ab05567a8fab8bdce7f3095ed06b99c/src/backend/replication/logical/proto.c#L558. The attached documentation patch fixes both. Yeah, I think these should be fixed and your patch looks good to me in that regard. After looking at the docs once again I have another minor amendment (new patch attached). diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index bc2a2feb0b..7d29308abc 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -6498,6 +6498,18 @@ Message + + + +Int32 + + + +Length of the content. + + + + Byten @@ -7430,6 +7442,19 @@ TupleData + +Or + + + +Byte1('b') + + + +Identifies the data as binary value. + + + Int32 @@ -7446,8 +7471,8 @@ TupleData -The value of the column, in text format. (A future release -might support additional formats.) +The value of the column, eiter in binary or in text format. +(As specified in the preceding format byte). n is the above length.
RE: Refactor ECPGconnect and allow IPv6 connection
Dear Michael,

Thank you for replying!

> it does not strike me as a great idea to have a duplicate > logic doing the parsing of URIs, even if libpq accepts multiple > hosts/ports as an option.

Yeah, I agree with your argument that the duplicated parse function should be removed. ECPG parses the connection string because it uses PQconnectdbParams() even if the target is specified in the new style, hence my elementary idea is that the parsing can be skipped if PQconnectdb() is called instead. I will try to remake the patches based on that idea.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Optionally automatically disable logical replication subscriptions on error
On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila wrote: > > On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger > wrote: > > > I don't mind if you want to store more information, and maybe that needs > > to be stored somewhere else. Do you believe pg_subscription_rel is a > > suitable location? > > > It won't be sufficient to store information in either > pg_subscription_rel or pg_susbscription. I think if we want to store > the required information in a catalog then we need to define a new > catalog (pg_subscription_conflicts or something like that) with > information corresponding to each rel in subscription (srsubid oid > (Reference to subscription), srrelid oid (Reference to relation), > ). OTOH, we can choose to send the error > information to stats collector which will then be available via stat > view and update system catalog to disable the subscription but there > will be a risk that we might send info of failed transaction to stats > collector but then fail to update system catalog to disable the > subscription.

I think we should store the input from the user (like the disable_on_error flag or the xid to skip) in the system catalog pg_subscription and send the error information (subscription_id, rel_id, xid of the failed xact, error_code, error_message, etc.) to the stats collector, which can be used to display such information via a stat view. The disable_on_error flag handling could be that on error it sends the required error info to the stats collector and then updates subenabled in pg_subscription. In rare conditions, where we are able to send the message but couldn't update the subenabled info in pg_subscription either due to some error or a server restart, the apply worker would again try to apply the same change and would hit the same error again, which I think should be fine because it will ultimately succeed.

The skip xid handling will also be somewhat similar where, on an error, we will send the error information to the stats collector, which will be displayed via the stats view. Then the user is expected to ask for skip xid (Alter Subscription ... SKIP xid) based on the information displayed via the stat view. Now, the apply worker can skip changes from such a transaction, and then during processing of the commit record of the skipped transaction, it should update the xid to an invalid value, so that next time it won't be used. I think it is important to update the xid to an invalid value as part of the skipped transaction because otherwise, after a restart, we won't be able to decide whether we still want to skip the xid stored for a subscription.

-- With Regards, Amit Kapila.
Re: Fix for segfault in logical replication on master
On Mon, Jun 21, 2021 at 4:09 PM Japin Li wrote: > > On Mon, 21 Jun 2021 at 17:54, Amit Kapila wrote: > > On Mon, Jun 21, 2021 at 2:06 PM Japin Li wrote: > >> > >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila wrote: > >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: > >> >> > >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila > >> >> wrote: > >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila > >> >> > wrote: > >> >> > >> >> Or we can free the memory owned by indexoidlist after check whether it > >> >> is NIL, > >> >> because we do not use it in the later. > >> >> > >> > > >> > Valid point. But I am thinking do we really need to fetch and check > >> > indexoidlist here? > >> > >> IMO, we shold not fetch and check the indexoidlist here, since we do not > >> use it. However, we should use RelationGetIndexList() to update the > >> reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we > >> can use the following code: > >> > >> indexoidlist = RelationGetIndexList(relation); > >> list_free(indexoidlist); > >> > >> Or does there any function that only update the relation->rd_replidindex > >> or related fields, but do not fetch the indexoidlist? > >> > > > > How about RelationGetReplicaIndex? It fetches the indexlist only when > > required and frees it immediately. But otherwise, currently, there > > shouldn't be any memory leak because we allocate this in "logical > > replication output context" which is reset after processing each > > change message, see pgoutput_change. > > Thanks for your explanation. It might not be a memory leak, however it's > a little confuse when we free the memory of the indexoidlist in one place, > but not free it in another place. > > I attached a patch to fix this. Any thoughts? > Your patch appears to be on the lines we discussed but I would prefer to get it done after Beta2 as this is just a minor code improvement. Can you please send the change as a patch file instead of copy-pasting the diff at the end of the email? -- With Regards, Amit Kapila.
Re: Doc patch for Logical Replication Message Formats (PG14)
On Mon, Jun 21, 2021 at 4:13 PM Brar Piening wrote: > > Amit Kapila wrote: > > > After looking at the docs once again I have another minor amendment (new > patch attached). > +The value of the column, eiter in binary or in text format. Typo. /eiter/either -- With Regards, Amit Kapila.
RE: [bug?] Missed parallel safety checks, and wrong parallel safety
On Wednesday, June 16, 2021 11:27 AM houzj.f...@fujitsu.com wrote: > On Tuesday, June 15, 2021 10:01 PM Robert Haas wrote: > > Just to be clear here, I don't think it really matters what we *want* > > to do. I don't think it's reasonably *possible* to check all the > > partitions, because we don't hold locks on them. When we're assessing > > a bunch of stuff related to an individual relation, we have a lock on > > it. I think - though we should double-check tablecmds.c - that this is > > enough to prevent all of the dependent objects - triggers, > > constraints, etc. - from changing. So the stuff we care about is > > stable. But the situation with a partitioned table is different. In > > that case, we can't even examine that stuff without locking all the > > partitions. And even if we do lock all the partitions, the stuff could > > change > immediately afterward and we wouldn't know. So I think it would be difficult > to > >make it correct. > > Now, maybe it could be done, and I think that's worth a little more > > thought. For example, perhaps whenever we invalidate a relation, we > > could also somehow send some new, special kind of invalidation for its > > parent saying, essentially, "hey, one of your children has changed in > > a way you might care about." But that's not good enough, because it > > only goes up one level. The grandparent would still be unaware that a > > change it potentially cares about has occurred someplace down in the > > partitioning hierarchy. That seems hard to patch up, again because of > > the locking rules. The child can know the OID of its parent without > > locking the parent, but it can't know the OID of its grandparent > > without locking its parent. Walking up the whole partitioning > > hierarchy might be an issue for a number of reasons, including > > possible deadlocks, and possible race conditions where we don't emit > > all of the right invalidations in the face of concurrent changes. So I > > don't quite see a way around this part of the problem, but I may well be > missing something. In fact I hope I am missing something, because solving this > problem would be really nice. > > I think the check of partition could be even more complicated if we need to > check the parallel safety of partition key expression. If user directly > insert into a > partition, then we need invoke ExecPartitionCheck which will execute all it's > parent's and grandparent's partition key expressions. It means if we change a > parent table's partition key expression(by 1) change function in expr or 2) > attach the parent table as partition of another parent table), then we need to > invalidate all its child's relcache. > > BTW, currently, If user attach a partitioned table 'A' to be partition of > another > partitioned table 'B', the child of 'A' will not be invalidated. To be honest, I didn't find a cheap way to invalidate partitioned table's parallel safety automatically. For one thing, We need to recurse higher in the partition tree to invalid all the parent table's relcache(and perhaps all its children's relcache) not only when alter function parallel safety, but also for DDLs which will invalid the partition's relcache, such as CREATE/DROP INDEX/TRIGGER/CONSTRAINT. It seems too expensive. For another, even if we can invalidate the partitioned table's parallel safety automatically, we still need to lock all the partition when checking table's parallel safety, because the partition's parallel safety could be changed after checking the parallel safety. 
So, IMO, at least for partitioned tables, an explicit flag looks more acceptable. For regular tables, it seems we can work it out automatically, although I am not sure whether anyone would think that looks a bit inconsistent.

Best regards, houzj
Re: Doc patch for Logical Replication Message Formats (PG14)
Amit Kapila schrieb: On Mon, Jun 21, 2021 at 4:13 PM Brar Piening wrote: Amit Kapila wrote: After looking at the docs once again I have another minor amendment (new patch attached). +The value of the column, eiter in binary or in text format. Typo. /eiter/either Fixed - thanks! diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml index bc2a2feb0b..7d29308abc 100644 --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -6498,6 +6498,18 @@ Message + + + +Int32 + + + +Length of the content. + + + + Byten @@ -7430,6 +7442,19 @@ TupleData + +Or + + + +Byte1('b') + + + +Identifies the data as binary value. + + + Int32 @@ -7446,8 +7471,8 @@ TupleData -The value of the column, in text format. (A future release -might support additional formats.) +The value of the column, either in binary or in text format. +(As specified in the preceding format byte). n is the above length.
Is the testing a bit too light on GROUP BY DISTINCT?
In [1], Yaoguang reported an Assert failure in expand_grouping_sets. Since the beta2 deadline is looming, I pushed a quick fix for that. As mentioned over on bugs, only 1 test triggers that code, and because the List of IntLists always had an empty list as the first element (due to the code just above sorting the top-level List by the number of elements in each of the contained IntLists), the NIL was always at the start of the top-level List. It wasn't too hard to modify the test to change that. I wonder if the testing for the feature is just a bit too light. Would it maybe be worth adding a GROUP BY DISTINCT with GROUPING SETS test? Any thoughts? David [1] https://www.postgresql.org/message-id/17067-665d50fa321f7...@postgresql.org
Re: SSL/TLS instead of SSL in docs
> On 18 Jun 2021, at 07:37, Michael Paquier wrote: > > On Tue, Jun 15, 2021 at 03:59:18PM +0200, Daniel Gustafsson wrote: >> While in there I added IMO missing items to the glossary and acronyms >> sections >> as well as fixed up markup around OpenSSL. >> >> This only deals with docs, but if this is deemed interesting then userfacing >> messages in the code should use SSL/TLS as well of course. > > +SNI > + > + > + Server Name Indication > + > + > It looks inconsistent to me to point to the libpq documentation to get > the details about SNI. Wouldn't is be better to have an item in the > glossary that refers to the bits of RFC 6066, and remove the reference > of the RPC from the libpq page? I opted for a version with SNI in the glossary but without a link to the RFC since no definitions in the glossary has ulinks. > - to present a valid (trusted) SSL certificate, while > + to present a valid (trusted) > SSL/TLS certificate, while > This style with two acronyms for what we want to be one thing is > heavy. Could it be better to just have one single acronym called > SSL/TLS that references both parts? Maybe, I don't know. I certainly don't prefer the way which is in the patch but I also think it's the most "correct" way so I opted for that to start the discussion. If we're fine with one acronym tag for the combination then I'm happy to change to that. A slightly more invasive idea would be to not have acronyms at all and instead move the ones that do benefit from clarification to the glossary? ISTM that having a brief description of terms and not just the expansion is beneficial to the users. That would however be for another thread, but perhaps thats worth discussing? > Patch 0003, for the markups with OpenSSL, included one > SSL/TLS entry. Ugh, sorry, that must've been a git add -p fat-finger. -- Daniel Gustafsson https://vmware.com/ v2-0002-docs-Replace-usage-of-SSL-with-SSL-TLS.patch Description: Binary data v2-0001-docs-SSL-TLS-related-acronyms-and-glossary.patch Description: Binary data
Re: disfavoring unparameterized nested loops
On Tue, Jun 15, 2021 at 8:00 PM David Rowley wrote: > In my experience, the most common reason that the planner chooses > non-parameterized nested loops wrongly is when there's row > underestimation that says there's just going to be 1 row returned by > some set of joins. The problem often comes when some subsequent join > is planned and the planner sees the given join rel only produces one > row. The cheapest join method we have to join 1 row is Nested Loop. > So the planner just sticks the 1-row join rel on the outer side > thinking the executor will only need to scan the inner side of the > join once. When the outer row count blows up, then we end up scanning > that inner side many more times. The problem is compounded when you > nest it a few joins deep > > Most of the time when I see that happen it's down to either the > selectivity of some correlated base-quals being multiplied down to a > number low enough that we clamp the estimate to be 1 row. The other > case is similar, but with join quals. If an estimate is lower than 1, that should be a red flag that Something Is Wrong. This is kind of a crazy idea, but what if we threw it back the other way by computing 1 / est , and clamping that result to 2 <= res < 10 (or 100 or something)? The theory is, the more impossibly low it is, the more wrong it is. I'm attracted to the idea of dealing with it as an estimation problem and not needing to know about join types. Might have unintended consequences, though. Long term, it would be great to calculate something about the distribution of cardinality estimates, so we can model risk in the estimates. -- John Naylor EDB: http://www.enterprisedb.com
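To make that concrete, the heuristic would look something like this (purely an illustrative sketch, not planner code; the clamp bounds are the part that's up for debate):

/*
 * Sketch of the "throw it back the other way" idea: when the raw selectivity
 * math produces fewer than one row, treat the degree of impossibility as a
 * hint of how wrong the estimate probably is, instead of clamping to 1 row.
 */
static double
rescue_suspicious_rowcount(double raw_est)
{
    double      res;

    if (raw_est >= 1.0)
        return raw_est;         /* nothing suspicious, leave it alone */

    res = 1.0 / raw_est;        /* the lower the estimate, the bigger the bump */

    if (res < 2.0)
        res = 2.0;
    if (res > 100.0)            /* 10? 100? open question */
        res = 100.0;

    return res;
}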
Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values
On Mon, Jun 21, 2021 at 8:32 PM David Rowley wrote: > > It might be worth putting in a comment to mention that the check is > not needed. Just in case someone looks again one day and thinks the > checks are missing. > > Probably best to put this in the July commitfest so it does not get missed. Updated the patch, and will add it to the Commitfest, thanks. Regards, Greg Nancarrow Fujitsu Australia v2-0001-Remove-useless-int64-range-checks-on-BIGINT-sequence.patch Description: Binary data
Re: Add index OID macro argument to DECLARE_INDEX
On Mon, Jun 21, 2021 at 3:23 AM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > > This patch changes places like this > > DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650, on > pg_aggregate using btree(aggfnoid oid_ops)); > #define AggregateFnoidIndexId 2650 > > to this > > DECLARE_UNIQUE_INDEX_PKEY(pg_aggregate_fnoid_index, 2650, > AggregateFnoidIndexId, on pg_aggregate using btree(aggfnoid oid_ops)); +1, and the patch looks good to me. -- John Naylor EDB: http://www.enterprisedb.com
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Hi,

On 2021-06-16 12:12:23 -0700, Andres Freund wrote: > Could you share your testcase? I've been working on a series of patches > to address this (I'll share in a bit), and I've run quite a few tests, > and didn't hit any infinite loops.

Sorry for not yet doing that. Unfortunately I have an ongoing family health issue (& associated travel) claiming time and energy :(. I've pushed the minimal fix due to beta 2.

Beyond beta 2 I am thinking of the below to unify the horizon determination:

static inline GlobalVisHorizonKind
GlobalVisHorizonKindForRel(Relation rel)
{
	if (!rel)
		return VISHORIZON_SHARED;

	/*
	 * Other relkinds currently don't contain xids, nor always the necessary
	 * logical decoding markers.
	 */
	Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
		   rel->rd_rel->relkind == RELKIND_MATVIEW ||
		   rel->rd_rel->relkind == RELKIND_TOASTVALUE);

	if (rel->rd_rel->relisshared || RecoveryInProgress())
		return VISHORIZON_SHARED;
	else if (IsCatalogRelation(rel) ||
			 RelationIsAccessibleInLogicalDecoding(rel))
		return VISHORIZON_CATALOG;
	else if (!RELATION_IS_LOCAL(rel))
		return VISHORIZON_DATA;
	else
		return VISHORIZON_TEMP;
}

That's then used in GetOldestNonRemovableTransactionId(), GlobalVisTestFor(). Makes sense?

Regards, Andres
Using indexUnchanged with nbtree
Seems like we can skip the uniqueness check if indexUnchanged, which will speed up non-HOT UPDATEs on tables with B-Trees. Passes tests. Comments? -- Simon Riggshttp://www.EnterpriseDB.com/ skip_nonHOT_btree_unique_check.v1.patch Description: Binary data
Doc chapter for Hash Indexes
New chapter for Hash Indexes, designed to help users understand how they work and when to use them. Mostly newly written, but a few paras lifted from README when they were helpful. -- Simon Riggshttp://www.EnterpriseDB.com/ doc_hash_index.v1.patch Description: Binary data
Re: Character expansion with ICU collations
I have a proposal for how to support tailoring rules in ICU collations:

The ucol_openRules() function is an alternative to the ucol_open() function that PostgreSQL calls today, but it takes the collation strength as one of its parameters, so the locale string would need to be parsed before creating the collator. After the collator is created using either ucol_openRules or ucol_open, the ucol_setAttribute() function may be used to set individual attributes from keyword=value pairs in the locale string as it does now, except that the strength probably can't be changed after opening the collator with ucol_openRules. So the logic in pg_locale.c would need to be reorganized a little bit, but that sounds straightforward.

One simple solution would be to have the tailoring rules be specified as a new keyword=value pair, such as colTailoringRules=. Since the rules may contain single quote characters or PostgreSQL escape characters, any single quote characters or escapes would need to be escaped using PostgreSQL escape rules. If colTailoringRules is present, colStrength would also be known prior to opening the collator, or would default to tertiary, and we would keep a local flag indicating that we should not process the colStrength keyword again, if specified.

Representing the TailoringRules as just another keyword=value in the locale string means that we don't need any change to the catalog to store it. It's just part of the locale specification. I think we wouldn't even need to bump the catversion.

Are there any tailoring rules, such as expansions and contractions, that we should disallow? I realize that we don't handle nondeterministic collations in LIKE or regular expression operations as of PG14, but given expr LIKE 'a%' on a database with a UTF-8 encoding and arbitrary tailoring rules that include expansions and contractions, is it still guaranteed that expr must sort BETWEEN 'a' AND ('a' || E'/u') ?
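For reference, the ICU call sequence this implies looks roughly like the following (a sketch against the public ICU C API only; the rule string, buffer handling, and error handling are simplified, and the colTailoringRules plumbing itself is of course not implemented anywhere yet):

#include <unicode/ucol.h>
#include <unicode/ustring.h>

/* Open a collator from tailoring rules instead of a plain locale name. */
static UCollator *
open_rules_collator(const char *rules8, UColAttributeValue strength)
{
    UChar       rules[256];     /* demo-sized buffer, ASCII-only conversion */
    UParseError parse_error;
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *collator;

    u_uastrcpy(rules, rules8);

    /*
     * The strength has to be supplied here; per the proposal above it
     * probably cannot be changed after the collator has been opened.
     */
    collator = ucol_openRules(rules, -1, UCOL_DEFAULT, strength,
                              &parse_error, &status);
    if (U_FAILURE(status))
        return NULL;

    /* Other keyword=value attributes can still be applied afterwards. */
    ucol_setAttribute(collator, UCOL_NUMERIC_COLLATION, UCOL_ON, &status);

    return collator;
}

The point relevant to pg_locale.c is simply that the strength has to be known before this call, which is what forces at least the colStrength keyword to be parsed before the collator is created.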
Fix pkg-config file for static linking
Since https://github.com/postgres/postgres/commit/ea53100d5 (or Postgres 12.0) the shipped pkg-config file is broken for statically linking libpq because libpgcommon and libpgport are missing. This patch adds those two missing private dependencies.

diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index 0c4e55b6ad..fe21335d2d 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -94,7 +94,7 @@ SHLIB_PREREQS = submake-libpgport
 
 SHLIB_EXPORTS = exports.txt
 
-PKG_CONFIG_REQUIRES_PRIVATE = libssl libcrypto
+PKG_CONFIG_REQUIRES_PRIVATE = libpgcommon libpgport libssl libcrypto
 
 all: all-lib
Re: disfavoring unparameterized nested loops
> > > > Most of the time when I see that happen it's down to either the > > selectivity of some correlated base-quals being multiplied down to a > > number low enough that we clamp the estimate to be 1 row. The other > > case is similar, but with join quals. > > If an estimate is lower than 1, that should be a red flag that Something Is > Wrong. This is kind of a crazy idea, but what if we threw it back the other > way by computing 1 / est , and clamping that result to 2 <= res < 10 (or > 100 or something)? The theory is, the more impossibly low it is, the more > wrong it is. I'm attracted to the idea of dealing with it as an estimation > problem and not needing to know about join types. Might have unintended > consequences, though. > > Long term, it would be great to calculate something about the distribution > of cardinality estimates, so we can model risk in the estimates. > Hi, Laurenz suggested clamping to 2 in this thread in 2017: https://www.postgresql.org/message-id/1509611428.3268.5.camel%40cybertec.at Having been the victim of this problem in the past, I like the risk based approach to this. If the cost of being wrong in the estimate is high, use a merge join instead. In every case that I have encountered, that heuristic would have given the correct performant plan. Regards, Ken
Re: Use simplehash.h instead of dynahash in SMgr
I'd been thinking of this patch again. When testing with simplehash, I found that the width of the hash bucket type was fairly critical for getting good performance from simplehash.h. With simplehash.h I didn't manage to narrow this any more than 16 bytes. I needed to store the 32-bit hash value and a pointer to the data. On a 64-bit machine, with padding, that's 16-bytes.

I've been thinking about a way to narrow this down further to just 8 bytes and also solve the stable pointer problem at the same time... I've come up with a new hash table implementation that I've called generichash. It works similarly to simplehash in regards to the linear probing, only instead of storing the data in the hash bucket, we just store a uint32 index that indexes off into an array. To keep the pointers in that array stable, we cannot resize the array as the table grows. Instead, I just allocate another array of the same size. Since these arrays are always sized as powers of 2, it's very fast to index into them using the uint32 index that's stored in the bucket. Unused buckets just store the special index of 0x.

I've also proposed to use this hash table implementation over in [1] to speed up LockReleaseAll(). The 0001 patch here is just the same as the patch from [1]. The 0002 patch includes using a generichash hash table for SMgr.

The performance using generichash.h is about the same as the simplehash.h version of the patch. Although, the test was not done on the same version of master.

Master (97b713418)

drowley@amd3990x:~$ tail -f pg.log | grep "redo done"
CPU: user: 124.85 s, system: 6.83 s, elapsed: 131.74 s
CPU: user: 115.01 s, system: 4.76 s, elapsed: 119.83 s
CPU: user: 122.13 s, system: 6.41 s, elapsed: 128.60 s
CPU: user: 113.85 s, system: 6.11 s, elapsed: 120.02 s
CPU: user: 121.40 s, system: 6.28 s, elapsed: 127.74 s
CPU: user: 113.71 s, system: 5.80 s, elapsed: 119.57 s
CPU: user: 113.96 s, system: 5.90 s, elapsed: 119.92 s
CPU: user: 122.74 s, system: 6.21 s, elapsed: 129.01 s
CPU: user: 122.00 s, system: 6.38 s, elapsed: 128.44 s
CPU: user: 113.06 s, system: 6.14 s, elapsed: 119.25 s
CPU: user: 114.42 s, system: 4.35 s, elapsed: 118.82 s
Median: 120.02 s

master + v1 + v2

drowley@amd3990x:~$ tail -n 0 -f pg.log | grep "redo done"
CPU: user: 107.75 s, system: 4.61 s, elapsed: 112.41 s
CPU: user: 108.07 s, system: 4.49 s, elapsed: 112.61 s
CPU: user: 106.89 s, system: 5.55 s, elapsed: 112.49 s
CPU: user: 107.42 s, system: 5.64 s, elapsed: 113.12 s
CPU: user: 106.85 s, system: 4.42 s, elapsed: 111.31 s
CPU: user: 107.36 s, system: 4.76 s, elapsed: 112.16 s
CPU: user: 107.20 s, system: 4.47 s, elapsed: 111.72 s
CPU: user: 106.94 s, system: 5.89 s, elapsed: 112.88 s
CPU: user: 115.32 s, system: 6.12 s, elapsed: 121.49 s
CPU: user: 108.02 s, system: 4.48 s, elapsed: 112.54 s
CPU: user: 106.93 s, system: 4.54 s, elapsed: 111.51 s
Median: 112.49 s

So about a 6.69% speedup

David

[1] https://www.postgresql.org/message-id/caaphdvokqwrxw5nnupz8+majkhpopxygoy1gqdh0wes4+bi...@mail.gmail.com

v1-0001-Add-a-new-hash-table-type-which-has-stable-pointe.patch Description: Binary data
v1-0002-Use-generichash.h-hashtables-in-SMgr.patch Description: Binary data
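To visualize the scheme described above, the bucket and addressing arithmetic amount to roughly this (my own illustrative sketch of the description, not code from the patch; the names, the exact 8-byte bucket layout, and the UINT32_MAX sentinel value are assumptions):

#include <stddef.h>
#include <stdint.h>

#define GH_UNUSED_BUCKET   0xFFFFFFFF   /* assumed sentinel for an empty bucket */
#define GH_SEGMENT_SHIFT   10           /* 2^10 elements per segment (example) */
#define GH_SEGMENT_SIZE    (1U << GH_SEGMENT_SHIFT)
#define GH_SEGMENT_MASK    (GH_SEGMENT_SIZE - 1)

/* An 8-byte bucket: (presumably) the cached hash plus an element index. */
typedef struct GHBucket
{
    uint32_t    hashvalue;
    uint32_t    index;          /* GH_UNUSED_BUCKET if empty */
} GHBucket;

typedef struct GHTable
{
    GHBucket   *buckets;        /* open-addressed, linearly probed */
    void      **segments;       /* fixed-size element arrays; never resized or
                                 * moved, so element pointers stay stable as
                                 * the table grows */
    /* ... sizes, element type info, etc. ... */
} GHTable;

/* Elements are addressed by splitting the 32-bit index into segment/offset. */
static inline void *
gh_element_addr(GHTable *tab, uint32_t index, size_t elem_size)
{
    char       *segment = (char *) tab->segments[index >> GH_SEGMENT_SHIFT];

    return segment + (size_t) (index & GH_SEGMENT_MASK) * elem_size;
}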
Re: PG 14 release notes, first draft
On Mon, Jun 21, 2021 at 02:47:16PM +0900, Tatsuo Ishii wrote: > > I got the parse error after applying the patch: > > > > release-14.sgml:3562: parser error : Input is not proper UTF-8, > > indicate encoding ! > > Bytes: 0xE9 0x20 0x53 0x61 > > (Juan Jos Santamara Flecha) > > ^ > > > > Is that a problem with my environment? > > Me too. I think the problem is, Bruce's patch is encoded in > ISO-8859-1, not UTF-8. As far as I know PostgreSQL never encodes > *.sgml files in ISO-8859-1. Anyway, attached is the Bruce's patch > encoded in UTF-8. This works for me. > > My guess is, when Bruce attached the file, his MUA automatically > changed the file encoding from UTF-8 to ISO-8859-1 (it could happen in > many MUA). Also that's the reason why he does not see the problem > while compiling the sgml files. In his environment release-14.sgml is > encoded in UTF-8, I guess. To prevent the problem next time, it's > better to change the mime type of the attached file to > Application/Octet-Stream. Oh, people were testing by building from the attached patch, not from the git tree. Yes, I see now the email was switched to a single-byte encoding, and the attachment header confirms it: Content-Type: text/x-diff; charset=iso-8859-1 -- Content-Disposition: attachment; filename="master.diff" Content-Transfer-Encoding: 8bit I guess my email program, mutt, is trying to be helpful by using a single-byte encoding when UTF is not necessary, which I guess makes sense. I will try to remember this can cause problems with SGML attachments. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Use extended statistics to estimate (Var op Var) clauses
On Sun, 13 Jun 2021 at 21:28, Tomas Vondra wrote: > > Here is a slightly updated version of the patch > > The main issue I ran into > is the special case clauselist_selectivity, which does > > if (list_length(clauses) == 1) > return clause_selectivity_ext(...); > > which applies to cases like "WHERE a < b" which can now be handled by > extended statistics, thanks to this patch. But clause_selectivity_ext > only used to call restriction_selectivity for these clauses, which does > not use extended statistics, of course. > > I considered either getting rid of the special case, passing everything > through extended stats, including cases with a single clause. But that > ends up affecting e.g. OR clauses, so I tweaked clause_selectivity_ext a > bit, which seems like a better approach. Yeah, I looked at this a few months ago, while looking at the other extended stats stuff, and I came to the same conclusion that solving this issue by tweaking clause_selectivity_ext() is the best approach. I haven't looked at the patch in much detail yet, but I think the basic approach looks OK. There are a few comments that need updating, e.g.: - In statext_is_compatible_clause_internal(), before the "if (is_opclause(clause))" test. - The description of the arguments for examine_opclause_args(). I wonder if "expronleftp" for examine_opclause_args() should be "constonrightp", or some such -- as it stands it's being set to false for an Expr Op Expr clause, which doesn't seem right because there *is* an expression on the left. Regards, Dean
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 6:41 AM David Rowley wrote: > For example, in an unparameterized Nested Loop that estimates the > outer Path to have 1 row will cost an entire additional inner cost if > there are 2 rows. With Hash Join the cost is just an additional > hashtable lookup, which is dead cheap. I don't know exactly how > add_path() would weigh all that up, but it seems to me that I wouldn't > take the risk unless I was 100% certain that the Nested Loop's outer > Path would only return 1 row exactly, if there was any chance at all > it could return more, I'd be picking some other join method. It seems like everyone agrees that it would be good to do something about this problem, but the question is whether it's best to do something that tries to be general, or whether we should instead do something about this specific case. I favor the latter approach. Risk and uncertainty exist all over the place, but dealing with that in a general way seems difficult, and maybe unnecessary. Addressing the case of unparameterized nest loops specifically seems simpler, because it's easier to reason about what the alternatives are. Your last sentence here seems right on point to me. Basically, what you argue for there is disabling unparameterized nested loops entirely except when we can prove that the inner side will never generate more than one row. But, that's almost never going to be something that we can prove. If the inner side is coming from a table or sub-join, it can turn out to be big. As far as I can see, the only way that this doesn't happen is if it's something like a subquery that aggregates everything down to one row, or has LIMIT 1, but those are such special cases that I don't even think we should be worrying about them. So my counter-proposal is: how about if we split merge_unsorted_outer() into two functions, one of which generates nested loops only based on parameterized paths and the other of which generates nested loops based only on unparameterized paths, and then rejigger add_paths_to_joinrel() so that we do the latter between the steps that are currently number 5 and 6 and only if we haven't got any other paths yet? If somebody later comes up with logic for proving that the inner side can never have more than 1 row, we can let this be run in those cases as well. In the meantime, if somebody does something like a JOIN b ON a.x < b.x, we will still generate these paths because there's no other approach, or similarly if it's a.x = b.x but for some strange type that doesn't have a hash-joinable or merge-joinable equality operator. But otherwise we omit those paths entirely. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use simplehash.h instead of dynahash in SMgr
On Mon, Jun 21, 2021 at 10:15 AM David Rowley wrote: > I've come up with a new hash table implementation that I've called > generichash. At the risk of kibitzing the least-important detail of this proposal, I'm not very happy with the names of our hash implementations. simplehash is not especially simple, and dynahash is not particularly dynamic, especially now that the main place we use it is for shared-memory hash tables that can't be resized. Likewise, generichash doesn't really give any kind of clue about how this hash table is different from any of the others. I don't know how possible it is to do better here; naming things is one of the two hard problems in computer science. In a perfect world, though, our hash table implementations would be named in such a way that somebody might be able to look at the names and guess on that basis which one is best-suited to a given task. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Mon, Jun 21, 2021 at 12:56 AM Amit Kapila wrote: > I thought if we scan a system catalog using DirtySnapshot, then we > should be able to find such a relation. But, if the system catalog is > updated after our scan then surely we won't be able to see it and in > that case, we won't be able to send invalidation. Now, say if the rel > is not visible to us because of the snapshot we used or due to some > race condition then we won't be able to send the invalidation but why > we want to consider it worse than the case where we miss such > invalidations (invalidations due to change of parallel-safe property) > when the insertion into relation is in-progress. A concurrent change is something quite different than a change that happened some time in the past. We all know that DROP TABLE blocks if it is run while the table is in use, and everybody considers that acceptable, but if DROP TABLE were to block because the table was in use at some previous time, everybody would complain, and rightly so. The same principle applies here. It's not possible to react to a change that happens in the middle of the query. Somebody could argue that we ought to lock all the functions we're using against concurrent changes so that attempts to change their properties block on a lock rather than succeeding. But given that that's not how it works, we can hardly go back in time and switch to a non-parallel plan after we've already decided on a parallel one. On the other hand, we should be able to notice a change that has *already completed* at the time we do planning. I don't see how we can blame failure to do that on anything other than bad coding. > Yeah, the session in which we are doing Alter Function won't be able > to lock it but it will wait for the AccessExclusiveLock on the rel to > be released because it will also try to acquire it before sending > invalidation. I think users would not be very happy with such behavior. Users accept that if they try to access a relation, they might end up needing to wait for a lock on it, but what you are proposing here might make a session block waiting for a lock on a relation which it never attempted to access. I think this whole line of attack is a complete dead-end. We can invent new types of invalidations if we want, but they need to be sent based on which objects actually got changed, not based on what we think might be affected indirectly as a result of those changes. It's reasonable to regard something like a trigger or constraint as a property of the table because it is really a dependent object. It is associated with precisely one table when it is created and the association can never be changed. On the other hand, functions clearly have their own existence. They can be created and dropped independently of any table and the tables with which they are associated can change at any time. In that kind of situation, invalidating the table based on changes to the function is riddled with problems which I am pretty convinced we're never going to be able to solve. I'm not 100% sure what we ought to do here, but I'm pretty sure that looking up the tables that happen to be associated with the function in the session that is modifying the function is not it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add version macro to libpq-fe.h
On Sat, Jun 19, 2021 at 11:45 AM Tom Lane wrote: > Hearing no further comments, done that way. What will prevent us from forgetting to do something about this again, a year from now? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Maintaining a list of pgindent commits for "git blame" to ignore
Peter Geoghegan writes: > What do you think of the attached? I prefer the ISO date style myself, > so I went with that. I grepped the git logs for "indent" and found a bunch more commits that look like they should be included: db6e2b4c5 d84213909 1e9c85809 f04d4ac91 9ef2dbefc 651902deb ce5548103 b5bce6c1e de94e2af1 d0cd7bda9 befa3e648 7584649a1 84288a86a d74714027 b6b71b85b 46785776c 089003fb4 ea08e6cd5 59f6a57e5 It's possible that some of these touch few enough lines that they are not worth listing; I did not check the commit delta sizes. > Note that I have included "Modify BufferGetPage() to prepare for > "snapshot too old" feature", as well as "Revert no-op changes to > BufferGetPage()". I've noticed that those two particular commits cause > unhelpful noise when I run "git blame" on access method code. Meh. I can get on board with the idea of adding commit+revert pairs to this list, but I'd like a more principled selection filter than "which ones bother Peter". Maybe the size of the reverted patch should factor into it. Do we have an idea of how much adding entries to this list slows down "git blame"? If we include commit+revert pairs more than very sparingly, I'm afraid we'll soon have an enormous list, and I wonder what that will cost us. regards, tom lane
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 7:45 AM Robert Haas wrote: > On Mon, Jun 21, 2021 at 6:41 AM David Rowley wrote: > > For example, in an unparameterized Nested Loop that estimates the > > outer Path to have 1 row will cost an entire additional inner cost if > > there are 2 rows. With Hash Join the cost is just an additional > > hashtable lookup, which is dead cheap. I don't know exactly how > > add_path() would weigh all that up, but it seems to me that I wouldn't > > take the risk unless I was 100% certain that the Nested Loop's outer > > Path would only return 1 row exactly, if there was any chance at all > > it could return more, I'd be picking some other join method. > > It seems like everyone agrees that it would be good to do something > about this problem, but the question is whether it's best to do > something that tries to be general, or whether we should instead do > something about this specific case. I favor the latter approach. I agree with your conclusion, but FWIW I am sympathetic to David's view too. I certainly understand why he'd naturally want to define the class of problems that are like this one, to understand what the boundaries are. The heuristic that has the optimizer flat out avoids an unparameterized nested loop join is justified by the belief that that's fundamentally reckless. Even though we all agree on that much, I don't know when it stops being reckless and starts being "too risky for me, but not fundamentally reckless". I think that that's worth living with, but it isn't very satisfying. > Risk > and uncertainty exist all over the place, but dealing with that in a > general way seems difficult, and maybe unnecessary. Addressing the > case of unparameterized nest loops specifically seems simpler, because > it's easier to reason about what the alternatives are. Your last > sentence here seems right on point to me. Right. Part of why this is a good idea is that the user is exposed to so many individual risks and uncertainties. We cannot see any one risk as existing in a vacuum. It is not the only risk the user will ever take in the planner -- if it was then it might actually be okay to allow unparameterized nested loop joins. A bad unparameterized nested loop join plan has, in a sense, unknown and unbounded cost/downside. But it is only very slightly faster than a hash join, by a fixed well understood amount. With enough "trials" and on a long enough timeline, it will inevitably blow up and cause the application to grind to a halt. It seems like no amount of fixed, bounded benefit from "fast unparameterized nested loop joins" could possibly make up for that. The life of Postgres users would be a lot better if bad plans were at least "survivable events". -- Peter Geoghegan
Re: Add version macro to libpq-fe.h
Robert Haas writes: > What will prevent us from forgetting to do something about this again, > a year from now? As long as we notice it before 15.0, we can fix it retroactively, as we just did for 14. For that matter, fixing before 15.1 or so would likely be Good Enough. But realistically, how is this any worse of a problem than a hundred other easily-forgotten coding rules we have? We manage to uphold most of them most of the time. regards, tom lane
Re: Use simplehash.h instead of dynahash in SMgr
Robert Haas writes: > On Mon, Jun 21, 2021 at 10:15 AM David Rowley wrote: >> I've come up with a new hash table implementation that I've called >> generichash. > At the risk of kibitzing the least-important detail of this proposal, > I'm not very happy with the names of our hash implementations. I kind of wonder if we really need four different hash table implementations (this being the third "generic" one, plus hash join has its own, and I may have forgotten others). Should we instead think about revising simplehash to gain the benefits of this patch? regards, tom lane
Re: disfavoring unparameterized nested loops
Peter Geoghegan writes: > The heuristic that has the optimizer flat out avoids an > unparameterized nested loop join is justified by the belief that > that's fundamentally reckless. Even though we all agree on that much, > I don't know when it stops being reckless and starts being "too risky > for me, but not fundamentally reckless". I think that that's worth > living with, but it isn't very satisfying. There are certainly cases where the optimizer can prove (in principle; it doesn't do so today) that a plan node will produce at most one row. They're hardly uncommon either: an equality comparison on a unique key, or a subquery with a simple aggregate function, come to mind. In such cases, not only is this choice not reckless, but it's provably superior to a hash join. So in the end this gets back to the planning risk factor that we keep circling around but nobody quite wants to tackle. I'd be a lot happier if this proposal were couched around some sort of estimate of the risk of the outer side producing more than the expected number of rows. The arguments so far seem like fairly lame rationalizations for not putting forth the effort to do that. regards, tom lane
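To illustrate the two "provably at most one row" cases Tom mentions (using hypothetical tables, not an example from the thread), both of these outer inputs are guaranteed to return no more than a single row:

    -- equality comparison on a unique key: at most one row can match
    SELECT *
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
    WHERE o.order_id = 42;   -- order_id is the primary key of orders

    -- subquery with a simple aggregate: always exactly one row
    SELECT *
    FROM order_lines l
    JOIN (SELECT max(order_id) AS m FROM orders) s ON l.order_id = s.m;

With the one-row input on the outer side, an unparameterized nested loop scans the other relation exactly once, so there is no multiplicative blow-up to guard against.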
Re: Add version macro to libpq-fe.h
> On 21 Jun 2021, at 17:27, Robert Haas wrote: > > On Sat, Jun 19, 2021 at 11:45 AM Tom Lane wrote: >> Hearing no further comments, done that way. > > What will prevent us from forgetting to do something about this again, > a year from now? An entry in a release checklist could perhaps be an idea? -- Daniel Gustafsson https://vmware.com/
Re: Detecting File Damage & Inconsistencies
On Thu, Mar 18, 2021 at 6:20 AM Craig Ringer wrote: > > On Mon, 15 Mar 2021 at 21:01, David Steele wrote: >> >> On 11/18/20 5:23 AM, Simon Riggs wrote: >> > On Wed, 18 Nov 2020 at 06:42, Craig Ringer >> > wrote: >> >> >> >> On Fri, Nov 13, 2020 at 7:24 PM Simon Riggs wrote: >> >>> >> >>> >> >>> What I'm proposing is an option to add 16 bytes onto each COMMIT >> >>> record >> >> >> >> >> >> Would it make sense to write this at the time we write a topxid >> >> assignment to WAL instead? >> >> >> >> Otherwise it won't be accessible to streaming-mode logical decoding. >> > >> > Do you mean extend the xl_xact_assignment record? My understanding is >> > that is not sent in all cases, so not sure what you mean by "instead". >> >> Craig, can you clarify? > > > Right. Or write a separate WAL record when the feature is enabled. But it's > probably sufficient to write it as an optional chunk on xl_xact_assignment > records. We often defer writing them so we can optimise away xacts that never > actually wrote anything, but IIRC we still write one before we write any WAL > that references the xid. That'd be fine, since we don't need the info any > sooner than that during decoding. I'd have to double check that we write it > in all cases and won't get to that too soon, but I'm pretty sure we do... The commit record is optimized away if no xid is assigned, though is still present if we didn't write any WAL records. But if a commit record exists in the WAL stream, we want to know where it came from. A later patch will add PITR capability based on this information so attaching it directly to the commit record is fairly important, IMHO. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 8:55 AM Tom Lane wrote: > There are certainly cases where the optimizer can prove (in principle; > it doesn't do so today) that a plan node will produce at most one row. > They're hardly uncommon either: an equality comparison on a unique > key, or a subquery with a simple aggregate function, come to mind. That sounds like it might be useful in general. > In such cases, not only is this choice not reckless, but it's provably > superior to a hash join. So in the end this gets back to the planning > risk factor that we keep circling around but nobody quite wants to > tackle. Let's assume for the sake of argument that we really have to have that additional infrastructure to move forward with the idea. (I'm not sure if it's possible in principle to use infrastructure like that for some of the cases that Robert has in mind, but for now I'll assume that it is both possible and a practical necessity.) Even when I make this working assumption I don't see what it changes at a fundamental level. You've merely come up with a slightly more specific definition of the class of plans that are "reckless". You've only refined the original provisional definition of "reckless" to exclude specific "clearly not reckless" cases (I think). But the definition of "reckless" is no less squishy than what we started out with. > I'd be a lot happier if this proposal were couched around some sort > of estimate of the risk of the outer side producing more than the > expected number of rows. The arguments so far seem like fairly lame > rationalizations for not putting forth the effort to do that. I'm not so sure that it is. The point isn't the risk, even if it could be calculated. The point is the downsides of being wrong are huge and pretty much unbounded, whereas the benefits of being right are tiny and bounded. It almost doesn't matter what the underlying probabilities are. To be clear I'm not arguing against modelling risk. I'm just not sure that it's useful to think of this problem as truly a problem of risk. -- Peter Geoghegan
Re: Add version macro to libpq-fe.h
Daniel Gustafsson writes: > On 21 Jun 2021, at 17:27, Robert Haas wrote: >> What will prevent us from forgetting to do something about this again, >> a year from now? > An entry in a release checklist could perhaps be an idea? Yeah, I was wondering if adding an entry to RELEASE_CHANGES would be helpful. Again, I'm not sure that this coding rule is much more likely to be violated than any other. On the other hand, the fact that it's not critical until we approach release does suggest that maybe it'd be useful to treat it as a checklist item. regards, tom lane
Re: disfavoring unparameterized nested loops
On 6/21/21 5:55 PM, Tom Lane wrote:
> Peter Geoghegan writes:
>> The heuristic that has the optimizer flat out avoids an
>> unparameterized nested loop join is justified by the belief that
>> that's fundamentally reckless. Even though we all agree on that much,
>> I don't know when it stops being reckless and starts being "too risky
>> for me, but not fundamentally reckless". I think that that's worth
>> living with, but it isn't very satisfying.
>
> There are certainly cases where the optimizer can prove (in principle;
> it doesn't do so today) that a plan node will produce at most one row.
> They're hardly uncommon either: an equality comparison on a unique
> key, or a subquery with a simple aggregate function, come to mind.
> In such cases, not only is this choice not reckless, but it's provably
> superior to a hash join. So in the end this gets back to the planning
> risk factor that we keep circling around but nobody quite wants to
> tackle.

Agreed.

> I'd be a lot happier if this proposal were couched around some sort
> of estimate of the risk of the outer side producing more than the
> expected number of rows. The arguments so far seem like fairly lame
> rationalizations for not putting forth the effort to do that.

I agree having such a measure would be helpful, but do you have an idea
how it could be implemented? I've been thinking about this a bit
recently and searching for papers talking about this, but it's not clear
to me how to calculate the risk (and propagate it through the plan)
without making the whole cost evaluation way more complicated /
expensive :-(

The "traditional approach" to quantifying risk would be confidence
intervals, i.e. for each cardinality estimate "e" we'd determine a range
[a,b] so that P(a <= e <= b) = p. So we could pick "p" depending on how
"robust" the plan choice should be (say 0.9 for "risky" and 0.999 for
"safe" plans) and calculate a,b. Or maybe we could calculate where the
plan changes, and then we'd see if those "break points" are within the
confidence interval. If not, great - we can assume the plan is stable,
otherwise we'd need to consider the other plans too, somehow.

But what I'm not sure about is:

1) Now we're dealing with three cardinality estimates (the original "e"
and the boundaries "a", "b"). So which one do we use to calculate cost
and pass to upper parts of the plan?

2) The outer relation may be a complex join, so we'd need to combine the
confidence intervals for the two input relations, somehow.

3) We'd need to know how to calculate the confidence intervals for most
plan nodes, which I'm not sure we know how to do.

So it's not clear to me how to do this, which seems rather problematic
because we need to propagate and combine those confidence intervals
through the plan. But maybe you have thought about some much simpler
approach, addressing this sufficiently well?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Add version macro to libpq-fe.h
On 6/21/21 12:34 PM, Tom Lane wrote: > Daniel Gustafsson writes: >> On 21 Jun 2021, at 17:27, Robert Haas wrote: >>> What will prevent us from forgetting to do something about this again, >>> a year from now? >> An entry in a release checklist could perhaps be an idea? > Yeah, I was wondering if adding an entry to RELEASE_CHANGES would be > helpful. Again, I'm not sure that this coding rule is much more > likely to be violated than any other. On the other hand, the fact > that it's not critical until we approach release does suggest that > maybe it'd be useful to treat it as a checklist item. > > Maybe for release note preparation, since that's focused on new features, but this doesn't sound like a release prep function to me. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Discarding DISCARD ALL
On Wed, Jan 20, 2021 at 3:53 PM James Coleman wrote: > > On Wed, Jan 20, 2021 at 9:58 AM Simon Riggs wrote: > > > > On Wed, 20 Jan 2021 at 14:21, James Coleman wrote: > > > > > An alternative approach that occurred to me while typing this reply: a > > > setting in Postgres that would disallow setting session level GUCs > > > (i.e., enforce `SET LOCAL` transaction level usage instead) would > > > remove a large chunk of our need to set server_reset_query_always=1 > > > (and more interestingly it'd highlight when broken code gets pushed). > > > But even with that, I see some value in the proposed setting since > > > there is additional session state beyond GUCs. > > > > With transaction_cleanup=on we could force all SETs to be SET LOCAL. > > > > The point is that if we declare ahead of time that the transaction > > will be reset then we can act differently and more easily for various > > circumstances, for SETs, for Temp tables and others. > > Right, I agree it's independently useful. My "alternative" is a subset > of that functionality and doesn't cover as many cases. So if we go for that option, would we call it? session_state = 'session' (default) | 'local_set' If you use 'local' then you find that all state is transaction only * SET defaults to meaning SET LOCAL * SET SESSION returns ERROR -- Simon Riggshttp://www.EnterpriseDB.com/
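A rough sketch of how the proposed setting might behave (nothing here exists today; the GUC and value names are just the placeholders from this thread):

    SET session_state = 'local_set';   -- proposed GUC, not implemented

    BEGIN;
    SET work_mem = '256MB';            -- would be treated as SET LOCAL
    COMMIT;                            -- the setting would revert here

    SET SESSION work_mem = '256MB';    -- would be rejected with an ERROR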
Re: disfavoring unparameterized nested loops
Peter Geoghegan writes: > On Mon, Jun 21, 2021 at 8:55 AM Tom Lane wrote: >> I'd be a lot happier if this proposal were couched around some sort >> of estimate of the risk of the outer side producing more than the >> expected number of rows. The arguments so far seem like fairly lame >> rationalizations for not putting forth the effort to do that. > I'm not so sure that it is. The point isn't the risk, even if it could > be calculated. The point is the downsides of being wrong are huge and > pretty much unbounded, whereas the benefits of being right are tiny > and bounded. It almost doesn't matter what the underlying > probabilities are. You're throwing around a lot of pejorative adjectives there without having bothered to quantify any of them. This sounds less like a sound argument to me than like a witch trial. Reflecting for a bit on the ancient principle that "the only good numbers in computer science are 0, 1, and N", it seems to me that we could make an effort to classify RelOptInfos as provably empty, provably at most one row, and others. (This would be a property of relations, not individual paths, so it needn't bloat struct Path.) We already understand about provably-empty rels, so this is just generalizing that idea a little. Once we know about that, at least for the really-common cases like unique keys, I'd be okay with a hard rule that we don't consider unparameterized nestloop joins with an outer side that falls in the "N" category. Unless there's no alternative, of course. Another thought that occurs to me here is that maybe we could get rid of the enable_material knob in favor of forcing (or at least encouraging) materialization when the outer side isn't provably one row. regards, tom lane
Re: disfavoring unparameterized nested loops
Tomas Vondra writes: > I've been thinking about this a bit recently and searching for papers > talking about this, and but it's not clear to me how to calculate the > risk (and propagate it through the plan) without making the whole cost > evaluation way more complicated / expensive :-( Yeah, a truly complete approach using confidence intervals or the like seems frighteningly complicated. > But maybe you have thought about some much simpler approach, addressing > this sufficiently well? See my nearby response to Peter. The main case that's distressing me is the possibility of forcing a hash join even when the outer side is obviously only one row. If we could avoid that, at least for large values of "obvious", I'd be far more comfortable. regards, tom lane
Out-of-bounds (src/backend/utils/misc/queryjumble.c)
Hi, Per Coverity: three out-of-bounds reports at function AppendJumble. They have the face, smell, and color of a typo, and we usually increment the character count after a memcpy. Coverity no longer complained after the patch. Thoughts? regards, Ranier Vilela fix_out_of_bounds_queryjumble.patch Description: Binary data
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 9:52 AM Tom Lane wrote: > > I'm not so sure that it is. The point isn't the risk, even if it could > > be calculated. The point is the downsides of being wrong are huge and > > pretty much unbounded, whereas the benefits of being right are tiny > > and bounded. It almost doesn't matter what the underlying > > probabilities are. > > You're throwing around a lot of pejorative adjectives there without > having bothered to quantify any of them. This sounds less like a sound > argument to me than like a witch trial. I'm not sure what you mean by pejorative. Isn't what I said about downsides and upsides pretty accurate? > Reflecting for a bit on the ancient principle that "the only good numbers > in computer science are 0, 1, and N", it seems to me that we could make > an effort to classify RelOptInfos as provably empty, provably at most one > row, and others. (This would be a property of relations, not individual > paths, so it needn't bloat struct Path.) We already understand about > provably-empty rels, so this is just generalizing that idea a little. It sounds like you're concerned about properly distinguishing between: 1. Cases where the only non-reckless choice is a hash join over a unparameterized nested loop join 2. Cases that look like that at first, that don't really have that quality on closer examination. This seems like a reasonable concern. > Once we know about that, at least for the really-common cases like unique > keys, I'd be okay with a hard rule that we don't consider unparameterized > nestloop joins with an outer side that falls in the "N" category. > Unless there's no alternative, of course. I thought that you were arguing against the premise itself. It's now clear that you weren't, though. I don't have an opinion for or against bringing the provably-empty rels stuff into the picture. At least not right now. -- Peter Geoghegan
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 11:55 AM Tom Lane wrote: > There are certainly cases where the optimizer can prove (in principle; > it doesn't do so today) that a plan node will produce at most one row. > They're hardly uncommon either: an equality comparison on a unique > key, or a subquery with a simple aggregate function, come to mind. Hmm, maybe I need to see an example of the sort of plan shape that you have in mind. To me it feels like a comparison on a unique key ought to use a *parameterized* nested loop. And it's also not clear to me why a nested loop is actually better in a case like this. If the nested loop iterates more than once because there are more rows on the outer side, then you don't want to have something on the inner side that might be expensive, and either an aggregate or an unparameterized search for a unique value are potentially quite expensive. Now if you put a materialize node on top of the inner side, then you don't have to worry about that problem, but how much are you saving at that point vs. just doing a hash join? > I'd be a lot happier if this proposal were couched around some sort > of estimate of the risk of the outer side producing more than the > expected number of rows. The arguments so far seem like fairly lame > rationalizations for not putting forth the effort to do that. I don't understand how to generate a risk assessment or what we ought to do with it if we had one. I don't even understand what units we would use. We measure costs using abstract cost units, but those abstract cost units are intended to be a proxy for runtime. If it's not the case that a plan that runs for longer has a higher cost, then something's wrong with the costing model or the settings. In the case of risk, the whole thing seems totally subjective. We're talking about the risk that our estimate is bogus, but how do we estimate the risk that we don't know how to estimate? Given quals (x + 0) = x, x = some MCV, and x = some non-MCV, we can say that we're most likely to be wrong about the first one and least likely to be wrong about the second one, but how much more likely? I don't know how you can decide that, even in principle. We can also say that an unparameterized nested loop is more risky than some other plan because it could turn out to be crazy expensive, but is that more or less risky than scanning the index on b as a way to implement SELECT * FROM foo WHERE a = 1 ORDER BY b LIMIT 1? How much more risky, and why? And then, even supposing we had a risk metric for every path, what exactly would we do with it? Suppose path A is cheaper than path B, but also more risky. Which should we keep? We could keep both, but that seems to be just kicking the can down the road. If plan B is likely to blow up in our face, we should probably just get rid of it, or not even generate it in the first place. Even if we do keep both, at some point we're going to have to make a cost-vs-risk tradeoff, and I don't see how to do that intelligently, because the point is precisely that if the risk is high, the cost number might be totally wrong. If we know that plan A is cheaper than plan B, we should choose plan A. But if all we know is that plan A would be cheaper than plan B if our estimate of the cost were correct, but also that it probably isn't, then what we actually know is nothing. We have no principled basis for deciding anything based on cost unless we're reasonably confident that the cost estimate is pretty good. 
So AFAICT the only principled strategy here is to throw away high risk paths as early as we possibly can. What am I missing? The other thing is - the risk of a particular path doesn't matter in an absolute sense, only a relative one. In the particular case I'm on about here, we *know* there's a less-risky alternative. We don't need to quantify the risk to know which of the two options has more. In many other cases, the risk is irreducible e.g. a default estimate could be totally bogus, but switching paths is of no help in getting rid of it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Out-of-bounds (src/backend/utils/misc/queryjumble.c)
Ranier Vilela writes: > Per Coverity. > 3 out-of-bounds at function AppendJumble. > They have the face, smell and color of typo. > And we usually increment the character count after a memcpy. > Coverity no longer complained after the patch. > Thoughts? This patch is incorrect on its face, as you would know if you'd spent even a couple minutes absorbing the comment in that function. I wonder about Coverity here ... independently of whether the hash-accumulation logic does what we want, it looks to me like the proposed change doesn't so much remove a buffer overrun as create one. It would break the property jumble_len < JUMBLE_SIZE that the subsequent lines rely on. Please stop sending us random patches and expecting us to sort out which ones are valid. You're rapidly approaching the status of "boy who cried wolf too many times". regards, tom lane
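For readers who have not opened queryjumble.c, the hash-accumulation logic in question follows the general accumulate-then-fold pattern sketched below (a simplified illustration with invented names, not the actual PostgreSQL code): bytes are appended to a fixed-size buffer, and whenever the buffer fills up its contents are folded into a running hash and the buffer is reset, which is why the length counter has to stay below the buffer size before each copy rather than being bumped first.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BUF_SIZE 1024

    typedef struct Accum
    {
        uint64_t      hash;           /* running hash of folded chunks */
        size_t        len;            /* bytes currently in buf, never > BUF_SIZE */
        unsigned char buf[BUF_SIZE];
    } Accum;                          /* zero-initialize before first use */

    /* stand-in for the real hash function; any 64-bit hash will do here */
    static uint64_t
    fold(uint64_t seed, const unsigned char *data, size_t n)
    {
        uint64_t    h = seed ^ UINT64_C(0xcbf29ce484222325);

        for (size_t i = 0; i < n; i++)
            h = (h ^ data[i]) * UINT64_C(0x100000001b3);
        return h;
    }

    static void
    accum_append(Accum *a, const unsigned char *item, size_t size)
    {
        while (size > 0)
        {
            size_t      part;

            /* fold and reset *before* copying, so len can never exceed BUF_SIZE */
            if (a->len >= BUF_SIZE)
            {
                a->hash = fold(a->hash, a->buf, a->len);
                a->len = 0;
            }

            /* copy only as much as still fits in the buffer */
            part = (size < BUF_SIZE - a->len) ? size : BUF_SIZE - a->len;
            memcpy(a->buf + a->len, item, part);
            a->len += part;           /* incremented only after the bounded copy */
            item += part;
            size -= part;
        }
    }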
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 1:11 PM Peter Geoghegan wrote: > On Mon, Jun 21, 2021 at 9:52 AM Tom Lane wrote: > > > I'm not so sure that it is. The point isn't the risk, even if it could > > > be calculated. The point is the downsides of being wrong are huge and > > > pretty much unbounded, whereas the benefits of being right are tiny > > > and bounded. It almost doesn't matter what the underlying > > > probabilities are. > > > > You're throwing around a lot of pejorative adjectives there without > > having bothered to quantify any of them. This sounds less like a sound > > argument to me than like a witch trial. > > I'm not sure what you mean by pejorative. Isn't what I said about > downsides and upsides pretty accurate? Yeah, I don't see why Peter's characterization deserves to be labelled as pejorative here. A hash join or merge join or parameterized nested loop can turn out to be slower than some other algorithm, but all of those approaches have some component that tends to make the asymptotic cost less than the product of the sizes of the inputs. I don't think that's true in absolutely every case; for example, if a merge join has every row duplicated on both sides, it will have to scan every inner tuple once per outer tuple, just like a nested loop, and the other algorithms also are going to degrade toward O(NM) performance in the face of many duplicates. Also, a hash join can be pretty close to that if it needs a shazillion batches. But in normal cases, any algorithm other than an unparameterized nested loop tends to read each input tuple on each side ONCE, so the cost is more like the sum of the input sizes than the product. And there's nothing pejorative in saying that N + M can be less than N * M by an unbounded amount. That's just the facts. -- Robert Haas EDB: http://www.enterprisedb.com
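To put rough, purely illustrative numbers on that: with N = 1,000 outer rows and M = 1,000,000 inner rows, a hash join reads each input roughly once, on the order of N + M (about 10^6 tuple visits), while an unparameterized nested loop rescans the inner side for every outer row, N * M = 10^9 inner-tuple visits. And if the outer estimate turns out to be ten times too low, the hash join does about 9,000 extra probes, whereas the nested loop does 9,000 extra passes over the million-row inner side.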
Re: disfavoring unparameterized nested loops
Robert Haas writes: > On Mon, Jun 21, 2021 at 11:55 AM Tom Lane wrote: >> There are certainly cases where the optimizer can prove (in principle; >> it doesn't do so today) that a plan node will produce at most one row. >> They're hardly uncommon either: an equality comparison on a unique >> key, or a subquery with a simple aggregate function, come to mind. > Hmm, maybe I need to see an example of the sort of plan shape that you > have in mind. To me it feels like a comparison on a unique key ought > to use a *parameterized* nested loop. The unique-key comparison would be involved in the outer scan in the cases I'm thinking of. As an example, select * from t1, t2 where t1.id = constant and t1.x op t2.y; where I'm not assuming much about the properties of "op". This could be amenable to a plan like NestLoop Join Join Filter: t1.x op t2.y -> Index Scan on t1_pkey Index Cond: t1.id = constant -> Seq Scan on t2 and if we can detect that the pkey indexscan produces just one row, this is very possibly the best available plan. Nor do I think this is an unusual situation that we can just ignore. BTW, it strikes me that there might be an additional consideration here: did parameterization actually help anything? That is, the proposed rule wants to reject the above but allow NestLoop Join -> Index Scan on t1_pkey Index Cond: t1.id = constant -> Seq Scan on t2 Filter: t1.x op t2.y even though the latter isn't meaningfully better. It's possible this won't arise because we don't consider parameterized paths except where the parameter is used in an indexqual or the like, but I'm not confident of that. See in particular reparameterize_path and friends before you assert there's no such issue. So we might need to distinguish essential from incidental parameterization, or something like that. regards, tom lane
Re: disfavoring unparameterized nested loops
I wrote: > ... As an example, > select * from t1, t2 where t1.id = constant and t1.x op t2.y; > where I'm not assuming much about the properties of "op". > This could be amenable to a plan like > NestLoop Join > Join Filter: t1.x op t2.y > -> Index Scan on t1_pkey > Index Cond: t1.id = constant > -> Seq Scan on t2 Also, to enlarge on that example: if "op" isn't hashable then the original argument is moot. However, it'd still be useful to know if the outer scan is sure to return no more than one row, as that could inform the choice whether to plaster a Materialize node on the inner scan. regards, tom lane
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 1:38 PM Tom Lane wrote: > > Hmm, maybe I need to see an example of the sort of plan shape that you > > have in mind. To me it feels like a comparison on a unique key ought > > to use a *parameterized* nested loop. > > The unique-key comparison would be involved in the outer scan in > the cases I'm thinking of. As an example, > > select * from t1, t2 where t1.id = constant and t1.x op t2.y; > > where I'm not assuming much about the properties of "op". > This could be amenable to a plan like > > NestLoop Join > Join Filter: t1.x op t2.y > -> Index Scan on t1_pkey >Index Cond: t1.id = constant > -> Seq Scan on t2 > > and if we can detect that the pkey indexscan produces just one row, > this is very possibly the best available plan. Hmm, yeah, I guess that's possible. How much do you think this loses as compared with: Hash Join Hash Cond: t1.x op t2.y -> Seq Scan on t2 -> Hash -> Index Scan on t1_pkey (If the operator is not hashable then this plan is impractical, but in such a case the question of preferring the hash join over the nested loop doesn't arise anyway.) > BTW, it strikes me that there might be an additional consideration > here: did parameterization actually help anything? That is, the > proposed rule wants to reject the above but allow > > NestLoop Join > -> Index Scan on t1_pkey >Index Cond: t1.id = constant > -> Seq Scan on t2 >Filter: t1.x op t2.y > > even though the latter isn't meaningfully better. It's possible > this won't arise because we don't consider parameterized paths > except where the parameter is used in an indexqual or the like, > but I'm not confident of that. See in particular reparameterize_path > and friends before you assert there's no such issue. So we might > need to distinguish essential from incidental parameterization, > or something like that. Hmm, perhaps. I think it won't happen in the normal cases, but I can't completely rule out the possibility that there are corner cases where it does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: disfavoring unparameterized nested loops
Robert Haas writes: > On Mon, Jun 21, 2021 at 1:38 PM Tom Lane wrote: >> NestLoop Join >> Join Filter: t1.x op t2.y >> -> Index Scan on t1_pkey >>Index Cond: t1.id = constant >> -> Seq Scan on t2 > Hmm, yeah, I guess that's possible. How much do you think this loses > as compared with: > Hash Join > Hash Cond: t1.x op t2.y > -> Seq Scan on t2 > -> Hash > -> Index Scan on t1_pkey Hard to say. The hash overhead might or might not pay for itself. If the equality operator proper is expensive and we get to avoid applying it at most t2 rows, then this might be an OK alternative; otherwise not so much. In any case, the former looks like plans that we generate now, the second not. Do you really want to field a lot of questions about why we suddenly changed to a not-clearly-superior plan? regards, tom lane
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 10:14 AM Robert Haas wrote: > Hmm, maybe I need to see an example of the sort of plan shape that you > have in mind. To me it feels like a comparison on a unique key ought > to use a *parameterized* nested loop. And it's also not clear to me > why a nested loop is actually better in a case like this. If the > nested loop iterates more than once because there are more rows on the > outer side, then you don't want to have something on the inner side > that might be expensive, and either an aggregate or an unparameterized > search for a unique value are potentially quite expensive. Now if you > put a materialize node on top of the inner side, then you don't have > to worry about that problem, but how much are you saving at that point > vs. just doing a hash join? I suspected that that was true, but even that doesn't seem like the really important thing. While it may be true that the simple heuristic can't be quite as simple as we'd hoped at first, ISTM that this is ultimately not much of a problem. The basic fact remains: some more or less simple heuristic makes perfect sense, and should be adapted. This conclusion is counterintuitive because it's addressing a very complicated problem with a very simple solution. However, if we lived in a world where things that sound too good to be true always turned out to be false, we'd also live in a world where optimizers were completely impractical and useless. Optimizers have that quality already, whether or not we officially acknowledge it. > I don't understand how to generate a risk assessment or what we ought > to do with it if we had one. I don't even understand what units we > would use. We measure costs using abstract cost units, but those > abstract cost units are intended to be a proxy for runtime. If it's > not the case that a plan that runs for longer has a higher cost, then > something's wrong with the costing model or the settings. In the case > of risk, the whole thing seems totally subjective. We're talking about > the risk that our estimate is bogus, but how do we estimate the risk > that we don't know how to estimate? Clearly we need a risk estimate for our risk estimate! > The other thing is - the risk of a particular path doesn't matter in > an absolute sense, only a relative one. In the particular case I'm on > about here, we *know* there's a less-risky alternative. Exactly! This, a thousand times. This reminds me of how people behave in the real world. In the real world people deal with this without too much difficulty. Everything is situational and based on immediate trade-offs, with plenty of uncertainty at every step. For example, if you think that there is even a tiny chance of a piece of fruit being poisonous, you don't eat the piece of fruit -- better to wait until lunchtime. This is one of the *easiest* decisions I can think of, despite the uncertainty. (Except perhaps if you happen to be in danger of dying of starvation, in which case it might be a very different story. And so on.) -- Peter Geoghegan
Re: disfavoring unparameterized nested loops
Peter Geoghegan writes: > On Mon, Jun 21, 2021 at 10:14 AM Robert Haas wrote: >> The other thing is - the risk of a particular path doesn't matter in >> an absolute sense, only a relative one. In the particular case I'm on >> about here, we *know* there's a less-risky alternative. > Exactly! This, a thousand times. This is a striking oversimplification. You're ignoring the fact that the plan shape we generate now is in fact *optimal*, and easily proven to be so, in some very common cases. I don't think the people whose use-cases get worse are going to be mollified by the argument that you reduced their risk, when there is provably no risk. Obviously the people whose use-cases are currently hitting the wrong end of the risk will be happy with any change whatever, but those may not be the same people. I'm willing to take some flak if there's not an easy proof that the outer scan is single-row, but I don't think we should just up and change it for cases where there is. regards, tom lane
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 1:52 PM Tom Lane wrote: > You're ignoring the fact that the plan shape we generate now is in fact > *optimal*, and easily proven to be so, in some very common cases. As I've said I don't reject the idea that there is room for disagreement on the specifics. For example perhaps it'll turn out that only a restricted subset of the cases that Robert originally had in mind will truly turn out to work as well as hoped. But that just seems like a case of Robert refining a very preliminary proposal. I absolutely expect there to be some need to iron out the wrinkles. > I don't > think the people whose use-cases get worse are going to be mollified by > the argument that you reduced their risk, when there is provably no risk. By definition what we're doing here is throwing away slightly cheaper plans when the potential downside is much higher than the potential upside of choosing a reasonable alternative. I don't think that the downside is particularly likely. In fact I believe that it's fairly unlikely in general. This is an imperfect trade-off, at least in theory. I fully own that. > I'm willing to take some flak if there's not an easy proof that the outer > scan is single-row, but I don't think we should just up and change it > for cases where there is. Seems reasonable to me. -- Peter Geoghegan
Re: Doc chapter for Hash Indexes
On Mon, Jun 21, 2021 at 02:08:12PM +0100, Simon Riggs wrote:
> New chapter for Hash Indexes, designed to help users understand how
> they work and when to use them.
>
> Mostly newly written, but a few paras lifted from README when they were
> helpful.

+
+ PostgreSQL includes an implementation of persistent on-disk hash indexes,
+ which are now fully crash recoverable. Any data type can be indexed by a

I don't see any need to mention that they're "now" crash safe.

+ Each hash index tuple stores just the 4-byte hash value, not the actual
+ column value. As a result, hash indexes may be much smaller than B-trees
+ when indexing longer data items such as UUIDs, URLs etc.. The absence of

comma: URLs, etc.

+ the column value also makes all hash index scans lossy. Hash indexes may
+ take part in bitmap index scans and backward scans.

Isn't it more correct to say that it must use a bitmap scan?

+ through the tree until the leaf page is found. In tables with millions
+ of rows this descent can increase access time to data. The equivalent

rows comma

+ that hash value. When scanning a hash bucket during queries we need to

queries comma

+
+ As a result of the overflow cases, we can say that hash indexes are
+ most suitable for unique, nearly unique data or data with a low number
+ of rows per hash bucket will be suitable for hash indexes. One

The beginning and end of the sentence duplicate "suitable".

+ Each row in the table indexed is represented by a single index tuple in
+ the hash index. Hash index tuples are stored in the bucket pages, and if
+ they exist, the overflow pages.

"the overflow pages" didn't sound right, but I was confused by the comma.
I think it should say ".. in bucket pages and overflow pages, if any."

-- 
Justin
Re: Maintaining a list of pgindent commits for "git blame" to ignore
On Mon, Jun 21, 2021 at 8:34 AM Tom Lane wrote: > It's possible that some of these touch few enough lines that they > are not worth listing; I did not check the commit delta sizes. Commits that touch very few lines weren't included originally, just because it didn't seem necessary. Even still, I've looked through the extra commits now, and everything that you picked out looks in scope. I'm just going to include these extra commits. Attached is a new version of the same file, based on your feedback (with those extra commits, and some commits from the last version removed). I'll produce a conventional patch file in the next revision, most likely. > Meh. I can get on board with the idea of adding commit+revert pairs > to this list, but I'd like a more principled selection filter than > "which ones bother Peter". Maybe the size of the reverted patch > should factor into it I have to admit that you're right. That was why I picked those two out. Of course I can defend this choice in detail, but in the interest of not setting a terrible precedent I won't do that. The commits in question have been removed from this revised version. I think it's important that we not get into the business of adding stuff to this willy-nilly. Inevitably everybody will have their own pet peeve noisy commit, and will want to add to the list -- just like I did. Naturally nobody will be interested in arguing against including whatever individual pet peeve commit each time this comes up. Regardless of the merits of the case. Let's just not go there (unless perhaps it's truly awful for almost everybody). > Do we have an idea of how much adding entries to this list slows > down "git blame"? If we include commit+revert pairs more than > very sparingly, I'm afraid we'll soon have an enormous list, and > I wonder what that will cost us. I doubt it costs us much, at least in any way that has a very noticeable relationship as new commits are added. I've now settled on 68 commits, and expect that this won't need to grow very quickly, so that seems fine. From my point of view it makes "git blame" far more useful. LLVM uses a file with fewer entries, and have had such a file since last year: https://github.com/llvm/llvm-project/blob/main/.git-blame-ignore-revs The list of commit hashes in the file that the Blender project uses is about the same size: https://github.com/blender/blender/blob/master/.git-blame-ignore-revs We're using more commits than either project here, but it's within an order of magnitude. Note that this is going to be opt-in, not opt-out. It won't do anything unless an individual hacker decides to enable it locally. The convention seems to be that it is located in the top-level directory. ISTM that we should follow that convention, since following the convention is good, and does not in itself force anybody to ignore any of the listed commits. Any thoughts on that? -- Peter Geoghegan git-blame-ignore-revs Description: Binary data
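For anyone who wants to try it out, the opt-in is the stock git mechanism rather than anything PostgreSQL-specific, for example:

    # one-off
    git blame --ignore-revs-file .git-blame-ignore-revs src/backend/access/nbtree/nbtsort.c

    # or permanently, for the current clone
    git config blame.ignoreRevsFile .git-blame-ignore-revs

Both forms require git 2.23 or later.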
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 12:52:39PM -0400, Tom Lane wrote: > You're throwing around a lot of pejorative adjectives there without > having bothered to quantify any of them. This sounds less like a sound > argument to me than like a witch trial. > > Reflecting for a bit on the ancient principle that "the only good numbers > in computer science are 0, 1, and N", it seems to me that we could make > an effort to classify RelOptInfos as provably empty, provably at most one > row, and others. (This would be a property of relations, not individual > paths, so it needn't bloat struct Path.) We already understand about > provably-empty rels, so this is just generalizing that idea a little. > Once we know about that, at least for the really-common cases like unique > keys, I'd be okay with a hard rule that we don't consider unparameterized > nestloop joins with an outer side that falls in the "N" category. > Unless there's no alternative, of course. > > Another thought that occurs to me here is that maybe we could get rid of > the enable_material knob in favor of forcing (or at least encouraging) > materialization when the outer side isn't provably one row. There were a lot of interesting ideas in this thread and I want to analyze some of them. First, there is the common assumption (not stated) that over-estimating by 5% and underestimating by 5% cause the same harm, which is clearly false. If I go to a restaurant and estimate the bill to be 5% higher or %5 lower, assuming I have sufficient funds, under or over estimating is probably fine. If I am driving and under-estimate the traction of my tires, I am probably fine, but if I over-estimate their traction by 5%, I might crash. Closer to home, Postgres is more tolerant of memory usage under-estimation than over-estimation: https://momjian.us/main/blogs/pgblog/2018.html#December_7_2018 What I hear Robert saying is that unparameterized nested loops are much more sensitive to misestimation than hash joins, and only slightly faster than hash joins when they use accurate row counts, so we might want to avoid them by default. Tom is saying that if we know the outer side will have zero or one row, we should still do unparameterized nested loops because those are not more sensitive to misestimation than hash joins, and slightly faster. If that is accurate, I think the big question is how common are cases where the outer side can't be proven to have zero or one row and nested loops are enough of a win to risk its greater sensitivity to misestimation. If it is uncommon, seems we could just code the optimizer to use hash joins in those cases without a user-visible knob, beyond the knob that already turns off nested loop joins. Peter's comment about having nodes in the executor that adjust to the row counts it finds is interesting, and eventually might be necessary once we are sure we have gone as far as we can without it. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Maintaining a list of pgindent commits for "git blame" to ignore
Peter Geoghegan writes: > The convention seems to be that it is located in the top-level > directory. ISTM that we should follow that convention, since following > the convention is good, and does not in itself force anybody to ignore > any of the listed commits. Any thoughts on that? Agreed. I think I'd previously suggested something under src/tools, but we might as well do like others are doing; especially since we have .gitattributes and the like there already. regards, tom lane
Re: Different compression methods for FPI
On Sun, Jun 20, 2021 at 11:15:08PM +0500, Andrey Borodin wrote: > I have some small questions. > > 1. > + report_invalid_record(record, "image at %X/%X > compressed with %s not supported, block %d", > + (uint32) > (record->ReadRecPtr >> 32), > + (uint32) > record->ReadRecPtr, > + "lz4", > + block_id); > Can we point out to user that the problem is in the build? What about the following error then? Say: "image at %X/%X compressed with LZ4 not supported by build, block %d". > Also, maybe %s can be inlined to lz4 in this case. I think that's a remnant of the zstd part of the patch set, where I wanted to have only one translatable message. Sure, I can align lz4 with the message. > 2. > > const char *method = "???"; > Maybe we can use something like "unknown" for unknown compression > methods? Or is it too long string for waldump output? A couple of extra bytes for pg_waldump will not matter much. Using "unknown" is fine by me. > 3. Can we exclude lz4 from config if it's not supported by the build? Enforcing the absence of this value at GUC level is enough IMO: +static const struct config_enum_entry wal_compression_options[] = { + {"pglz", WAL_COMPRESSION_PGLZ, false}, +#ifdef USE_LZ4 + {"lz4", WAL_COMPRESSION_LZ4, false}, +#endif [...] +typedef enum WalCompression +{ + WAL_COMPRESSION_NONE = 0, + WAL_COMPRESSION_PGLZ, + WAL_COMPRESSION_LZ4 +} WalCompression; It is not possible to set the GUC, still it is listed in the enum that allows us to track it. That's the same thing as default_toast_compression with its ToastCompressionId and its default_toast_compression_options. -- Michael signature.asc Description: PGP signature
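For what it's worth, with the patch applied the user-visible behavior would presumably look something like this (illustrative only, using the value names from the enum above):

    -- on a build configured --with-lz4
    SET wal_compression = lz4;

    -- on a build without LZ4 support, 'lz4' is simply not an accepted value
    SET wal_compression = lz4;
    ERROR:  invalid value for parameter "wal_compression": "lz4"

(plus the usual HINT listing the values that build actually accepts).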
Re: Different compression methods for FPI
On Tue, Jun 22, 2021 at 09:11:26AM +0900, Michael Paquier wrote: > On Sun, Jun 20, 2021 at 11:15:08PM +0500, Andrey Borodin wrote: > > I have some small questions. > > > > 1. > > + report_invalid_record(record, "image at %X/%X > > compressed with %s not supported, block %d", > > + (uint32) > > (record->ReadRecPtr >> 32), > > + (uint32) > > record->ReadRecPtr, > > + "lz4", > > + block_id); > > Can we point out to user that the problem is in the build? > > What about the following error then? Say: > "image at %X/%X compressed with LZ4 not supported by build, block > %d". The two similar, existing messages are: +#define NO_LZ4_SUPPORT() \ + ereport(ERROR, \ + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \ +errmsg("unsupported LZ4 compression method"), \ +errdetail("This functionality requires the server to be built with lz4 support."), \ +errhint("You need to rebuild PostgreSQL using --with-lz4."))) src/bin/pg_dump/pg_backup_archiver.c: fatal("cannot restore from compressed archive (compression not supported in this installation)"); src/bin/pg_dump/pg_backup_archiver.c: pg_log_warning("archive is compressed, but this installation does not support compression -- no data will be available"); src/bin/pg_dump/pg_dump.c: pg_log_warning("requested compression not available in this installation -- archive will be uncompressed"); -- Justin
Re: PXGS vs TAP tests
On Sun, Jun 20, 2021 at 01:24:04PM -0400, Andrew Dunstan wrote: > Tests pass with the attached patch, which puts the setting in the > Makefile for the recovery tests. The script itself doesn't need any > changing. +REGRESS_SHLIB=$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX) +export REGRESS_SHLIB It may be better to add a comment here explaning why REGRESS_SHLIB is required in this Makefile then? While on it, could we split those commands into multiple lines and reduce the noise of future diffs? Something as simple as that would make those prove commands easier to follow: +cd $(srcdir) && TESTDIR='$(CURDIR)' \ + $(with_temp_install) \ + PGPORT='6$(DEF_PGPORT)' \ + PG_REGRESS='$(CURDIR)/$(top_builddir)/src/test/regress/pg_regress' \ + REGRESS_SHLIB= '$(abs_top_builddir)/src/test/regress/regress$(DLSUFFIX)' \ + $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) $(if $(PROVE_TESTS),$(PROVE_TESTS),t/*.pl) There are other places where this could happen, but the TAP commands are particularly long. -- Michael signature.asc Description: PGP signature
Re: pgbench logging broken by time logic changes
On Sun, Jun 20, 2021 at 10:15:55AM -0400, Andrew Dunstan wrote: > If this were core server code threatening data integrity I would be > inclined to be more strict, but after all pg_bench is a utility program, > and I think we can allow a little more latitude. +1. Let's be flexible here. It looks better to not rush a fix, and we still have some time ahead. -- Michael signature.asc Description: PGP signature
Re: disfavoring unparameterized nested loops
On Mon, Jun 21, 2021 at 4:51 PM Bruce Momjian wrote: > There were a lot of interesting ideas in this thread and I want to > analyze some of them. First, there is the common assumption (not > stated) that over-estimating by 5% and underestimating by 5% cause the > same harm, which is clearly false. If I go to a restaurant and estimate > the bill to be 5% higher or %5 lower, assuming I have sufficient funds, > under or over estimating is probably fine. If I am driving and > under-estimate the traction of my tires, I am probably fine, but if I > over-estimate their traction by 5%, I might crash. My favorite analogy is the home insurance one: It might make sense to buy home insurance because losing one's home (say through fire) is a loss that usually just cannot be tolerated -- you are literally ruined. We can debate how likely it is to happen, but in the end it's not so unlikely that it can't be ruled out. At the same time I may be completely unwilling to buy insurance for personal electronic devices. I can afford to replace all of them if I truly have to. And the chances of all of them breaking or being stolen on the same day is remote (unless my home burns down!). If I drop my cell phone and crack the screen, I'll be annoyed, but it's certainly not the end of the world. This behavior will make perfect sense to most people. But it doesn't scale up or down. I have quite a few electronic devices, but only a single home, so technically I'm taking risks way more often than I am playing it safe here. Am I risk tolerant when it comes to insurance? Conservative? I myself don't think that it is sensible to apply either term here. It's easier to just look at the specifics. A home is a pretty important thing to almost everybody; we can afford to treat it as a special case. > If that is accurate, I think the big question is how common are cases > where the outer side can't be proven to have zero or one row and nested > loops are enough of a win to risk its greater sensitivity to > misestimation. If it is uncommon, seems we could just code the > optimizer to use hash joins in those cases without a user-visible knob, > beyond the knob that already turns off nested loop joins. I think it's possible that Robert's proposal will lead to very slightly slower plans in the vast majority of cases that are affected, while still being a very good idea. Why should insurance be 100% free, though? Maybe it can be in some cases where we get lucky, but why should that be the starting point? It just has to be very cheap relative to what we do today for us to come out ahead, certainly, but that seems quite possible in at least this case. -- Peter Geoghegan
Re: Optionally automatically disable logical replication subscriptions on error
Much of the discussion above seems to be related to where to store the error information and how much information is needed to be useful. As a summary, the 5 alternatives I have seen mentioned are: #1. Store some simple message in the pg_subscription ("I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong" [Mark-1]). Storing the error message was also seen as a convenience for writing TAP tests ("I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file" [Mark-2}). It also can sometimes provide a simple clue for the error (e.g. PK violation for table TBL) but still the user will have to look elsewhere for details to resolve the error. So while this implementation seems good for simple scenarios, it appears to have been shot down because the non-trivial scenarios either have insufficient or wrong information in the error message. Some DETAILS could have been added to give more information but that would maybe bloat the catalog ("I have not yet included the DETAIL field because I didn't want to bloat the catalog." [Mark-3]) #2. Similarly another idea was to use another existing catalog table pg_subscription_rel. This could have the same problems ("It won't be sufficient to store information in either pg_subscription_rel or pg_susbscription." [Amit-1]) #3. There is another suggestion to use the Stats Collector to hold the error message [Amit-2]. For me, this felt like blurring too much the distinction between "stats tracking/metrics" and "logs". ERROR logs must be flushed, whereas for stats (IIUC) there is no guarantee that everything you need to see would be present. Indeed Amit wrote "But in this case, if the stats collector missed updating the information, the user may have to manually update the subscription and let the error happen again to see it." [Amit-3]. Requesting the user to cause the same error again just in case it was not captured a first time seems too strange to me. #4. The next idea was to have an entirely new catalog for holding the subscription error information. I feel that storing/duplicating lots of error information in another table seems like a bridge too far. What about the risks of storing incorrect or sufficient information? What is the advantage of duplicating errors over just referring to the log files for ERROR details? #5. Document to refer to the logs. All ERROR details are already in the logs, and this seems to me the intuitive place to look for them. Searching for specific errors becomes difficult programmatically (is this really a problem other than complex TAP tests?). But here there is no risk of missing or insufficient information captured in the log files ("but still there will be some information related to ERROR which we wanted the user to see unless we ask them to refer to logs for that." [Amit-4}). --- My preferred alternative is #5. ERRORs are logged in the log file, so there is nothing really for this patch to do in this regard (except documentation), and there is no risk of missing any information, no ambiguity of having duplicated errors, and it is the intuitive place the user would look. So I felt current best combination is just this: a) A tri-state indicating the state of the subscription: e.g. 
something like "enabled" ('e')/ "disabled" ('d') / "auto-disabled" ('a') [Amit-5] b) For "auto-disabled" the PG docs would be updated tell the user to check the logs to resolve the problem before re-enabling the subscription // IMO it is not made exactly clear to me what is the main goal of this patch. Because of this, I feel that you can't really judge if this new option is actually useful or not except only in hindsight. It seems like whatever you implement can be made to look good or bad, just by citing different test scenarios. e.g. * Is the goal mainly to help automated (TAP) testing? In that case, then maybe you do want to store the error message somewhere other than the log files. But still I wonder if results would be unpredictable anyway - e.g if there are multiple tables all with errors then it depends on the tablesync order of execution which error you see caused the auto-disable, right? And if it is not predictable maybe it is less useful. * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad at some point in future and then going into a relaunch loop for days/weeks and causing 1000's of errors without the user noticing. In that case, this patch seems to be quite useful, but for this goal maybe you don't want to be checking the tablesync workers at all, but should only be checking the apply worker like your original v1 patch did. * Is the goal just to be a convenient way to disable the subscription during the CREATE SUBSCRIPTION phase so that the user can make corrections in peace without the workers re-launching and making more error logs? Here the patch is helpful, but only for simple scen
Re: Different compression methods for FPI
On Mon, Jun 21, 2021 at 07:19:27PM -0500, Justin Pryzby wrote: > The two similar, existing messages are: > > +#define NO_LZ4_SUPPORT() \ > + ereport(ERROR, \ > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \ > +errmsg("unsupported LZ4 compression method"), \ > +errdetail("This functionality requires the server to > be built with lz4 support."), \ > +errhint("You need to rebuild PostgreSQL using > --with-lz4."))) > > src/bin/pg_dump/pg_backup_archiver.c: fatal("cannot > restore from compressed archive (compression not supported in this > installation)"); > src/bin/pg_dump/pg_backup_archiver.c: pg_log_warning("archive is > compressed, but this installation does not support compression -- no data > will be available"); > src/bin/pg_dump/pg_dump.c: pg_log_warning("requested compression > not available in this installation -- archive will be uncompressed"); The difference between the first message and the rest is that the backend has much more room in terms of error verbosity while xlogreader.c needs to worry also about the frontend. In this case, we need to worry about the block involved and its LSN. Perhaps you have a suggestion? -- Michael signature.asc Description: PGP signature
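For illustration only, here is a standalone sketch of the kind of terse, frontend-safe message discussed above, one that still carries the compression method, the block number, and the record's LSN. The wording, function name, and types are invented for this example and are not taken from the patch:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the real typedef */

/*
 * Hypothetical helper: format an error string short enough for the
 * frontend-capable xlogreader code while keeping the block id and LSN.
 */
static void
format_image_decompress_error(char *buf, size_t buflen, const char *method,
                              int block_id, XLogRecPtr lsn)
{
    snprintf(buf, buflen,
             "image at %X/%X compressed with %s not supported by this build, block %d",
             (unsigned int) (lsn >> 32), (unsigned int) lsn,
             method, block_id);
}

int
main(void)
{
    char        errbuf[128];

    format_image_decompress_error(errbuf, sizeof(errbuf), "lz4", 0,
                                  (XLogRecPtr) 0x16B3748);
    puts(errbuf);
    return 0;
}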
Re: Added schema level support for publication.
On Mon, Jun 21, 2021 at 3:16 PM vignesh C wrote: > I felt this is ok as we specify the keycount to be 1, so only the > key[0] will be used. Thoughts? > ScanKeyInit(&key[0], > Anum_pg_class_relkind, > BTEqualStrategyNumber, F_CHAREQ, > CharGetDatum(RELKIND_PARTITIONED_TABLE)); > > scan = table_beginscan_catalog(classRel, 1, key); > It maybe fine, just doesn't look correct when you look at the function as a whole. > > = > > > > in UpdatePublicationTypeTupleValue(): > > > > + tup = heap_modify_tuple(tup, RelationGetDescr(rel), values, nulls, > > + replaces); > > + > > + /* Update the catalog. */ > > + CatalogTupleUpdate(rel, &tup->t_self, tup); > > > > Not sure if this tup needs to be freed or if the memory context will > > take care of it. > > I felt this is ok, as the cleanup is handled in the caller function > "AlterPublication", thoughts? > /* Cleanup. */ > heap_freetuple(tup); > table_close(rel, RowExclusiveLock); that should be fine. regards, Ajin Cherian Fujitsu Australia
Re: Use simplehash.h instead of dynahash in SMgr
On Tue, 22 Jun 2021 at 02:53, Robert Haas wrote: > At the risk of kibitzing the least-important detail of this proposal, > I'm not very happy with the names of our hash implementations. > simplehash is not especially simple, and dynahash is not particularly > dynamic, especially now that the main place we use it is for > shared-memory hash tables that can't be resized. Likewise, generichash > doesn't really give any kind of clue about how this hash table is > different from any of the others. I don't know how possible it is to > do better here; naming things is one of the two hard problems in > computer science. In a perfect world, though, our hash table > implementations would be named in such a way that somebody might be > able to look at the names and guess on that basis which one is > best-suited to a given task. I'm certainly open to better names. I did almost call it stablehash, in regards to the pointers to elements not moving around like they do with simplehash. I think more generally, hash table implementations are complex enough that it's pretty much impossible to give them a short enough meaningful name. Most papers just end up assigning a name to some technique. e.g Robinhood, Cuckoo etc. Both simplehash and generichash use a variant of Robinhood hashing. simplehash uses open addressing and generichash does not. Instead of Andres naming it simplehash, if he'd instead called it "robinhoodhash", then someone might come along and complain that his implementation is broken because it does not implement tombstoning. Maybe Andres thought he'd avoid that by not claiming that it's an implementation of a Robinhood hash table. That seems pretty wise to me. Naming it simplehash was a pretty simple way of avoiding that problem. Anyway, I'm open to better names, but I don't think the name should drive the implementation. If the implementation does not fit the name perfectly, then the name should change rather than the implementation. Personally, I think we should call it RowleyHash, but I think others might object. ;-) David
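For readers unfamiliar with the Robin Hood technique referred to above, here is a minimal, self-contained sketch of the core insertion rule. It is a toy (fixed-size table, distinct keys, linear probing, no deletion or growth) with invented names; it is not code from simplehash.h or the proposed generichash.h:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct RHEntry
{
    uint32_t    key;
    uint32_t    dist;           /* distance from the key's "home" bucket */
    bool        used;
} RHEntry;

/*
 * Insert 'key' using the Robin Hood rule: while probing, if the resident
 * entry sits closer to its home bucket than the carried key does, displace
 * it and keep probing to re-place the displaced entry ("steal from the
 * rich").  Assumes distinct keys and at least one free bucket; a real
 * implementation grows the table to keep the load factor below 1.
 */
bool
rh_insert(RHEntry *table, size_t nbuckets, uint32_t key)
{
    uint32_t    dist = 0;
    size_t      i = key % nbuckets;

    for (size_t probes = 0; probes < nbuckets; probes++)
    {
        if (!table[i].used)
        {
            table[i].key = key;
            table[i].dist = dist;
            table[i].used = true;
            return true;
        }

        if (table[i].dist < dist)
        {
            uint32_t    tmpkey = table[i].key;
            uint32_t    tmpdist = table[i].dist;

            table[i].key = key;
            table[i].dist = dist;
            key = tmpkey;
            dist = tmpdist;
        }

        i = (i + 1) % nbuckets;
        dist++;
    }
    return false;               /* table was full */
}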
Re: Use simplehash.h instead of dynahash in SMgr
On Tue, 22 Jun 2021 at 03:43, Tom Lane wrote: > I kind of wonder if we really need four different hash table > implementations (this being the third "generic" one, plus hash join > has its own, and I may have forgotten others). Should we instead > think about revising simplehash to gain the benefits of this patch? hmm, yeah. I definitely agree with trying to have as much reusable code as we can when we can. It certainly reduces maintenance and bugs tend to be found more quickly too. It's a very worthy cause. I did happen to think of this when I was copying swathes of code out of simplehash.h. However, I decided that the two implementations are sufficiently different that if I tried to merge them both into one .h file, we'd have some unreadable and unmaintainable mess. I just don't think their DNA is compatible enough for the two to be mated successfully. For example, simplehash uses open addressing and generichash does not. This means that things like iterating over the table works completely differently. Lookups in generichash need to perform an extra step to fetch the actual data from the segment arrays. I think it would certainly be possible to merge the two, but I just don't think it would be easy code to work on if we did that. The good thing is that that the API between the two is very similar and it's quite easy to swap one for the other. I did make changes around memory allocation due to me being too cheap to zero memory when I didn't need to and simplehash not having any means of allocating memory without zeroing it. I also think that there's just no one-size-fits-all hash table type. simplehash will not perform well when the size of the stored element is very large. There's simply too much memcpying to move data around during insert/delete. simplehash will also have terrible iteration performance in sparsely populated tables. However, simplehash will be pretty much unbeatable for lookups where the element type is very small, e.g single Datum, or an int. The CPU cache efficiency there will be pretty much unbeatable. I tried to document the advantages of each in the file header comments. I should probably also add something to simplehash.h's comments to mention generichash.h David
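To make the open-addressing-versus-stable-storage contrast concrete, here is a toy, self-contained comparison of the two lookup styles. All names are invented; the stable-address side is shown as plain chaining for brevity, whereas the proposed generichash actually keeps its elements in segment arrays rather than linked lists:

#include <stddef.h>
#include <stdint.h>

/*
 * Open addressing: elements live in the bucket array itself, so pointers
 * to elements can be invalidated when the table grows or entries shift.
 */
typedef struct OAEntry
{
    uint32_t    key;
    int         used;
} OAEntry;

OAEntry *
oa_lookup(OAEntry *table, size_t nbuckets, uint32_t key)
{
    size_t      i = key % nbuckets;

    for (size_t probes = 0; probes < nbuckets && table[i].used; probes++)
    {
        if (table[i].key == key)
            return &table[i];
        i = (i + 1) % nbuckets; /* linear probing */
    }
    return NULL;
}

/*
 * Stable storage (shown here as chaining): elements are allocated outside
 * the bucket array, so their addresses never change, at the cost of an
 * extra indirection on every lookup.
 */
typedef struct ChainEntry
{
    uint32_t    key;
    struct ChainEntry *next;
} ChainEntry;

ChainEntry *
chain_lookup(ChainEntry **buckets, size_t nbuckets, uint32_t key)
{
    for (ChainEntry *e = buckets[key % nbuckets]; e != NULL; e = e->next)
    {
        if (e->key == key)
            return e;
    }
    return NULL;
}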
Re: Fix for segfault in logical replication on master
On Mon, 21 Jun 2021 at 19:06, Amit Kapila wrote: > On Mon, Jun 21, 2021 at 4:09 PM Japin Li wrote: >> >> On Mon, 21 Jun 2021 at 17:54, Amit Kapila wrote: >> > On Mon, Jun 21, 2021 at 2:06 PM Japin Li wrote: >> >> >> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila wrote: >> >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li wrote: >> >> >> >> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila >> >> >> wrote: >> >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila >> >> >> > wrote: >> >> >> >> >> >> Or we can free the memory owned by indexoidlist after check whether it >> >> >> is NIL, >> >> >> because we do not use it in the later. >> >> >> >> >> > >> >> > Valid point. But I am thinking do we really need to fetch and check >> >> > indexoidlist here? >> >> >> >> IMO, we shold not fetch and check the indexoidlist here, since we do not >> >> use it. However, we should use RelationGetIndexList() to update the >> >> reladion->rd_replidindex, so we should fetch the indexoidlist, maybe we >> >> can use the following code: >> >> >> >> indexoidlist = RelationGetIndexList(relation); >> >> list_free(indexoidlist); >> >> >> >> Or does there any function that only update the relation->rd_replidindex >> >> or related fields, but do not fetch the indexoidlist? >> >> >> > >> > How about RelationGetReplicaIndex? It fetches the indexlist only when >> > required and frees it immediately. But otherwise, currently, there >> > shouldn't be any memory leak because we allocate this in "logical >> > replication output context" which is reset after processing each >> > change message, see pgoutput_change. >> >> Thanks for your explanation. It might not be a memory leak, however it's >> a little confuse when we free the memory of the indexoidlist in one place, >> but not free it in another place. >> >> I attached a patch to fix this. Any thoughts? >> > > Your patch appears to be on the lines we discussed but I would prefer > to get it done after Beta2 as this is just a minor code improvement. > Can you please send the change as a patch file instead of copy-pasting > the diff at the end of the email? Thanks for your review! Attached v1 patch. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. >From 0582d92ba4df517f3e67717751cddb8916c8ab9f Mon Sep 17 00:00:00 2001 From: Japin Li Date: Tue, 22 Jun 2021 09:59:38 +0800 Subject: [PATCH v1] Minor code beautification for RelationGetIdentityKeyBitmap --- src/backend/utils/cache/relcache.c | 13 - 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index d55ae016d0..94fbf1aa19 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -5244,9 +5244,9 @@ Bitmapset * RelationGetIdentityKeyBitmap(Relation relation) { Bitmapset *idindexattrs = NULL; /* columns in the replica identity */ - List *indexoidlist; Relation indexDesc; int i; + Oid replidindex; MemoryContext oldcxt; /* Quick exit if we already computed the result */ @@ -5260,18 +5260,14 @@ RelationGetIdentityKeyBitmap(Relation relation) /* Historic snapshot must be set. 
*/ Assert(HistoricSnapshotActive()); - indexoidlist = RelationGetIndexList(relation); - - /* Fall out if no indexes (but relhasindex was set) */ - if (indexoidlist == NIL) - return NULL; + replidindex = RelationGetReplicaIndex(relation); /* Fall out if there is no replica identity index */ - if (!OidIsValid(relation->rd_replidindex)) + if (!OidIsValid(replidindex)) return NULL; /* Look up the description for the replica identity index */ - indexDesc = RelationIdGetRelation(relation->rd_replidindex); + indexDesc = RelationIdGetRelation(replidindex); if (!RelationIsValid(indexDesc)) elog(ERROR, "could not open relation with OID %u", @@ -5295,7 +5291,6 @@ RelationGetIdentityKeyBitmap(Relation relation) } RelationClose(indexDesc); - list_free(indexoidlist); /* Don't leak the old values of these bitmaps, if any */ bms_free(relation->rd_idattr); -- 2.30.2
Re: Optionally automatically disable logical replication subscriptions on error
> On Jun 21, 2021, at 5:57 PM, Peter Smith wrote: > > #5. Document to refer to the logs. All ERROR details are already in > the logs, and this seems to me the intuitive place to look for them. My original motivation came from writing TAP tests to check that the permissions systems would properly deny the apply worker when running under a non-superuser role. The idea is that the user with the responsibility for managing subscriptions won't have enough privilege to read the logs. Whatever information that user needs (if any) must be someplace else. > Searching for specific errors becomes difficult programmatically (is > this really a problem other than complex TAP tests?). I believe there is a problem, because I remain skeptical that these errors will be both existent and rare. Either you've configured your system correctly and you get zero of these, or you've misconfigured it and you get some non-zero number of them. I don't see any reason to assume that number will be small. The best way to deal with that is to be able to tell the system what to do with them, like "if the error has this error code and the error message matches this regular expression, then do this, else do that." That's why I think allowing triggers to be created on subscriptions makes the most sense (though is probably the hardest system being proposed so far.) > But here there > is no risk of missing or insufficient information captured in the log > files ("but still there will be some information related to ERROR > which we wanted the user to see unless we ask them to refer to logs > for that." [Amit-4}). Not only is there a problem if the user doesn't have permission to view the logs, but also, if we automatically disable the subscription until the error is manually cleared, the logs might be rotated out of existence before the user takes any action. In that case, the logs will be entirely missing, and not even the error message will remain. At least with the patch I submitted, the error message will remain, though I take Amit's point that there are deficiencies in handling parallel tablesync workers, etc. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: RFC: Logging plan of the running query
On 2021-06-16 20:36, torikoshia wrote: other background or parallel worker. As far as I looked around, there seems no easy ways to do so. If we were to invent a new mechanism just for addressing the above comment, I would rather choose to not do that as it seems like an overkill. We can leave it up to the user whether or not to unnecessarily signal those processes which are bound to print "PID XXX is not executing queries now" message. +1. I'm going to proceed in this direction. Updated the patch. On Thu, Jun 10, 2021 at 11:09 AM torikoshia wrote: On 2021-06-09 23:04, Fujii Masao wrote: > auto_explain can log the plan of even nested statement > if auto_explain.log_nested_statements is enabled. > But ISTM that pg_log_current_plan() cannot log that plan. > Is this intentional? > I think that it's better to make pg_log_current_plan() log > the plan of even nested statement. +1. It would be better. But currently plan information is got from ActivePortal and ISTM there are no easy way to retrieve plan information of nested statements from ActivePortal. Anyway I'll do some more research. I haven't found a proper way yet but it seems necessary to use something other than ActivePortal and I'm now thinking this could be a separate patch in the future. -- Regards, -- Atsushi Torikoshi NTT DATA CORPORATIONFrom 7ad9a280b6f74e4718293863716046c02b0a3835 Mon Sep 17 00:00:00 2001 From: atorik Date: Tue, 22 Jun 2021 10:03:39 +0900 Subject: [PATCH v4] Add function to log the untruncated query string and its plan for the query currently running on the backend with the specified process ID. Currently, we have to wait for the query execution to finish before we check its plan. This is not so convenient when investigating long-running queries on production environments where we cannot use debuggers. To improve this situation, this patch adds pg_log_current_query_plan() function that requests to log the plan of the specified backend process. Only superusers are allowed to request to log the plans because allowing any users to issue this request at an unbounded rate would cause lots of log messages and which can lead to denial of service. On receipt of the request, at the next CHECK_FOR_INTERRUPTS(), the target backend logs its plan at LOG_SERVER_ONLY level, so that these plans will appear in the server log but not be sent to the client. Since some codes and comments of pg_log_current_query_plan() are the same with pg_log_backend_memory_contexts(), this patch also refactors them to make them common. 
Reviewd-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar --- doc/src/sgml/func.sgml | 44 src/backend/commands/explain.c | 75 src/backend/storage/ipc/procsignal.c | 66 + src/backend/tcop/postgres.c | 4 ++ src/backend/utils/adt/mcxtfuncs.c| 51 + src/backend/utils/init/globals.c | 1 + src/include/catalog/pg_proc.dat | 6 ++ src/include/commands/explain.h | 2 + src/include/miscadmin.h | 1 + src/include/storage/procsignal.h | 3 + src/test/regress/expected/misc_functions.out | 12 +++- src/test/regress/sql/misc_functions.sql | 8 +-- 12 files changed, 217 insertions(+), 56 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 6388385edc..dad2d34a04 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -24940,6 +24940,27 @@ SELECT collation for ('foo' COLLATE "de_DE"); + + + + pg_log_current_query_plan + +pg_log_current_query_plan ( pid integer ) +boolean + + +Requests to log the untruncated query string and its plan for +the query currently running on the backend with the specified +process ID. +They will be logged at LOG message level and +will appear in the server log based on the log +configuration set (See +for more information), but will not be sent to the client +regardless of . +Only superusers can request to log plan of the running query. + + + @@ -25053,6 +25074,29 @@ LOG: Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 because it may generate a large number of log messages. + +pg_log_current_query_plan can be used +to log the plan of a backend process. For example: + +postgres=# SELECT pg_log_current_query_plan(201116); + pg_log_current_query_plan +--- + t +(1 row) + +The format of the query plan is the same as when FORMAT TEXT +and VEBOSE are used in the EXPLAIN command. +For example: + +LOG: plan of the query running on backend with PID 201116 is: +Query Text: SELECT * FROM pgbench_accounts; +Seq Scan on public.pgbench_accounts (cost=0.00..263
Re: Optionally automatically disable logical replication subscriptions on error
> On Jun 21, 2021, at 5:57 PM, Peter Smith wrote: > > * Is the goal mainly to help automated (TAP) testing? Absolutely, that was my original motivation. But I don't think that is the primary reason the patch would be accepted. There is a cost to having the logical replication workers attempt ad infinitum to apply a transaction that will never apply. Also, if you are waiting for a subscription to catch up, it is far from obvious that you will wait forever. > In that case, > then maybe you do want to store the error message somewhere other than > the log files. But still I wonder if results would be unpredictable > anyway - e.g if there are multiple tables all with errors then it > depends on the tablesync order of execution which error you see caused > the auto-disable, right? And if it is not predictable maybe it is less > useful. But if you are writing a TAP test, you should be the one controlling whether that is the case. I don't think it would be unpredictable from the point of view of the test author. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: row filtering for logical replication
On Fri, Jun 18, 2021 at 9:40 PM Amit Kapila wrote: > [...] > I have rebased the patch so that you can try it out. The main thing I > have done is to remove changes in worker.c and created a specialized > function to create estate for pgoutput.c as I don't think we need what > is done in worker.c. Thanks for the recent rebase. - The v15 patch applies OK (albeit with a whitespace warning) - make check is passing OK - the new TAP test 020_row_filter is passing OK. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Optionally automatically disable logical replication subscriptions on error
On Mon, Jun 21, 2021 at 7:48 PM Amit Kapila wrote: > > On Mon, Jun 21, 2021 at 11:19 AM Amit Kapila wrote: > > > > On Mon, Jun 21, 2021 at 10:55 AM Mark Dilger > > wrote: > > > > > I don't mind if you want to store more information, and maybe that needs > > > to be stored somewhere else. Do you believe pg_subscription_rel is a > > > suitable location? > > > > > It won't be sufficient to store information in either > > pg_subscription_rel or pg_susbscription. I think if we want to store > > the required information in a catalog then we need to define a new > > catalog (pg_subscription_conflicts or something like that) with > > information corresponding to each rel in subscription (srsubid oid > > (Reference to subscription), srrelid oid (Reference to relation), > > ). OTOH, we can choose to send the error > > information to stats collector which will then be available via stat > > view and update system catalog to disable the subscription but there > > will be a risk that we might send info of failed transaction to stats > > collector but then fail to update system catalog to disable the > > subscription. > > > > I think we should store the input from the user (like disable_on_error > flag or xid to skip) in the system catalog pg_subscription and send > the error information (subscrtion_id, rel_id, xid of failed xact, > error_code, error_message, etc.) to the stats collector which can be > used to display such information via a stat view. > > The disable_on_error flag handling could be that on error it sends the > required error info to stats collector and then updates the subenabled > in pg_subscription. In rare conditions, where we are able to send the > message but couldn't update the subenabled info in pg_subscription > either due to some error or server restart, the apply worker would > again try to apply the same change and would hit the same error again > which I think should be fine because it will ultimately succeed. > > The skip xid handling will also be somewhat similar where on an error, > we will send the error information to stats collector which will be > displayed via stats view. Then the user is expected to ask for skip > xid (Alter Subscription ... SKIP ) based on information > displayed via stat view. Now, the apply worker can skip changes from > such a transaction, and then during processing of commit record of the > skipped transaction, it should update xid to invalid value, so that > next time that shouldn't be used. I think it is important to update > xid to an invalid value as part of the skipped transaction because > otherwise, after the restart, we won't be able to decide whether we > still want to skip the xid stored for a subscription. Sounds reasonable. The feature that sends the error information to the stats collector is a common feature for both and itself is also useful. As discussed in that skip transaction patch thread, it would also be good if we write error information (relation, action, xid, etc) to the server log too. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Maintaining a list of pgindent commits for "git blame" to ignore
On Mon, Jun 21, 2021 at 5:06 PM Tom Lane wrote: > Agreed. I think I'd previously suggested something under src/tools, > but we might as well do like others are doing; especially since > we have .gitattributes and the like there already. Great. Attached is a patch file that puts it all together. I would like to commit this in the next couple of days. -- Peter Geoghegan v1-0001-Add-list-of-ignorable-pgindent-commits-for-git-bl.patch Description: Binary data
Re: Optionally automatically disable logical replication subscriptions on error
On Tue, Jun 22, 2021 at 6:27 AM Peter Smith wrote: > > #3. There is another suggestion to use the Stats Collector to hold the > error message [Amit-2]. For me, this felt like blurring too much the > distinction between "stats tracking/metrics" and "logs". ERROR logs > must be flushed, whereas for stats (IIUC) there is no guarantee that > everything you need to see would be present. Indeed Amit wrote "But in > this case, if the stats collector missed updating the information, the > user may have to manually update the subscription and let the error > happen again to see it." [Amit-3]. Requesting the user to cause the > same error again just in case it was not captured a first time seems > too strange to me. > I don't think it will often be the case that the stats collector will miss updating the information. I am not feeling comfortable storing error information in system catalogs. We have some other views which capture somewhat similar conflict information (pg_stat_database_conflicts) or failed transactions information. So, I thought here we are extending the similar concept by storing some additional information about errors. -- With Regards, Amit Kapila.
Re: Use simplehash.h instead of dynahash in SMgr
On Tue, Jun 22, 2021 at 1:55 PM David Rowley wrote: > On Tue, 22 Jun 2021 at 03:43, Tom Lane wrote: > > I kind of wonder if we really need four different hash table > > implementations (this being the third "generic" one, plus hash join > > has its own, and I may have forgotten others). Should we instead > > think about revising simplehash to gain the benefits of this patch? > > hmm, yeah. I definitely agree with trying to have as much reusable > code as we can when we can. It certainly reduces maintenance and bugs > tend to be found more quickly too. It's a very worthy cause. It is indeed really hard to decide when you have a new thing, and when you need a new way to parameterise the existing generic thing. I wondered about this how-many-hash-tables-does-it-take question a lot when writing dshash.c (a chaining hash table that can live in weird dsa.c memory, backed by DSM segments created on the fly that may be mapped at different addresses in each backend, and has dynahash-style partition locking), and this was around the time Andres was talking about simplehash. In retrospect, I'd probably kick out the built-in locking and partitions and instead let callers create their own partitioning scheme on top from N tables, and that'd remove one quirk, leaving only the freaky pointers and allocator. I recall from a previous life that Boost's unordered_map template is smart enough to support running in shared memory mapped at different addresses just through parameterisation that controls the way it deals with internal pointers (boost::unordered_map<..., ShmemAllocator>), which seemed pretty clever to me, and it might be achievable to do the same with a generic hash table for us that could take over dshash's specialness. One idea I had at the time is that the right number of hash table implementations in our tree is two: one for chaining (like dynahash) and one for open addressing/probing (like simplehash), and that everything else should be hoisted out (locking, partitioning) or made into template parameters through the generic programming technique that simplehash.h has demonstrated (allocators, magic pointer type for internal pointers, plus of course the inlinable ops). But that was before we'd really fully adopted the idea of this style of template code. (I also assumed the weird memory stuff would be temporary and we'd move to threads, but that's another topic for another thread.) It seems like you'd disagree with this, and you'd say the right number is three. But it's also possible to argue for one... A more superficial comment: I don't like calling hash tables "hash". I blame perl.
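As a reminder of the "generic programming technique" referred to above, here is a toy of the macro-parameterized, header-as-template style, with the prefix, element type, and allocator chosen by the includer. All names are invented, error handling is omitted, it is deliberately not a real hash table, and an actual template header (such as simplehash.h) would also undef its parameters and generate a full set of inlinable operations:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * The includer picks a prefix, an element type, and an allocator before
 * "instantiating" the template; token pasting then generates uniquely
 * named, fully inlinable functions for that combination.
 */
#define HT_PREFIX       intset
#define HT_ELEMENT      int
#define HT_ALLOC(size)  malloc(size)

#define HT_PASTE_(a,b)  a##_##b
#define HT_PASTE(a,b)   HT_PASTE_(a,b)
#define HT_NAME(name)   HT_PASTE(HT_PREFIX, name)

typedef struct HT_NAME(table)
{
    HT_ELEMENT *slots;
    size_t      nslots;
} HT_NAME(table);               /* expands to: intset_table */

static inline HT_NAME(table) *
HT_NAME(create)(size_t nslots)  /* expands to: intset_create */
{
    HT_NAME(table) *t = HT_ALLOC(sizeof(HT_NAME(table)));

    t->slots = HT_ALLOC(nslots * sizeof(HT_ELEMENT));
    memset(t->slots, 0, nslots * sizeof(HT_ELEMENT));
    t->nslots = nslots;
    return t;
}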
Re: Optionally automatically disable logical replication subscriptions on error
> On Jun 21, 2021, at 5:57 PM, Peter Smith wrote: > > * Is the goal to prevent some *unattended* SUBSCRIPTION from going bad > at some point in future and then going into a relaunch loop for > days/weeks and causing 1000's of errors without the user noticing. In > that case, this patch seems to be quite useful, but for this goal > maybe you don't want to be checking the tablesync workers at all, but > should only be checking the apply worker like your original v1 patch > did. Yeah, my motivation was preventing an infinite loop, and providing a clean way for the users to know that replication they are waiting for won't ever complete, rather than having to infer that it will never halt. > * Is the goal just to be a convenient way to disable the subscription > during the CREATE SUBSCRIPTION phase so that the user can make > corrections in peace without the workers re-launching and making more > error logs? No. This is not and never was my motivation. It's an interesting question, but that idea never crossed my mind. I'm not sure what changes somebody would want to make *after* creating the subscription. Certainly, there may be problems with how they have things set up, but they won't know that until the first error happens. > Here the patch is helpful, but only for simple scenarios > like 1 faulty table. Imagine if there are 10 tables (all with PK > violations at DATASYNC copy) then you will encounter them one at a > time and have to re-enable the subscription 10 times, after fixing > each error in turn. You are assuming disable_on_error=true. It is false by default. But ok, let's accept that assumption for the sake of argument. Now, will you have to manually go through the process 10 times? I'm not sure. The user might figure out their mistake after seeing the first error. > So in this scenario the new option might be more > of a hindrance than a help because it would be easier if the user just > did "ALTER SUBSCRIPTION sub DISABLE" manually and fixed all the > problems in one sitting before re-enabling. Yeah, but since the new option is off by default, I don't see any sensible complaint. > > * etc > > // > > Finally, here is one last (crazy?) thought-bubble just for > consideration. I might be wrong, but my gut feeling is that the Stats > Collector is intended more for "tracking" and for "metrics" rather > than for holding duplicates of logged error messages. At the same > time, I felt that disabling an entire subscription due to a single > rogue error might be overkill sometimes. I'm happy to entertain criticism of the particulars of how my patch approaches this problem, but it is already making a distinction between transient errors (resources, network, etc.) vs. ones that are non-transient. Again, I might not have drawn the line in the right place, but the patch is not intended to disable subscriptions in response to transient errors. > But I wonder if there is a > way to combine those two ideas so that the Stats Collector gets some > new counter for tracking the number of worker re-launches that have > occurred, meanwhile there could be a subscription option which gives a > threshold above which you would disable the subscription. > e.g. > "disable_on_error_threshold=0" default, relaunch forever > "disable_on_error_threshold=1" disable upon first error encountered. > (This is how your patch behaves now I think.) > "disable_on_error_threshold=500" disable if the re-launch errors go > unattended and happen 500 times. That sounds like a misfeature to me. 
You could have a subscription that works fine for a month, surviving numerous short network outages, but then gets autodisabled after a longer network outage. I'm not sure why anybody would want that. You might argue for exponential backoff, where it never gets autodisabled on transient errors, but retries less frequently. But I don't want to expand the scope of this patch to include that, at least not without a lot more evidence that it is needed. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Maintaining a list of pgindent commits for "git blame" to ignore
On Mon, Jun 21, 2021 at 07:43:59PM -0700, Peter Geoghegan wrote: > On Mon, Jun 21, 2021 at 5:06 PM Tom Lane wrote: > > Agreed. I think I'd previously suggested something under src/tools, > > but we might as well do like others are doing; especially since > > we have .gitattributes and the like there already. > > Great. > > Attached is a patch file that puts it all together. I would like to > commit this in the next couple of days. +1 -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Two patches to speed up pg_rewind.
On Thu, Jun 17, 2021 at 3:19 PM Michael Paquier wrote: > > On Wed, Jun 02, 2021 at 05:02:10PM +0900, Michael Paquier wrote: > > On Wed, Jun 02, 2021 at 06:20:30PM +1200, Thomas Munro wrote: > > > The main thing I noticed was that Linux < 5.3 can fail with EXDEV if > > > you cross a filesystem boundary, is that something we need to worry > > > about there? > > > > Hmm. Good point. That may justify having a switch to control that. > > Paul, the patch set still needs some work, so I am switching it as > waiting on author. I am pretty sure that we had better have a > fallback implementation of copy_file_range() in src/port/, and that we > are going to need an extra switch in pg_rewind to allow users to > bypass copy_file_range()/EXDEV if they do a local rewind operation > across different FSes with a kernel < 5.3. > -- I modified the copy_file_range() patch yesterday to simply fall back to read()+write(), but I think it could be improved further. We could add a function to determine whether two files/paths are copy_file_range()-capable (using the POSIX-standard statvfs() f_fsid?) - that could be used by other copy_file_range() users, although in the long run the function would not be needed. And even with this we may still need the fallback code. - For pg_rewind, we could determine that ability once for the src/dst pgdata, but since there might be soft links (tablespace/wal) in pgdata, we should still allow the fallback for copying files that are not copy_file_range()-capable. - Also, it seems that copy_file_range() can sometimes return ENOTSUP/EOPNOTSUPP (the file system does not support it and the kernel does not fall back to simple copying?), although this is not documented and seems unusual. Any ideas?
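As a rough, standalone illustration of the fallback being described, the sketch below copies a range with copy_file_range() and falls back to a plain read()+write() loop when the call cannot work (EXDEV for cross-filesystem copies on kernels < 5.3, ENOSYS, or EOPNOTSUPP/ENOTSUP). It is Linux-oriented, error handling is abbreviated, and the function names and the exact set of errnos treated as "fall back" are assumptions rather than the patch's code:

#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/* Plain read()/write() loop, used when copy_file_range() cannot work. */
static bool
copy_range_by_read_write(int src_fd, int dst_fd, size_t len)
{
    char        buf[65536];

    while (len > 0)
    {
        size_t      chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t     nread = read(src_fd, buf, chunk);

        if (nread < 0)
            return false;
        if (nread == 0)
            break;              /* source EOF */

        for (ssize_t off = 0; off < nread;)
        {
            ssize_t     n = write(dst_fd, buf + off, (size_t) (nread - off));

            if (n < 0)
                return false;
            off += n;
        }
        len -= (size_t) nread;
    }
    return true;
}

bool
copy_range_with_fallback(int src_fd, int dst_fd, size_t len)
{
    while (len > 0)
    {
        ssize_t     n = copy_file_range(src_fd, NULL, dst_fd, NULL, len, 0);

        if (n > 0)
        {
            len -= (size_t) n;
            continue;
        }
        if (n == 0)
            break;              /* source EOF */

        /*
         * EXDEV: cross-filesystem copy on kernels < 5.3; ENOSYS: syscall
         * not available; EOPNOTSUPP/ENOTSUP: reported by some filesystems.
         */
        if (errno == EXDEV || errno == ENOSYS ||
            errno == EOPNOTSUPP || errno == ENOTSUP)
            return copy_range_by_read_write(src_fd, dst_fd, len);

        return false;
    }
    return true;
}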
Re: Different compression methods for FPI
On Tue, May 25, 2021 at 12:05:19PM +0530, Dilip Kumar wrote: > +++ b/src/test/recovery/t/011_crash_recovery.pl > @@ -14,7 +14,7 @@ use Config; > plan tests => 3; > > my $node = get_new_node('primary'); > -$node->init(allows_streaming => 1); > +$node->init(); > $node->start; > > How this change is relevant? It's necessary for the tests to pass - see the prior discussions. Revert them and the tests fail. time make -C src/test/recovery check # Failed test 'new xid after restart is greater' @Michael: I assume that if you merge this patch, you'd set your animals to use wal_compression=lz4, and then they would fail the recovery tests. So the patches that you say are unrelated still seem to me to be a prerequisite. From: Kyotaro Horiguchi Subject: [PATCH v8 2/9] Run 011_crash_recovery.pl with wal_level=minimal From: Kyotaro Horiguchi Subject: [PATCH v8 3/9] Make sure published XIDs are persistent +/* compression methods supported */ +#define BKPIMAGE_COMPRESS_PGLZ 0x04 +#define BKPIMAGE_COMPRESS_ZLIB 0x08 +#define BKPIMAGE_COMPRESS_LZ4 0x10 +#define BKPIMAGE_COMPRESS_ZSTD 0x20 +#defineBKPIMAGE_IS_COMPRESSED(info) \ + ((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_ZLIB | \ + BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD)) != 0) You encouraged saving bits here, so I'm surprised to see that your patches use one bit per compression method: 2 bits to support no/pglz/lz4, 3 to add zstd, and the previous patch used 4 bits to also support zlib. There are spare bits available for that, but now there can be an inconsistency if two bits are set. Also, 2 bits could support 4 methods (including "no"). -- Justin
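To make the "two bits can represent four methods" point concrete, here is a standalone sketch that packs the compression method into two spare info bits instead of one flag bit per method; with such an encoding it is impossible to mark two methods at once. The bit positions, macro names, and enum values are hypothetical, not what the patch proposes:

#include <stdint.h>

/* Two hypothetical spare bits in the block-image info byte. */
#define BKPIMAGE_COMPRESS_SHIFT 2
#define BKPIMAGE_COMPRESS_MASK  (0x03 << BKPIMAGE_COMPRESS_SHIFT)

typedef enum WalCompression
{
    WAL_COMPRESSION_NONE = 0,
    WAL_COMPRESSION_PGLZ = 1,
    WAL_COMPRESSION_LZ4 = 2,
    WAL_COMPRESSION_ZSTD = 3
} WalCompression;

static inline uint8_t
bkpimage_set_compression(uint8_t info, WalCompression method)
{
    return (uint8_t) ((info & ~BKPIMAGE_COMPRESS_MASK) |
                      ((unsigned int) method << BKPIMAGE_COMPRESS_SHIFT));
}

static inline WalCompression
bkpimage_get_compression(uint8_t info)
{
    return (WalCompression) ((info & BKPIMAGE_COMPRESS_MASK) >>
                             BKPIMAGE_COMPRESS_SHIFT);
}

#define BKPIMAGE_IS_COMPRESSED(info) \
    (bkpimage_get_compression(info) != WAL_COMPRESSION_NONE)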
RE: Fix for segfault in logical replication on master
On Tuesday, June 22, 2021 11:08 AM Japin Li wrote: > On Mon, 21 Jun 2021 at 19:06, Amit Kapila wrote: > > On Mon, Jun 21, 2021 at 4:09 PM Japin Li wrote: > >> > >> On Mon, 21 Jun 2021 at 17:54, Amit Kapila > wrote: > >> > On Mon, Jun 21, 2021 at 2:06 PM Japin Li wrote: > >> >> > >> >> On Mon, 21 Jun 2021 at 16:22, Amit Kapila > wrote: > >> >> > On Mon, Jun 21, 2021 at 1:30 PM Japin Li > wrote: > >> >> >> > >> >> >> On Sat, 19 Jun 2021 at 17:18, Amit Kapila > wrote: > >> >> >> > On Fri, Jun 18, 2021 at 9:18 AM Amit Kapila > wrote: > >> >> >> > >> >> >> Or we can free the memory owned by indexoidlist after check > >> >> >> whether it is NIL, because we do not use it in the later. > >> >> >> > >> >> > > >> >> > Valid point. But I am thinking do we really need to fetch and > >> >> > check indexoidlist here? > >> >> > >> >> IMO, we shold not fetch and check the indexoidlist here, since we > >> >> do not use it. However, we should use RelationGetIndexList() to > >> >> update the > >> >> reladion->rd_replidindex, so we should fetch the indexoidlist, > >> >> reladion->maybe we > >> >> can use the following code: > >> >> > >> >> indexoidlist = RelationGetIndexList(relation); > >> >> list_free(indexoidlist); > >> >> > >> >> Or does there any function that only update the > >> >> relation->rd_replidindex or related fields, but do not fetch the > indexoidlist? > >> >> > >> > > >> > How about RelationGetReplicaIndex? It fetches the indexlist only > >> > when required and frees it immediately. But otherwise, currently, > >> > there shouldn't be any memory leak because we allocate this in > >> > "logical replication output context" which is reset after > >> > processing each change message, see pgoutput_change. > >> > >> Thanks for your explanation. It might not be a memory leak, however > >> it's a little confuse when we free the memory of the indexoidlist in > >> one place, but not free it in another place. > >> > >> I attached a patch to fix this. Any thoughts? > >> > > > > Your patch appears to be on the lines we discussed but I would prefer > > to get it done after Beta2 as this is just a minor code improvement. > > Can you please send the change as a patch file instead of copy-pasting > > the diff at the end of the email? > > Thanks for your review! Attached v1 patch. Your patch can be applied to the HEAD. And, I also reviewed your patch, which seems OK. Make check-world has passed with your patch in my env as well. Best Regards, Takamichi Osumi
Re: Different compression methods for FPI
On Mon, Jun 21, 2021 at 10:13:58PM -0500, Justin Pryzby wrote: > @Michael: I assume that if you merge this patch, you'd set your animals to use > wal_compression=lz4, and then they would fail the recovery tests. Yes, I'd like to do that on my animal dangomushi. > So the patches that you say are unrelated still seem to me to be a > prerequisite . I may be missing something, of course, but I still don't see why that's necessary? We don't get any test failures on HEAD by switching wal_compression to on, no? That's easy enough to test with a two-line change that changes the default. > +/* compression methods supported */ > +#define BKPIMAGE_COMPRESS_PGLZ 0x04 > +#define BKPIMAGE_COMPRESS_ZLIB 0x08 > +#define BKPIMAGE_COMPRESS_LZ4 0x10 > +#define BKPIMAGE_COMPRESS_ZSTD 0x20 > +#defineBKPIMAGE_IS_COMPRESSED(info) \ > + ((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_ZLIB | \ > + BKPIMAGE_COMPRESS_LZ4 | BKPIMAGE_COMPRESS_ZSTD)) != > 0) > > You encouraged saving bits here, so I'm surprised to see that your patches > use one bit per compression method: 2 bits to support no/pglz/lz4, 3 to add > zstd, and the previous patch used 4 bits to also support zlib. Yeah, I know. I have just finished with that to get something readable for the sake of the tests. As you say, the point is moot just we add one new method, anyway, as we need just one new bit. And that's what I would like to do for v15 with LZ4 as the resulting patch is simple. It would be an idea to discuss more compression methods here once we hear more from users when this is released in the field, re-considering at this point if more is necessary or not. -- Michael signature.asc Description: PGP signature
Re: SSL/TLS instead of SSL in docs
On Mon, Jun 21, 2021 at 01:23:56PM +0200, Daniel Gustafsson wrote: > On 18 Jun 2021, at 07:37, Michael Paquier wrote: >> It looks inconsistent to me to point to the libpq documentation to get >> the details about SNI. Wouldn't is be better to have an item in the >> glossary that refers to the bits of RFC 6066, and remove the reference >> of the RPC from the libpq page? > > I opted for a version with SNI in the glossary but without a link to the RFC > since no definitions in the glossary has ulinks. Okay, but why making all this complicated if it can be simple? If I read the top of the page, the description of the glossary refers more to terms that apply to PostgreSQL and RDBMs in general. I think that something like the attached is actually more adapted, where there are acronyms for SNI and MITM, and where the references we have in the libpq docs are moved to the page for acronyms. That's consistent with what we do now for the existing acronyms SSL and TLS, and there does not seem to need any references to all those terms in the glossary. >> - to present a valid (trusted) SSL certificate, while >> + to present a valid (trusted) >> SSL/TLS certificate, while >> This style with two acronyms for what we want to be one thing is >> heavy. Could it be better to just have one single acronym called >> SSL/TLS that references both parts? > > Maybe, I don't know. I certainly don't prefer the way which is in the patch > but I also think it's the most "correct" way so I opted for that to start the > discussion. If we're fine with one acronym tag for the combination then I'm > happy to change to that. That feels a bit more natural to me to have SSL/TLS in the contexts where they apply as a single keyword. Do we have actually sections in the docs where only one of them apply, like some protocol references? > A slightly more invasive idea would be to not have acronyms at all and instead > move the ones that do benefit from clarification to the glossary? ISTM that > having a brief description of terms and not just the expansion is beneficial > to > the users. That would however be for another thread, but perhaps thats worth > discussing? Not sure about this bit. That's a more general topic for sure, but I also like the separation we have not between the glossary and the acronyms. Having an entry in one does not make necessary the same entry in the other, and vice-versa. >> Patch 0003, for the markups with OpenSSL, included one >> SSL/TLS entry. > > Ugh, sorry, that must've been a git add -p fat-finger. The extra SSL/TLS entry was on one of the files changed f80979f, so git add has been working just fine :) -- Michael diff --git a/doc/src/sgml/acronyms.sgml b/doc/src/sgml/acronyms.sgml index 13bd819eb1..658000b29c 100644 --- a/doc/src/sgml/acronyms.sgml +++ b/doc/src/sgml/acronyms.sgml @@ -410,6 +410,17 @@ + +MITM + + + https://en.wikipedia.org/wiki/Man-in-the-middle_attack";> + Man-in-the-middle + + + + MSVC @@ -590,6 +601,18 @@ + +SNI + + + https://en.wikipedia.org/wiki/Server_Name_Indication";> + Server Name Indication, + https://tools.ietf.org/html/rfc6066#section-3";>RFC 6066 + + + + SPI diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index eeba2caa43..f82ae4e461 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1378,7 +1378,7 @@ include_dir 'conf.d' Disables anonymous cipher suites that do no authentication. Such -cipher suites are vulnerable to man-in-the-middle attacks and +cipher suites are vulnerable to MITM attacks and therefore should not be used. 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 441cc0da3a..641970f2a6 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1783,18 +1783,17 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname By default, libpq sets the TLS extension Server Name -Indication (SNI) on SSL-enabled connections. See https://tools.ietf.org/html/rfc6066#section-3";>RFC 6066 -for details. By setting this parameter to 0, this is turned off. +Indication (SNI) on SSL-enabled connections. +By setting this parameter to 0, this is turned off. The Server Name Indication can be used by SSL-aware proxies to route connections without having to decrypt the SSL stream. (Note that this requires a proxy that is aware of the PostgreSQL protocol handshake, -not just any SSL proxy.) However, SNI makes the destination host name -appear in cleartext in the network traffic, so it might be undesirable -in some cases. +not just any SSL proxy.) However, SNI makes the +destination host name appear in cleartext in the net
Re: Added schema level support for publication.
On Tue, Jun 22, 2021 at 9:45 AM vignesh C wrote: > I have removed the skip table patches to keep the focus on the main > patch, once this patch gets into committable shape, I will focus on > the skip table patch. IMO it's a good idea to start a new thread for the "skip table" feature so that we can discuss it separately. If required you can specify in that thread that the idea of the "skip table" can be applied to the "add schema level support" feature. With Regards, Bharath Rupireddy.
Re: fdatasync performance problem with large number of DB files
On Fri, Jun 18, 2021 at 06:18:59PM +0900, Fujii Masao wrote: > On 2021/06/04 23:39, Justin Pryzby wrote: >> You said switching to SIGHUP "would have zero effect"; but, actually it >> allows >> an admin who's DB took a long time in recovery/startup to change the >> parameter >> without shutting down the service. This mitigates the downtime if it crashes >> again. I think that's at least 50% of how this feature might end up being >> used. > > Yes, it would have an effect when the server is automatically restarted > after crash when restart_after_crash is enabled. At least for me +1 to > your proposed change. Good point about restart_after_crash, I did not consider that. Switching recovery_init_sync_method to SIGHUP could be helpful with that. -- Michael signature.asc Description: PGP signature
Re: fdatasync performance problem with large number of DB files
On Fri, Jun 18, 2021 at 1:11 PM Justin Pryzby wrote: > Thomas, could you comment on this ? Sorry, I missed that. It is an initially confusing proposal, but after trying it out (that is: making recovery_init_sync_method PGC_SIGHUP and testing a scenario where you would want the next crash recovery to use it that way, with and without the change), I agree. +1 from me.
subscription/t/010_truncate.pl failure on desmoxytes in REL_13_STABLE
Hi, While scanning for assertion failures on the build farm that don't appear to have been discussed, I found this[1] in 010_truncate_publisher.log on the 13 branch: TRAP: FailedAssertion("tupdesc->tdrefcount <= 0", File: "/home/bf/build/buildfarm-desmoxytes/REL_13_STABLE/pgsql.build/../pgsql/src/backend/access/common/tupdesc.c", Line: 321) 2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:4] LOG: server process (PID 1014658) was terminated by signal 11: Segmentation fault 2021-06-17 02:17:04.392 CEST [60ca947c.f7a43:5] DETAIL: Failed process was running: SELECT pg_catalog.set_config('search_path', '', false); The last thing the segfaulting process said was: 2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:8] LOG: logical decoding found consistent point at 0/157D538 2021-06-17 02:17:03.847 CEST [60ca947f.f7b82:9] DETAIL: There are no running transactions. Unfortunately 13 doesn't log PIDs for assertion failures (hmm, commit 18c170a08ee could be back-patched?) so it's not clear which process that was, and there's no backtrace. I don't know if "pg_catalog.set_config('search_path', '', false)" is a clue that this is related to another recent crash[2] I mentioned, also from the 13 branch. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-16%2023:58:47 [2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BqdF6QE6rcj_Zj5h2qVARM--%2B92sqdmr-0DUSM_0Qu_g%40mail.gmail.com