Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 5:38 AM Tomas Vondra wrote: > > > The issue with using commit timestamps is that, when multiple nodes > > are involved, the commit timestamp won't accurately represent the > > actual order of operations. There's no reliable way to determine the > > perfect order of each operation happening on different nodes roughly > > simultaneously unless we use some globally synchronized counter. > > Generally, that order might not cause real issues unless one operation > > is triggered by a previous operation, and relying solely on physical > > timestamps would not detect that correctly. > > > This whole conflict detection / resolution proposal is based on using > commit timestamps. Aren't you suggesting it can't really work with > commit timestamps? > > FWIW there are ways to builds distributed consistency with timestamps, > as long as it's monotonic - e.g. clock-SI does that. It's not perfect, > but it shows it's possible. Hmm, I see that clock-SI does this by delaying the transaction when it detects the clock skew. > However, I'm not we have to go there - it depends on what the goal is. > For a one-directional replication (multiple nodes replicating to the > same target) it might be sufficient if the conflict resolution is > "deterministic" (e.g. not dependent on the order in which the changes > are applied). I'm not sure, but it's why I asked what's the goal in my > very first message in this thread. I'm not completely certain about this. Even in one directional replication if multiple nodes are sending data how can we guarantee determinism in the presence of clock skew if we are not using some other mechanism like logical counters or something like what clock-SI is doing? I don't want to insist on using any specific solution here. However, I noticed that we haven't addressed how we plan to manage clock skew, which is my primary concern. I believe that if multiple nodes are involved and we're receiving data from them with unsynchronized clocks, ensuring determinism about their order will require us to take some measures to handle that. > > We need some sort of logical counter, such as a vector clock, which > > might be an independent counter on each node but can perfectly track > > the causal order. For example, if NodeA observes an operation from > > NodeB with a counter value of X, NodeA will adjust its counter to X+1. > > This ensures that if NodeA has seen an operation from NodeB, its next > > operation will appear to have occurred after NodeB's operation. > > > > I admit that I haven't fully thought through how we could design such > > version tracking in our logical replication protocol or how it would > > fit into our system. However, my point is that we need to consider > > something beyond commit timestamps to achieve reliable ordering. > > > > I can't really respond to this as there's no suggestion how it would be > implemented in the patch discussed in this thread. > No worries, I'll consider whether finding such a solution is feasible for our situation. Thank you! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
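As a concrete reference point for the commit-timestamp ordering being debated above, the timestamps in question can be inspected directly once track_commit_timestamp is enabled; a minimal sketch (mytable is a placeholder name):

    -- requires track_commit_timestamp = on; shows the physical commit timestamps
    -- whose cross-node skew is the concern discussed in this thread
    SELECT xmin, pg_xact_commit_timestamp(xmin) AS committed_at, *
    FROM mytable
    ORDER BY committed_at;

Nothing here resolves the ordering question; it only shows the raw data a timestamp-based resolver would have to work with.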
Re: RFC: adding pytest as a supported test framework
Hi Greg, Jelte, On Sat, 15 Jun 2024 at 23:53, Greg Sabino Mullane wrote: > > On Sat, Jun 15, 2024 at 12:48 PM Jelte Fennema-Nio wrote: >> >> Afaict, there's a significant part of our current community who feel the same way (and I'm pretty sure every sub-30 year old person who >> newly joins the community would feel the exact same way too). I feel I'm still relatively new (started in the past 4 years) and I have quite some time left before I hit 30 years of age. I don't specifically feel the way you describe, nor do I think I've ever really felt like that in the previous nearly 4 years of hacking. Then again, I'm not interested in testing frameworks, so I don't feel much at all about which test frameworks we use. > Those young-uns are also the same group who hold their nose when coding in C, and are always clamoring for rewriting Postgres in Rust. Could you point me to one occasion I have 'always' clamored for this, or any of "those young-uns" in the community? I may not be a huge fan of C, but rewriting PostgreSQL in [other language] is not on the list of things I'm clamoring for. I may have given off-hand mentions that [other language] would've helped in certain cases, sure, but I'd hardly call that clamoring. Kind regards, Matthias van de Meent
Re: Pgoutput not capturing the generated columns
Hi, here are my review comments for patch v7-0001. == 1. GENERAL - \dRs+ Shouldn't the new SUBSCRIPTION parameter be exposed via "describe" (e.g. \dRs+ mysub) the same as the other boolean parameters? == Commit message 2. When 'include_generated_columns' is false then the PUBLICATION col-list will ignore any generated cols even when they are present in a PUBLICATION col-list ~ Maybe you don't need to mention "PUBLICATION col-list" twice. SUGGESTION When 'include_generated_columns' is false, generated columns are not replicated, even when present in a PUBLICATION col-list. ~~~ 2. CREATE SUBSCRIPTION test1 connection 'dbname=postgres host=localhost port= 'publication pub1; ~ 2a. (I've questioned this one in previous reviews) What exactly is the purpose of this statement in the commit message? Was this supposed to demonstrate the usage of the 'include_generated_columns' parameter? ~ 2b. /publication/ PUBLICATION/ ~~~ 3. If the subscriber-side column is also a generated column then thisoption has no effect; the replicated data will be ignored and the subscriber column will be filled as normal with the subscriber-side computed or default data. ~ Missing space: /thisoption/this option/ == .../expected/decoding_into_rel.out 4. +-- When 'include-generated-columns' is not set +SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); +data +- + BEGIN + table public.gencoltable: INSERT: a[integer]:1 b[integer]:2 + table public.gencoltable: INSERT: a[integer]:2 b[integer]:4 + table public.gencoltable: INSERT: a[integer]:3 b[integer]:6 + COMMIT +(5 rows) Why is the default value here equivalent to 'include-generated-columns' = '1' here instead of '0'? The default for the CREATE SUBSCRIPTION parameter 'include_generated_columns' is false, and IMO it seems confusing for these 2 defaults to be different. Here I think it should default to '0' *regardless* of what the previous functionality might have done -- e.g. this is a "test decoder" so the parameter should behave sensibly. == .../test_decoding/sql/decoding_into_rel.sql NITPICK - wrong comments. == doc/src/sgml/protocol.sgml 5. + + include_generated_columns + + +Boolean option to enable generated columns. This option controls +whether generated columns should be included in the string +representation of tuples during logical decoding in PostgreSQL. +The default is false. + + + + Does the protocol version need to be bumped to support this new option and should that be mentioned on this page similar to how all other version values are mentioned? == doc/src/sgml/ref/create_subscription.sgml NITPICK - some missing words/sentence. NITPICK - some missing tags. NITPICK - remove duplicated sentence. NITPICK - add another . == src/backend/commands/subscriptioncmds.c 6. #define SUBOPT_ORIGIN 0x8000 +#define SUBOPT_include_generated_columns 0x0001 Please use UPPERCASE for consistency with other macros. == .../libpqwalreceiver/libpqwalreceiver.c 7. + if (options->proto.logical.include_generated_columns && + PQserverVersion(conn->streamConn) >= 17) + appendStringInfoString(&cmd, ", include_generated_columns 'on'"); + IMO it makes more sense to say 'true' here instead of 'on'. It seems like this was just cut/paste from the above code (where 'on' was sensible). == src/include/catalog/pg_subscription.h 8. @@ -98,6 +98,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW * slots) in the upstream database are enabled * to be synchronized to the standbys. 
*/ + bool subincludegencol; /* True if generated columns must be published */ + Not fixed as claimed. This field name ought to be plural. /subincludegencol/subincludegencols/ ~~~ 9. char*origin; /* Only publish data originating from the * specified origin */ + bool includegencol; /* publish generated column data */ } Subscription; Not fixed as claimed. This field name ought to be plural. /includegencol/includegencols/ == src/test/subscription/t/031_column_list.pl 10. +$node_publisher->safe_psql('postgres', + "CREATE TABLE tab2 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED)" +); + +$node_publisher->safe_psql('postgres', + "CREATE TABLE tab3 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a + 10) STORED)" +); + $node_subscriber->safe_psql('postgres', "CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 22) STORED, c int)" ); +$node_subscriber->safe_psql('postgres', + "CREATE TABLE tab2 (a int PRIMARY KEY, b int)" +); + +$node_subscriber->safe_psql('postgres', + "CREATE TABLE tab3 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a + 20) STORED)" +); IMO the test needs lots more comments to describe what it is doing: For example, the setu
Re: Using LibPq in TAP tests via FFI
On 2024-Jun-16, Andrew Dunstan wrote: > +sub query_oneval > +{ > + my $self = shift; > + my $sql = shift; > + my $missing_ok = shift; # default is not ok > + my $conn = $self->{conn}; > + my $result = PQexec($conn, $sql); > + my $ok = $result && (PQresultStatus($result) == PGRES_TUPLES_OK); > + unless ($ok) > + { > + PQclear($result) if $result; > + return undef; > + } > + my $ntuples = PQntuples($result); > + return undef if ($missing_ok && !$ntuples); > + my $nfields = PQnfields($result); > + die "$ntuples tuples != 1 or $nfields fields != 1" > + if $ntuples != 1 || $nfields != 1; > + my $val = PQgetvalue($result, 0, 0); > + if ($val eq "") > + { > + $val = undef if PGgetisnull($result, 0, 0); > + } > + PQclear($result); > + return $val; > +} Hmm, here you use PGgetisnull, is that a typo for PQgetisnull? If it is, then I wonder why doesn't this fail in some obvious way? Is this part dead code maybe? > +# return tuples like psql's -A -t mode. > + > +sub query_tuples > +{ > + my $self = shift; > + my @results; > + foreach my $sql (@_) > + { > + my $res = $self->query($sql); > + # join will render undef as an empty string here > + no warnings qw(uninitialized); > + my @tuples = map { join('|', @$_); } @{$res->{rows}}; > + push(@results, join("\n",@tuples)); > + } > + return join("\n",@results); > +} You made this function join the tuples from multiple queries together, but the output format doesn't show anything for queries that return empty. I think this strategy doesn't cater for the case of comparing results from multiple queries very well, because it might lead to sets of queries that return empty result for different queries reported as identical when they aren't. Maybe add a separator line between the results from each query, when there's more than one? (Perhaps just "join('--\n', @results)" in that last line does the trick?) -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "The Postgresql hackers have what I call a "NASA space shot" mentality. Quite refreshing in a world of "weekend drag racer" developers." (Scott Marlowe)
Re: SQL/JSON query functions context_item doc entry and type requirement
On Mon, Jun 17, 2024 at 2:43 PM Amit Langote wrote: > > Hi, > > On Tue, Jun 4, 2024 at 12:11 AM jian he wrote: > > > > hi > > based on gram.y and function transformJsonValueExpr. > > > > gram.y: > > | JSON_QUERY '(' > > json_value_expr ',' a_expr json_passing_clause_opt > > json_returning_clause_opt > > json_wrapper_behavior > > json_quotes_clause_opt > > json_behavior_clause_opt > > ')' > > > > | JSON_EXISTS '(' > > json_value_expr ',' a_expr json_passing_clause_opt > > json_on_error_clause_opt > > ')' > > > > | JSON_VALUE '(' > > json_value_expr ',' a_expr json_passing_clause_opt > > json_returning_clause_opt > > json_behavior_clause_opt > > ')' > > > > json_format_clause_opt contains: > > | FORMAT_LA JSON > > { > > $$ = (Node *) makeJsonFormat(JS_FORMAT_JSON, JS_ENC_DEFAULT, @1); > > } > > > > > > That means, all the context_item can specify "FORMAT JSON" options, > > in the meantime, do we need to update these functions > > synopsis/signature in the doc? > > > > If I understand correctly, you're suggesting that we add a line to the > above paragraph to mention which types are appropriate for > context_item. How about we add the following: > > context_item expression can be a value of > any type that can be cast to jsonb. This includes types > such as char, text, bpchar, > character varying, and bytea (with > ENCODING UTF8), as well as any domains over these types. your wording looks ok to me. I want to add two sentences. so it becomes: + The context_item expression can be a value of +any type that can be cast to jsonb. This includes types + such as char, text, bpchar, +character varying, and bytea (with +ENCODING UTF8), as well as any domains over these types. +The context_item expression can also be followed with +FORMAT JSON, ENCODING UTF8. +These two options currently don't have actual meaning. +ENCODING UTF8 can only be specified when context_item type is bytea. imho, "These two options currently don't have actual meaning." is accurate, but still does not explain why we allow "FORMAT JSON ENCODING UTF8". I think we may need an explanation for "FORMAT JSON ENCODING UTF8". because json_array, json_object, json_serialize, json all didn't mention the meaning of "[ FORMAT JSON [ ENCODING UTF8 ] ] ". I added "[ FORMAT JSON [ ENCODING UTF8 ] ] " to the function signature/synopsis of json_exists, json_query, json_value. From c5e2b51d351c3987e96290363288499f1a369e49 Mon Sep 17 00:00:00 2001 From: jian he Date: Mon, 17 Jun 2024 17:35:01 +0800 Subject: [PATCH v1 1/1] minor SQL/JSON Query Functions doc fix main fix doc context_item description. --- doc/src/sgml/func.sgml | 21 + 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 17c44bc3..18db4a5b 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18663,7 +18663,17 @@ $.* ? (@ like_regex "^\\d+$") described in can be used to query JSON documents. Each of these functions apply a path_expression (the query) to a - context_item (the document); see + context_item (the document). + The context_item expression can be a value of +any type that can be cast to jsonb. This includes types + such as char, text, bpchar, +character varying, and bytea (with +ENCODING UTF8), as well as any domains over these types. +The context_item expression can also be followed with +FORMAT JSON, ENCODING UTF8. +These two options currently don't have actual meaning. +ENCODING UTF8 can only be specified when context_item type is bytea. 
+ See for more details on what path_expression can contain. @@ -18689,7 +18699,8 @@ $.* ? (@ like_regex "^\\d+$") json_exists json_exists ( -context_item, path_expression PASSING { value AS varname } , ... +context_item FORMAT JSON ENCODING UTF8 , +path_expression PASSING { value AS varname } , ... { TRUE | FALSE | UNKNOWN | ERROR } ON ERROR ) @@ -18729,7 +18740,8 @@ ERROR: jsonpath array subscript is out of bounds json_query json_query ( -context_item, path_expression PASSING { value AS varname } , ... +context_item FORMAT JSON ENCODING UTF8 , +path_expression PASSING { value AS varname } , ... RETURNING data_type FORMAT JSON ENCODING UTF8 { WITHOUT | WITH { CONDITIONAL | UNCONDITIONAL } } ARRAY WRAPPER { KEEP | OMIT } QUOTES ON SCALAR STRING @@ -18806,7 +18818,8 @@ DETAIL: Missing "]" after array dimensions. json_value json_value ( -context_item, path_expression +context_item FORMAT JSON ENCODING UTF8 , +path_expression PASSING { value AS varname } , ... RETURNING data_type { ERROR | NULL | DEFAULT expression } ON EMPTY -- 2.34.1
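To make the proposed wording concrete, two small examples of what the paragraph above describes, assuming the v17 SQL/JSON query functions (values are arbitrary):

    -- context_item supplied as text: the value is cast to jsonb before the
    -- jsonpath is applied
    SELECT JSON_VALUE('{"a": 1}'::text, '$.a');

    -- FORMAT JSON is accepted after context_item but, as the patch notes,
    -- currently has no additional effect
    SELECT JSON_QUERY('{"a": [1, 2, 3]}'::text FORMAT JSON, '$.a');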
Re: Conflict Detection and Resolution
On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > wrote: > > > > Yes, that's correct. However, many cases could benefit from the > > > update_deleted conflict type if it can be implemented reliably. That's > > > why we wanted to give it a try. But if we can't achieve predictable > > > results with it, I'm fine to drop this approach and conflict_type. We > > > can consider a better design in the future that doesn't depend on > > > non-vacuumed entries and provides a more robust method for identifying > > > deleted rows. > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > I'm not arguing against that - my point is actually that I think this > > conflict type is required, and that it needs to be detected reliably. > > > > When working with a distributed system, we must accept some form of > eventual consistency model. However, it's essential to design a > predictable and acceptable behavior. For example, if a change is a > result of a previous operation (such as an update on node B triggered > after observing an operation on node A), we can say that the operation > on node A happened before the operation on node B. Conversely, if > operations on nodes A and B are independent, we consider them > concurrent. > > In distributed systems, clock skew is a known issue. To establish a > consistency model, we need to ensure it guarantees the > "happens-before" relationship. Consider a scenario with three nodes: > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > subsequently NodeB makes changes, and then both NodeA's and NodeB's > changes are sent to NodeC, the clock skew might make NodeB's changes > appear to have occurred before NodeA's changes. However, we should > maintain data that indicates NodeB's changes were triggered after > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > changes happened after NodeA's changes, despite what the timestamps > suggest. > > A common method to handle such cases is using vector clocks for > conflict resolution. > I think the unbounded size of the vector could be a problem to store for each event. However, while researching previous discussions, it came to our notice that we have discussed this topic in the past as well in the context of standbys. For recovery_min_apply_delay, we decided the clock skew is not a problem as the settings of this parameter are much larger than typical time deviations between servers as mentioned in docs. Similarly for casual reads [1], there was a proposal to introduce max_clock_skew parameter and suggesting the user to make sure to have NTP set up correctly. We have tried to check other databases (like Ora and BDR) where CDR is implemented but didn't find anything specific to clock skew. So, I propose to go with a GUC like max_clock_skew such that if the difference of time between the incoming transaction's commit time and the local time is more than max_clock_skew then we raise an ERROR. It is not clear to me that putting bigger effort into clock skew is worth especially when other systems providing CDR feature (like Ora or BDR) for decades have not done anything like vector clocks. It is possible that this is less of a problem w.r.t CDR and just detecting the anomaly in clock skew is good enough. [1] - https://www.postgresql.org/message-id/flat/CAEepm%3D1iiEzCVLD%3DRoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg%40mail.gmail.com -- With Regards, Amit Kapila.
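To illustrate the semantics of the proposed GUC, a purely hypothetical sketch of the comparison it would perform, written as SQL (the 5-second limit and remote_commit_ts are made-up stand-ins for max_clock_skew and the incoming transaction's commit timestamp):

    -- flags the case in which the proposal would raise an ERROR: the remote
    -- commit time is ahead of the local clock by more than the allowed skew
    SELECT remote_commit_ts,
           (remote_commit_ts - now()) > interval '5 seconds' AS exceeds_max_clock_skew
    FROM (VALUES (now() + interval '10 seconds')) AS t(remote_commit_ts);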
Re: ecdh support causes unnecessary roundtrips
> On 17 Jun 2024, at 01:46, Andres Freund wrote: > When connecting with a libpq based client, the TLS establishment ends up like > this in many configurations; > > C->S: TLSv1 393 Client Hello > S->C: TLSv1.3 167 Hello Retry Request, Change Cipher Spec > C->S: TLSv1.3 432 Change Cipher Spec, Client Hello > S->C: TLSv1.3 1407 Server Hello, Application Data, Application Data, > Application Data, Application Data I wonder if this is the result of us still using TLS 1.2 as the default minimum protocol version. In 1.2 the ClientHello doesn't contain the extensions for cipher and curves, so the server must reply with a HRR (Hello Retry Request) message asking the client to set protocol, curve and cipher. The client will respond with a new Client Hello using 1.3 with the extensions. Using 1.3 as the default minimum has been on my personal TODO for a while, maybe that should be revisited for 18. > I.e. there are two clients hellos, because the server rejects the clients > "parameters". Or the client didn't send any. > This appears to be caused by ECDH support. The difference between the two > ClientHellos is > -Extension: key_share (len=38) x25519 > +Extension: key_share (len=71) secp256r1 > > I.e. the clients wanted to use x25519, but the server insists on secp256r1. Somewhat related, Erica Zhang has an open patch to make the server-side curves configuration take a list rather than a single curve [0], and modernizing the API used as a side effect (SSL_CTX_set_tmp_ecdh is documented as obsoleted by OpenSSL, but not deprecated with an API level). > I don't know if it's good that we're calling SSL_CTX_set_tmp_ecdh at all, To set the specified curve in ssl_ecdh_curve we have to don't we? > but if it is, shouldn't we at least do the same in libpq, so we don't > introduce > unnecessary roundtrips? If we don't set the curve in the client I believe OpenSSL will pass the set of supported curves the client has, which then should allow the server to pick the one it wants based on ssl_ecdh_curve, so ideally we shouldn't have to I think. > I did confirm that doing the same thing on the client side removes the > additional roundtrip. The roundtrip went away because the client was set to use secp256r1? I wonder if that made OpenSSL override the min protocol version and switch to a TLS1.3 ClientHello since it otherwise couldn't announce the curve. If you force the client min protocol to 1.3 in an unpatched client, do you see the same speedup? -- Daniel Gustafsson [0] tencent_1b368b07f4b5886f9822981189da736cf...@qq.com
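For anyone wanting to test the TLS 1.3 theory above, a hedged sketch of forcing the floor on both ends (ssl_min_protocol_version exists as a server GUC since v12 and as a libpq connection parameter since v13):

    -- server side: raise the TLS floor and reload (SSL settings are reloadable)
    ALTER SYSTEM SET ssl_min_protocol_version = 'TLSv1.3';
    SELECT pg_reload_conf();
    -- client side, for Daniel's question about an unpatched libpq:
    -- psql "host=... sslmode=require ssl_min_protocol_version=TLSv1.3"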
Re: Vacuum statistics
I have written the documentary and attached the patch. On 08.06.2024 09:30, Alena Rybakina wrote: Iam currentlyworkingondividingthispatchintothreepartstosimplifythe reviewprocess:oneofthemwillcontaincodeforcollectingvacuumstatisticsontables,the secondonindexesandthe lastondatabases.I alsowritethe documentation. -- Regards, Alena Rybakina Postgres Professional:http://www.postgrespro.com The Russian Postgres Company From ba01e9b5b14a0b46482beea20127062e8ae7e067 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Mon, 17 Jun 2024 00:48:45 +0300 Subject: [PATCH] Add documentation about the system views that are used in the machinery of vacuum statistics. --- doc/src/sgml/system-views.sgml | 747 + 1 file changed, 747 insertions(+) diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 67a42a1c4d4..dfef8ed44b8 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -5738,4 +5738,751 @@ ALTER DATABASE template0 ALLOW_CONNECTIONS off; + + pg_stats_vacuum_database + + + pg_stats_vacuum_database + + + + The view pg_stats_vacuum_database will contain + one row for each database in the current cluster, showing statistics about + vacuuming that database. + + + + pg_stats_vacuum_database Columns + + + + + Column Type + + + Description + + + + + + + + dbid oid + + + OID of a database + + + + + + total_blks_read int8 + + +Number of database blocks read by vacuum operations +performed on this database + + + + + + total_blks_hit int8 + + +Number of times database blocks were found in the +buffer cache by vacuum operations +performed on this database + + + + + + total_blks_dirtied int8 + + +Number of database blocks dirtied by vacuum operations +performed on this database + + + + + + total_blks_written int8 + + +Number of database blocks written by vacuum operations +performed on this database + + + + + + wal_records int8 + + +Total number of WAL records generated by vacuum operations +performed on this database + + + + + + wal_fpi int8 + + +Total number of WAL full page images generated by vacuum operations +performed on this database + + + + + + wal_bytes numeric + + +Total amount of WAL bytes generated by vacuum operations +performed on this database + + + + + + blk_read_time float8 + + +Time spent reading database blocks by vacuum operations performed on +this database, in milliseconds (if is enabled, +otherwise zero) + + + + + + blk_write_time float8 + + +Time spent writing database blocks by vacuum operations performed on +this database, in milliseconds (if is enabled, +otherwise zero) + + + + + + delay_time float8 + + +Time spent sleeping in a vacuum delay point by vacuum operations performed on +this database, in milliseconds (see +for details) + + + + + + system_time float8 + + +System CPU time of vacuuming this database, in milliseconds + + + + + + user_time float8 + + +User CPU time of vacuuming this database, in milliseconds + + + + + + total_time float8 + + +Total time of vacuuming this database, in milliseconds + + + + + + interrupts int4 + + +Number of times vacuum operations performed on this database +were interrupted on any errors + + + + + + + + + pg_stats_vacuum_indexes + + + pg_stats_vacuum_indexes + + + + The view pg_stats_vacuum_indexes will contain + one row for each index in the current database (including TOAST + table indexes), showing statistics about vacuuming that specific index. 
+ + + + pg_stats_vacuum_indexes Columns + + + + + Column Type + + + Description + + + + + + + + relid oid + + + OID of an index + + + + + + schema name + + +Name of the schema this index is in + + + + + + relname name + + + Name of this index + + + + + + total_blks_read int8 + + +Number of database blocks read by vacuum operations +
Re: POC, WIP: OR-clause support for indexes
Hi, thank you for your work with this subject!

On 14.06.2024 15:00, Alexander Korotkov wrote: On Mon, Apr 8, 2024 at 1:34 AM Alexander Korotkov wrote: I've revised the patch. Did some beautification, improvements for documentation, commit messages etc. I've pushed the 0001 patch without 0002. I think 0001 is good by itself given that there is the or_to_any_transform_limit GUC option. The more similar OR clauses are here the more likely grouping them into SAOP will be a win. But I've changed the default value to 5. This will make it less invasive and affect only queries with obvious repeating patterns. That also reduced the changes in the regression tests expected outputs. Regarding 0002, it seems questionable since it could cause a planning slowdown for SAOP's with large arrays. Also, it might reduce the win of transformation made by 0001. So, I think we should skip it for now.

The patch has been reverted from pg17. Let me propose a new version for pg18 based on the valuable feedback from Tom Lane [1][2].

* The transformation is moved to the stage of adding restrictinfos to the base relation (in particular add_base_clause_to_rel()). This leads to interesting consequences. While this allows IndexScans to use transformed clauses, BitmapScans and SeqScans seem unaffected. Therefore, I wasn't able to find a planning regression.
* As soon as there is no planning regression anymore, I've removed or_to_any_transform_limit GUC, which was a source of critics.
* Now, not only Consts allowed in the SAOP's list, but also Params.
* The criticized hash based on expression jumbling has been removed. Now, the plain list is used instead.
* OrClauseGroup now gets a legal node tag. That allows to mix it in the list with other nodes without hacks.

I think this patch shouldn't be as good as before for optimizing performance of large OR lists, given that BitmapScans and SeqScans still deal with ORs. However, it allows IndexScans to handle more, doesn't seem to cause planning regression and therefore introduce no extra GUC. Overall, this seems like a good compromise. This patch could use some polishing, but I'd like to first hear some feedback on general design.

Links
1. https://www.postgresql.org/message-id/3604469.1712628736%40sss.pgh.pa.us
2. https://www.postgresql.org/message-id/3649287.1712642139%40sss.pgh.pa.us

I noticed that 7 libraries have been added to src/backend/optimizer/plan/initsplan.c, and as far as I remember, Tom Lane has already expressed doubts about the approach that requires adding a large number of libraries [0], but I'm afraid I'm out of ideas about an alternative approach.
In addition, I checked the fix in the previous cases that you wrote earlier [1] and noticed that SeqScan continues to be generated, unfortunately, without converting the expressions:

with patch:

create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,100) i);
create index test_x_1_y on test (y) where x = 1;
create index test_x_2_y on test (y) where x = 2;
vacuum analyze test;
SELECT 100
CREATE INDEX
CREATE INDEX
VACUUM
alena@postgres=# explain select * from test where (x = 1 or x = 2) and y = 100;
                               QUERY PLAN
--
 Gather (cost=1000.00..12690.10 rows=1 width=12)
   Workers Planned: 2
   -> Parallel Seq Scan on test (cost=0.00..11690.00 rows=1 width=12)
        Filter: (((x = 1) OR (x = 2)) AND (y = '100'::double precision))
(4 rows)

alena@postgres=# set enable_seqscan =off;
SET
alena@postgres=# explain select * from test where (x = 1 or x = 2) and y = 100;
                            QUERY PLAN
-
 Seq Scan on test (cost=100.00..1020440.00 rows=1 width=12)
   Filter: (((x = 1) OR (x = 2)) AND (y = '100'::double precision))
(2 rows)

without patch:

--
 Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12)
   Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = '100'::double precision) AND (x = 2)))
   -> BitmapOr (cost=8.60..8.60 rows=1 width=0)
        -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0)
             Index Cond: (y = '100'::double precision)
        -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0)
             Index Cond: (y = '100'::double precision)
(7 rows)

[0] https://www.postgresql.org/message-id/3604469.1712628736%40sss.pgh.pa.us
[1] https://www.postgresql.org/message-id/CAPpHfduJtO0s9E%3DSHUTzrCD88BH0eik0UNog1_q3XBF2wLmH6g%40mail.gmail.com

-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Avoid orphaned objects dependencies, take 3
Hi, On Thu, Jun 13, 2024 at 04:52:09PM +, Bertrand Drouvot wrote: > Hi, > > On Thu, Jun 13, 2024 at 10:49:34AM -0400, Robert Haas wrote: > > On Fri, Jun 7, 2024 at 4:41 AM Bertrand Drouvot > > wrote: > > > Do you still find the code hard to maintain with v9? > > > > I don't think it substantially changes my concerns as compared with > > the earlier version. > > Thanks for the feedback, I'll give it more thoughts. Please find attached v10 that puts the object locking outside of the dependency code. It's done that way except for: recordDependencyOnExpr() recordDependencyOnSingleRelExpr() makeConfigurationDependencies() The reason is that I think that it would need part of the logic that his inside the above functions to be duplicated and I'm not sure that's worth it. For example, we would probably need to: - make additional calls to find_expr_references_walker() - make additional scan on the config map It's also not done outside of recordDependencyOnCurrentExtension() as: 1. I think it is clear enough that way (as it is clear that the lock is taken on a ExtensionRelationId object class). 2. why to include "commands/extension.h" in more places (locking would depend of "creating_extension" and "CurrentExtensionObject"), while 1.? Remarks: 1. depLockAndCheckObject() and friends in v9 have been renamed to LockNotPinnedObject() and friends (as the vast majority of their calls are now done outside of the dependency code). 2. regarding the concern around RelationRelationId (discussed in [1]), v10 adds a comment "XXX Do we need a lock for RelationRelationId?" at the places we may want to lock this object class. I did not think about it yet (but will do), I only added this comment at multiple places. I think that v10 is easier to follow (as compare to v9) as we can easily see for which object class we'll put a lock on. Thoughts? [1]: https://www.postgresql.org/message-id/Zmv3TPfJAyQXhIdu%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From dc75e3255803617d21c55d77dc218904bd729d81 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Fri, 29 Mar 2024 15:43:26 + Subject: [PATCH v10] Avoid orphaned objects dependencies It's currently possible to create orphaned objects dependencies, for example: Scenario 1: session 1: begin; drop schema schem; session 2: create a function in the schema schem session 1: commit; With the above, the function created in session 2 would be linked to a non existing schema. Scenario 2: session 1: begin; create a function in the schema schem session 2: drop schema schem; session 1: commit; With the above, the function created in session 1 would be linked to a non existing schema. To avoid those scenarios, a new lock (that conflicts with a lock taken by DROP) has been put in place before the dependencies are being recorded. With this in place, the drop schema in scenario 2 would be locked. Also, after the new lock attempt, the patch checks that the object still exists: with this in place session 2 in scenario 1 would be locked and would report an error once session 1 committs (that would not be the case should session 1 abort the transaction). If the object is dropped before the new lock attempt is triggered then the patch would also report an error (but with less details). 
The patch adds a few tests for some dependency cases (that would currently produce orphaned objects): - schema and function (as the above scenarios) - alter a dependency (function and schema) - function and arg type - function and return type - function and function - domain and domain - table and type - server and foreign data wrapper --- src/backend/catalog/aclchk.c | 1 + src/backend/catalog/dependency.c | 109 ++- src/backend/catalog/heap.c| 8 ++ src/backend/catalog/index.c | 16 +++ src/backend/catalog/objectaddress.c | 57 src/backend/catalog/pg_aggregate.c| 9 ++ src/backend/catalog/pg_attrdef.c | 1 + src/backend/catalog/pg_cast.c | 5 + src/backend/catalog/pg_collation.c| 1 + src/backend/catalog/pg_constraint.c | 11 ++ src/backend/catalog/pg_conversion.c | 2 + src/backend/catalog/pg_depend.c | 27 +++- src/backend/catalog/pg_operator.c | 19 +++ src/backend/catalog/pg_proc.c | 17 ++- src/backend/catalog/pg_publication.c | 5 + src/backend/catalog/pg_range.c| 6 + src/backend/catalog/pg_type.c | 33 + src/backend/catalog/toasting.c| 1 + src/backend/commands/alter.c | 4 + src/backend/commands/amcmds.c | 1 + src/backend/commands/cluster.c| 5 + src/backend/commands/event_trigger.c |
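For reference, scenario 1 from the commit message above spelled out as plain SQL (same schema name, with an example function name):

    -- setup
    CREATE SCHEMA schem;
    -- session 1
    BEGIN;
    DROP SCHEMA schem;
    -- session 2, while session 1's transaction is still open; without the patch
    -- this succeeds because nothing locks the schema being dropped
    CREATE FUNCTION schem.f() RETURNS int LANGUAGE sql AS 'SELECT 1';
    -- session 1
    COMMIT;
    -- without the patch, schem.f() is now linked to a schema that no longer
    -- exists; with it, session 2 blocks and errors out once session 1 commits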
Re: speed up a logical replica setup
On 07.06.24 05:49, Euler Taveira wrote: Here it is a patch series to fix the issues reported in recent discussions. The patches 0001 and 0003 aim to fix the buildfarm issues. The patch 0002 removes synchronized failover slots on subscriber since it has no use. I also included an optional patch 0004 that improves the usability by checking both servers if it already failed in any subscriber check. I have committed 0001, 0002, and 0003. Let's keep an eye on the buildfarm to see if that stabilizes things. So far it looks good. For 0004, I suggest inverting the result values from check_publisher() and create_subscriber() so that it returns true if the check is ok.
Re: speed up a logical replica setup
On 07.06.24 11:17, Hayato Kuroda (Fujitsu) wrote: Other minor comments are included by the attached diff file. It contains changes to follow conventions and pgindent/pgperltidy. I have included some of your changes in the patches from Euler that I committed today. The 0004 patch might get rerolled by Euler, so he could include your relevant changes there. I'm not sure about moving the sync_replication_slots setting around. It seems to add a lot of work for no discernible gain? The pgperltidy changes are probably better done separately, because they otherwise make quite a mess of the patch. Maybe we'll just do that once more before we branch.
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 17, 2024 at 1:33 PM Alena Rybakina wrote: > I noticed that 7 libraries have been added to > src/backend/optimizer/plan/initsplan.c, and as far as I remember, Tom Lane > has already expressed doubts about the approach that requires adding a large > number of libraries [0], but I'm afraid I'm out of ideas about alternative > approach. Thank you for pointing. Right, the number of extra headers included was one of points for criticism on this patch. I'll look to move this functionality elsewhere, while the stage of transformation could probably be the same. > In addition, I checked the fix in the previous cases that you wrote earlier > [1] and noticed that SeqScan continues to generate, unfortunately, without > converting expressions: I've rechecked and see I made wrong conclusion about this. The plan regression is still here. But I'm still looking to workaround this without extra GUC. I think we need to additionally do something like [1], but take further steps to avoid planning overhead when not necessary. In particular, I think we should only consider splitting SAOP for bitmap OR in the following cases: 1. There are partial indexes with predicates over target column. 2. There are multiple indexes covering target column and different subsets of other columns presented in restrictions. 3. There are indexes covreing target column without support of SAOP (amsearcharray == false). Hopefully this should skip generation of useless bitmap paths in majority cases. Thoughts? Links. 1. https://www.postgresql.org/message-id/67bd918d-285e-44d2-a207-f52d9a4c35e6%40postgrespro.ru -- Regards, Alexander Korotkov Supabase
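To make case 1 above concrete, a sketch that reuses the partial-index example from Alena's earlier message; the idea is that only setups like this would pay the extra planning cost of splitting the SAOP:

    -- partial indexes whose predicates mention the SAOP column; splitting
    -- x = ANY ('{1,2}') back into x = 1 OR x = 2 is what lets a BitmapOr use them
    CREATE INDEX test_x_1_y ON test (y) WHERE x = 1;
    CREATE INDEX test_x_2_y ON test (y) WHERE x = 2;
    EXPLAIN SELECT * FROM test WHERE x = ANY (ARRAY[1, 2]) AND y = 100;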
Re: POC, WIP: OR-clause support for indexes
On Mon, Jun 17, 2024 at 7:02 AM Andrei Lepikhov wrote: > On 6/14/24 19:00, Alexander Korotkov wrote: > > This patch could use some polishing, but I'd like to first hear some > > feedback on general design. > Thanks for your time and efforts. I have skimmed through the code—there > is a minor fix in the attachment. > First and foremost, I think this approach can survive. > But generally, I'm not happy with manipulations over a restrictinfo clause: > 1. While doing that, we should remember the fields of the RestrictInfo > clause. It may need to be changed, too, or it can require such a change > in the future if someone adds new logic. > 2. We should remember the link to the RestrictInfo: see how the caller > of the distribute_restrictinfo_to_rels routine manipulates its fields > right after the distribution. > 3. Remember caches and cached decisions inside the RestrictInfo > structure: replacing the clause should we change these fields too? > > These were the key reasons why we shifted the code to the earlier stages > in the previous incarnation. So, going this way we should recheck all > the fields of this structure and analyse how the transformation can > [potentially] affect their values. I see your points. Making this at the stage of restrictinfos seems harder, and there are open questions in the patch. I'd like to hear how Tom feels about this. Is this the right direction, or should we try another way? -- Regards, Alexander Korotkov Supabase
Re: RFC: adding pytest as a supported test framework
Hi Jacob, > For the v18 cycle, I would like to try to get pytest [1] in as a > supported test driver, in addition to the current offerings. > > (I'm tempted to end the email there.) Huge +1 from me and many thanks for working on this. Two cents from me. I spent a fair part of my career writing in Perl. Personally I like the language and often find it more convenient for the tasks I'm working on than Python. This being said, there were several projects I was involved in where we had to choose a scripting language. In all the cases there was a strong push-back against Perl and Python always seemed to be a common denominator for everyone. So I ended up with the rule of thumb to use Perl for projects I'm working on alone and Python otherwise. Although the objective reality in the entire industry is unknown to me I spoke to many people whose observations were similar. We could speculate about the reasons why people seem to prefer Python (good IDE support*, unique** libraries like Matplotlib / NumPy / SciPy, ...) but honestly I don't think they are extremely relevant in this discussion. I believe supporting Python in our test infrastructure will attract more contributors and thus would be a good step for the project in the long run. * including PyTest integration ** citation needed -- Best regards, Aleksander Alekseev
Is creating logical replication slots in template databases useful at all?
Hi, While looking at the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=29d0a77fa6606f9c01ba17311fc452dabd3f793d, I noticed that get_old_cluster_logical_slot_infos gets called for even template1 and template0 databases. Which means, pg_upgrade executes queries against the template databases to get replication slot information. I then realized that postgres allows one to connect to template1 database (or any other user-defined template databases for that matter), and create logical replication slots in it. If created, all the subsequent database creations will end up adding inactive logical replication slots in the postgres server. This might not be a problem in production servers as I assume the connections to template databases are typically restricted. Despite the connection restrictions, if at all one gets to connect to template databases in any way, it's pretty much possible to load the postgres server with inactive replication slots. This leads me to think why one would need logical replication slots in template databases at all. Can postgres restrict logical replication slots creation in template databases? If restricted, it may deviate from the fundamental principle of template databases in the sense that everything in the template database must be copied over to the new database created using it. Is it okay to do this? Am I missing something here? Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
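To make the scenario concrete, a small sketch of what is being described, assuming a role that is allowed to connect to template1 (the slot name is arbitrary):

    \c template1
    SELECT pg_create_logical_replication_slot('slot_in_template1', 'pgoutput');
    -- the slot is tied to template1 and will normally never be consumed, so it
    -- lingers as an inactive slot and keeps holding back catalog_xmin and WAL
    SELECT slot_name, database, active FROM pg_replication_slots;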
Re: Using LibPq in TAP tests via FFI
On 2024-06-17 Mo 5:07 AM, Alvaro Herrera wrote: On 2024-Jun-16, Andrew Dunstan wrote: +sub query_oneval +{ + my $self = shift; + my $sql = shift; + my $missing_ok = shift; # default is not ok + my $conn = $self->{conn}; + my $result = PQexec($conn, $sql); + my $ok = $result && (PQresultStatus($result) == PGRES_TUPLES_OK); + unless ($ok) + { + PQclear($result) if $result; + return undef; + } + my $ntuples = PQntuples($result); + return undef if ($missing_ok && !$ntuples); + my $nfields = PQnfields($result); + die "$ntuples tuples != 1 or $nfields fields != 1" + if $ntuples != 1 || $nfields != 1; + my $val = PQgetvalue($result, 0, 0); + if ($val eq "") + { + $val = undef if PGgetisnull($result, 0, 0); + } + PQclear($result); + return $val; +} Hmm, here you use PGgetisnull, is that a typo for PQgetisnull? If it is, then I wonder why doesn't this fail in some obvious way? Is this part dead code maybe? It's not dead, just not exercised ATM. I should maybe include a test scripts for the two new modules. As you rightly suggest, it's a typo. If it had been called it would have aborted the test. +# return tuples like psql's -A -t mode. + +sub query_tuples +{ + my $self = shift; + my @results; + foreach my $sql (@_) + { + my $res = $self->query($sql); + # join will render undef as an empty string here + no warnings qw(uninitialized); + my @tuples = map { join('|', @$_); } @{$res->{rows}}; + push(@results, join("\n",@tuples)); + } + return join("\n",@results); +} You made this function join the tuples from multiple queries together, but the output format doesn't show anything for queries that return empty. I think this strategy doesn't cater for the case of comparing results from multiple queries very well, because it might lead to sets of queries that return empty result for different queries reported as identical when they aren't. Maybe add a separator line between the results from each query, when there's more than one? (Perhaps just "join('--\n', @results)" in that last line does the trick?) psql doesn't do that, and this is designed to mimic psql's behaviour. We could change that of course. I suspect none of the uses expect empty resultsets, so it's probably somewhat moot. Thanks for looking. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Introduce XID age and inactive timeout based replication slot invalidation
Hi, On Sat, Apr 13, 2024 at 9:36 AM Bharath Rupireddy wrote: > > There was a point raised by Amit > https://www.postgresql.org/message-id/CAA4eK1K8wqLsMw6j0hE_SFoWAeo3Kw8UNnMfhsWaYDF1GWYQ%2Bg%40mail.gmail.com > on when to do the XID age based invalidation - whether in checkpointer > or when vacuum is being run or whenever ComputeXIDHorizons gets called > or in autovacuum process. For now, I've chosen the design to do these > new invalidation checks in two places - 1) whenever the slot is > acquired and the slot acquisition errors out if invalidated, 2) during > checkpoint. However, I'm open to suggestions on this. Here are my thoughts on when to do the XID age invalidation. In all the patches sent so far, the XID age invalidation happens in two places - one during the slot acquisition, and another during the checkpoint. As the suggestion is to do it during the vacuum (manual and auto), so that even if the checkpoint isn't happening in the database for whatever reasons, a vacuum command or autovacuum can invalidate the slots whose XID is aged. An idea is to check for XID age based invalidation for all the slots in ComputeXidHorizons() before it reads replication_slot_xmin and replication_slot_catalog_xmin, and obviously before the proc array lock is acquired. A potential problem with this approach is that the invalidation check can become too aggressive as XID horizons are computed from many places. Another idea is to check for XID age based invalidation for all the slots in higher levels than ComputeXidHorizons(), for example in vacuum() which is an entry point for both vacuum command and autovacuum. This approach seems similar to vacuum_failsafe_age GUC which checks each relation for the failsafe age before vacuum gets triggered on it. Does anyone see any issues or risks with the above two approaches or have any other ideas? Thoughts? I attached v40 patches here. I reworded some of the ERROR messages, and did some code clean-up. Note that I haven't implemented any of the above approaches yet. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 7a920d4f1a4d6a10776ff597d6d931b10340417c Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Mon, 17 Jun 2024 06:55:27 + Subject: [PATCH v40 1/2] Add inactive_timeout based replication slot invalidation Till now, postgres has the ability to invalidate inactive replication slots based on the amount of WAL (set via max_slot_wal_keep_size GUC) that will be needed for the slots in case they become active. However, choosing a default value for max_slot_wal_keep_size is tricky. Because the amount of WAL a customer generates, and their allocated storage will vary greatly in production, making it difficult to pin down a one-size-fits-all value. It is often easy for developers to set a timeout of say 1 or 2 or 3 days, after which the inactive slots get invalidated. To achieve the above, postgres introduces a GUC allowing users set inactive timeout. The replication slots that are inactive for longer than specified amount of time get invalidated. The invalidation check happens at various locations to help being as latest as possible, these locations include the following: - Whenever the slot is acquired and the slot acquisition errors out if invalidated. 
- During checkpoint Note that this new invalidation mechanism won't kick-in for the slots that are currently being synced from the primary to the standby, because such synced slots are typically considered not active (for them to be later considered as inactive) as they don't perform logical decoding to produce the changes. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com Discussion: https://www.postgresql.org/message-id/CA%2BTgmoZTbaaEjSZUG1FL0mzxAdN3qmXksO3O9_PZhEuXTkVnRQ%40mail.gmail.com Discussion: https://www.postgresql.org/message-id/202403260841.5jcv7ihniccy%40alvherre.pgsql --- doc/src/sgml/config.sgml | 33 ++ doc/src/sgml/system-views.sgml| 7 + .../replication/logical/logicalfuncs.c| 2 +- src/backend/replication/logical/slotsync.c| 11 +- src/backend/replication/slot.c| 188 +++- src/backend/replication/slotfuncs.c | 2 +- src/backend/replication/walsender.c | 4 +- src/backend/utils/adt/pg_upgrade_support.c| 2 +- src/backend/utils/misc/guc_tables.c | 12 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/replication/slot.h| 6 +- src/test/recovery/meson.build | 1 + src/test/recovery/t/050_invalidate_slots.pl | 286 ++ 13 files changed, 535 insertions(+), 20 deletions(-) create mode 100644 src/test/recovery/t/050_invalidate_slots.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/
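As a small aside to the XID-age discussion above, the quantity such a policy would act on can already be eyeballed with existing catalog columns:

    -- age() of a slot's xmin / catalog_xmin is what an XID-age based invalidation
    -- would compare against the configured limit
    SELECT slot_name, active, age(xmin) AS xmin_age, age(catalog_xmin) AS catalog_xmin_age
    FROM pg_replication_slots;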
Re: ON ERROR in json_query and the like
Hi, On 06/17/24 02:20, Amit Langote wrote: >>>Apparently, the functions expect JSONB so that a cast is implied >>>when providing TEXT. However, the errors during that cast are >>>not subject to the ON ERROR clause. >>> >>>17beta1=# SELECT JSON_QUERY('invalid', '$' NULL ON ERROR); >>>ERROR: invalid input syntax for type json >>>DETAIL: Token "invalid" is invalid. >>>CONTEXT: JSON data, line 1: invalid >>> >>>Oracle DB and Db2 (LUW) both return NULL in that case. I wonder, could prosupport rewriting be used to detect that the first argument is supplied by a cast, and rewrite the expression to apply the cast 'softly'? Or would that behavior be too magical? Regards, -Chap
Re: SQL/JSON query functions context_item doc entry and type requirement
Hi, On 06/17/24 02:43, Amit Langote wrote: > context_item expression can be a value of > any type that can be cast to jsonb. This includes types > such as char, text, bpchar, > character varying, and bytea (with > ENCODING UTF8), as well as any domains over these types. Reading this message in conjunction with [0] makes me think that we are really talking about a function that takes a first parameter of type jsonb, and behaves exactly that way (so any cast required is applied by the system ahead of the call). Under those conditions, this seems like an unusual sentence to add in the docs, at least until we have also documented that tan's argument can be of any type that can be cast to double precision. On the other hand, if the behavior of the functions were to be changed (perhaps using prosupport rewriting as suggested in [1]?) so that it was not purely describable as a function accepting exactly jsonb with a possible system-applied cast in front, then in that case such an added explanation in the docs might be very fitting. Regards, -Chap [0] https://www.postgresql.org/message-id/CA%2BHiwqGuqLfAEP-FwW3QHByfQOoUpyj6YZG6R6bScpQswvNYDA%40mail.gmail.com [1] https://www.postgresql.org/message-id/66703054.6040109%40acm.org
Re: ON ERROR in json_query and the like
> On 17.06.2024, at 08:20, Amit Langote wrote: > > Hi, > > (apologies for not replying to this thread sooner) > > On Tue, May 28, 2024 at 6:57 PM Pavel Stehule wrote: >> út 28. 5. 2024 v 11:29 odesílatel Markus Winand >> napsal: >>> >>> Hi! >>> >>> I’ve noticed two “surprising” (to me) behaviors related to >>> the “ON ERROR” clause of the new JSON query functions in 17beta1. >>> >>> 1. JSON parsing errors are not subject to ON ERROR >>> Apparently, the functions expect JSONB so that a cast is implied >>> when providing TEXT. However, the errors during that cast are >>> not subject to the ON ERROR clause. >>> >>> 17beta1=# SELECT JSON_QUERY('invalid', '$' NULL ON ERROR); >>> ERROR: invalid input syntax for type json >>> DETAIL: Token "invalid" is invalid. >>> CONTEXT: JSON data, line 1: invalid >>> >>> Oracle DB and Db2 (LUW) both return NULL in that case. >>> >>> I had a look on the list archive to see if that is intentional but >>> frankly speaking these functions came a long way. In case it is >>> intentional it might be worth adding a note to the docs. >> >> >> I remember a talk about this subject years ago. Originally the JSON_QUERY >> was designed in similar like Oracle, and casting to jsonb was done inside. >> If I remember this behave depends on the fact, so old SQL/JSON has not json >> type and it was based just on processing of plain text. But Postgres has >> JSON, and JSONB and then was more logical to use these types. And because >> the JSON_QUERY uses these types, and the casting is done before the >> execution of the function, then the clause ON ERROR cannot be handled. >> Moreover, until soft errors Postgres didn't allow handling input errors in >> common functions. >> >> I think so this difference should be mentioned in documentation. > > Agree that the documentation needs to be clear about this. I'll update > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query > Functions. Considering another branch of this thread [1] I think the "Supported Features” appendix of the docs should mention that as well. The way I see it is that the standards defines two overloaded JSON_QUERY functions, of which PostgreSQL will support only one. In case of valid JSON, the implied CAST makes it look as though the second variant of these function was supported as well but that illusion totally falls apart once the JSON is not valid anymore. I think it affects the following feature IDs: - T821, Basic SQL/JSON query operators For JSON_VALUE, JSON_TABLE and JSON_EXISTS - T828, JSON_QUERY Also, how hard would it be to add the functions that accept character strings? Is there, besides the effort, any thing else against it? I’m asking because I believe once released it might never be changed — for backward compatibility. [1] https://www.postgresql.org/message-id/CAKFQuwb50BAaj83Np%2B1O6xe3_T6DO8w2mxtFbgSbbUng%2BabrqA%40mail.gmail.com > >>> 2. EMPTY [ARRAY|OBJECT] ON ERROR implies ERROR ON EMPTY >>> >>> 17beta1=# SELECT JSON_QUERY('[]', '$[*]' EMPTY ARRAY ON ERROR) a; >>>a >>> >>>[] >>> (1 row) >>> >>> As NULL ON EMPTY is implied, it should give the same result as >>> explicitly adding NULL ON EMPTY: >>> >>> 17beta1=# SELECT JSON_QUERY('[]', '$[*]' NULL ON EMPTY EMPTY ARRAY ON >>> ERROR) a; >>>a >>> --- >>> >>> (1 row) >>> >>> Interestingly, Oracle DB gives the same (wrong) results. Db2 (LUW) >>> on the other hand returns NULL for both queries. 
>>> >>> I don’t think that PostgreSQL should follow Oracle DB's suit here >>> but again, in case this is intentional it should be made explicit >>> in the docs. > > This behavior is a bug and result of an unintentional change that I > made at some point after getting involved with this patch set. So I'm > going to fix this so that the empty results of jsonpath evaluation use > NULL ON EMPTY by default, ie, when the ON EMPTY clause is not present. > Attached a patch to do so. > Tested: works. Thanks :) -markus
Re: DROP OWNED BY fails to clean out pg_init_privs grants
> On 15 Jun 2024, at 01:46, Tom Lane wrote: > > Robert Haas writes: >> I think the only thing we absolutely have to fix here is the dangling >> ACL references. > > Here's a draft patch that takes care of Hannu's example, and I think > it fixes other potential dangling-reference scenarios too. > > I'm not sure whether I like the direction this is going, but I do > think we may not be able to do a lot more than this for v17. Agreed. > The core change is to install SHARED_DEPENDENCY_INITACL entries in > pg_shdepend for all references to non-pinned roles in pg_init_privs, > whether they are the object's owner or not. Doing that ensures that > we can't drop a role that is still referenced there, and it provides > a basis for shdepDropOwned and shdepReassignOwned to take some kind > of action for such references. I wonder if this will break any tools/scripts in prod which relies on the previous (faulty) behaviour. It will be interesting to see if anything shows up on -bugs. Off the cuff it seems like a good idea judging by where we are and what we can fix with it. > I'm not terribly thrilled with this, because it's still possible > to get into a state where a pg_init_privs entry is based on > an owning role that's no longer the owner: you just have to use > ALTER OWNER directly rather than via REASSIGN OWNED. While > I don't think the backend has much problem with that, it probably > will confuse pg_dump to some extent. However, such cases are > going to confuse pg_dump anyway for reasons discussed upthread, > namely that we don't really support dump/restore of extensions > where not all the objects are owned by the extension owner. > I'm content to leave that in the pile of unfinished work for now. +1 > An alternative definition could be that ALTER OWNER also replaces > old role with new in pg_init_privs entries. That would reduce > the surface for confusing pg_dump a little bit, but I don't think > that I like it better, for two reasons: > > * ALTER OWNER would have to probe pg_init_acl for potential > entries every time, which would be wasted work for most ALTERs. Unless it would magically fix all the pg_dump problems I'd prefer to avoid this. > * As this stands, updateInitAclDependencies() no longer pays any > attention to its ownerId argument, and in one place I depend on > that to skip doing a rather expensive lookup of the current object > owner. Perhaps we should remove that argument altogether, and > in consequence simplify some other callers too? However, that > would only help much if we were willing to revert 534287403's > changes to pass the object's owner ID to recordExtensionInitPriv, > which I'm hesitant to do because I suspect we'll end up wanting > to record the owner ID explicitly in pg_init_privs entries. > On the third hand, maybe we'll never do that, so perhaps we should > revert those changes for now; some of them add nontrivial lookup > costs. I wonder if it's worth reverting passing the owner ID for v17 and revisiting that in 18 if we work on recording the ID. Shaving a few catalog lookups is generally worthwhile, doing them without needing the result for the next five years might bite us. Re-reading 534287403 I wonder about this hunk in RemoveRoleFromInitPriv: + if (!isNull) + old_acl = DatumGetAclPCopy(oldAclDatum); + else + old_acl = NULL; /* this case shouldn't happen, probably */ I wonder if we should Assert() on old_acl being NULL? I can't imagine a case where it should legitimately be that and catching such in development might be useful for catching stray bugs? 
> * In shdepReassignOwned, I refactored to move the switch on > sdepForm->classid into a new subroutine. We could have left > it where it is, but it would need a couple more tab stops of > indentation which felt like too much. It's in the eye of > the beholder though. I prefer the new way. +1 on going ahead with this patch. There is more work to do but I agree that this about all that makes sense in v17 at this point. -- Daniel Gustafsson
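For concreteness, the Assert() suggestion above would reduce to something like the following inside RemoveRoleFromInitPriv (an illustrative sketch, not a tested patch):

    /*
     * Sketch of the suggested tightening: insist that a pg_init_privs ACL
     * is never NULL here rather than silently carrying a NULL forward.
     */
    Assert(!isNull);
    old_acl = DatumGetAclPCopy(oldAclDatum);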
Re: RFC: adding pytest as a supported test framework
On Sun, 16 Jun 2024 at 23:04, Robert Haas wrote: > > On Sat, Jun 15, 2024 at 6:00 PM Melanie Plageman > wrote: > > Writing a new test framework in a popular language that makes it more > > likely that more people will write more tests and test infrastructure > > is such a completely different thing than suggesting we rewrite > > Postgres in Rust that I feel that this comparison is unfair and, > > frankly, a distraction from the discussion at hand. > > I don't really agree with this. > > it's not *as* problematic if some tests are > written in one language and others in another as it would be if > different parts of the backend used different languages, and it > wouldn't be *as* hard if at some point we decided we wanted to convert > all remaining code to the new language. Honestly, it sounds like you actually do agree with each other. It seems you interpreted Melanie's use of "thing" as "people wanting to use Rust/Python in the Postgres codebase", while I believe she probably meant "all the problems and effort involved in the task of making that possible". And afaict from your response, you definitely agree that making it possible to use Rust in our main codebase is a lot more difficult than for Python for our testing code. > But, would I feel respected and valued as a participant in > that project? Would I have to use weird tools and follow arcane and > frustrating processes? If I did, *that* would make me give up. I don't > want to say that the choice of programming language doesn't matter at > all, but it seems to me that it might matter more because it's a > symptom of being unwilling to modernize things rather than for its own > sake. I can personally definitely relate to this (although I wouldn't frame it as strongly as you did). Postgres development definitely requires weird tools and arcane processes (imho) when compared to most other open source projects. The elephant in the room is of course the mailing list development flow. But we have some good reasons for using that[^1]. But most people have some limit on the amount of weirdness they are willing to accept when wanting to contribute, and the mailing list pushes us quite close to that limit for a bunch of people already. Any additional weird tools/arcane processes might push some people over that limit. We've definitely made big improvements in modernizing our development workflow over the last few years though: We now have CI (cfbot), a modern build system (meson), and working autoformatting (requiring pgindent on commit). These improvements have been very noticeable to me, and I think we should continue such efforts. I think allowing people to write tests in Python is one of the easier improvements that we can make. [^1]: Although I think those reasons apply much less to the documentation, maybe we could allow github contributions for just those.
Re: Using LibPq in TAP tests via FFI
On 2024-Jun-17, Andrew Dunstan wrote: > On 2024-06-17 Mo 5:07 AM, Alvaro Herrera wrote: > > You made this function join the tuples from multiple queries together, > > but the output format doesn't show anything for queries that return > > empty. I think this strategy doesn't cater for the case of comparing > > results from multiple queries very well, because it might lead to sets > > of queries that return empty result for different queries reported as > > identical when they aren't. Maybe add a separator line between the > > results from each query, when there's more than one? (Perhaps just > > "join('--\n', @results)" in that last line does the trick?) > > psql doesn't do that, and this is designed to mimic psql's behaviour. We > could change that of course. I suspect none of the uses expect empty > resultsets, so it's probably somewhat moot. True -- I guess my comment should really be directed to the original coding of the test in test_index_replay. I think adding the separator line makes it more trustworthy. Probably you're right that the current code of in-core tests don't care about this, but if we export this technique to the world, I'm sure somebody somewhere is going to care. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "The Gord often wonders why people threaten never to come back after they've been told never to return" (www.actsofgord.com)
Re: ON ERROR in json_query and the like
po 17. 6. 2024 v 15:07 odesílatel Markus Winand napsal: > > > On 17.06.2024, at 08:20, Amit Langote wrote: > > > > Hi, > > > > (apologies for not replying to this thread sooner) > > > > On Tue, May 28, 2024 at 6:57 PM Pavel Stehule > wrote: > >> út 28. 5. 2024 v 11:29 odesílatel Markus Winand < > markus.win...@winand.at> napsal: > >>> > >>> Hi! > >>> > >>> I’ve noticed two “surprising” (to me) behaviors related to > >>> the “ON ERROR” clause of the new JSON query functions in 17beta1. > >>> > >>> 1. JSON parsing errors are not subject to ON ERROR > >>> Apparently, the functions expect JSONB so that a cast is implied > >>> when providing TEXT. However, the errors during that cast are > >>> not subject to the ON ERROR clause. > >>> > >>> 17beta1=# SELECT JSON_QUERY('invalid', '$' NULL ON ERROR); > >>> ERROR: invalid input syntax for type json > >>> DETAIL: Token "invalid" is invalid. > >>> CONTEXT: JSON data, line 1: invalid > >>> > >>> Oracle DB and Db2 (LUW) both return NULL in that case. > >>> > >>> I had a look on the list archive to see if that is intentional but > >>> frankly speaking these functions came a long way. In case it is > >>> intentional it might be worth adding a note to the docs. > >> > >> > >> I remember a talk about this subject years ago. Originally the > JSON_QUERY was designed in similar like Oracle, and casting to jsonb was > done inside. If I remember this behave depends on the fact, so old SQL/JSON > has not json type and it was based just on processing of plain text. But > Postgres has JSON, and JSONB and then was more logical to use these types. > And because the JSON_QUERY uses these types, and the casting is done before > the execution of the function, then the clause ON ERROR cannot be handled. > Moreover, until soft errors Postgres didn't allow handling input errors in > common functions. > >> > >> I think so this difference should be mentioned in documentation. > > > > Agree that the documentation needs to be clear about this. I'll update > > my patch at [1] to add a note next to table 9.16.3. SQL/JSON Query > > Functions. > > Considering another branch of this thread [1] I think the > "Supported Features” appendix of the docs should mention that as well. > > The way I see it is that the standards defines two overloaded > JSON_QUERY functions, of which PostgreSQL will support only one. > In case of valid JSON, the implied CAST makes it look as though > the second variant of these function was supported as well but that > illusion totally falls apart once the JSON is not valid anymore. > > I think it affects the following feature IDs: > > - T821, Basic SQL/JSON query operators > For JSON_VALUE, JSON_TABLE and JSON_EXISTS > - T828, JSON_QUERY > > Also, how hard would it be to add the functions that accept > character strings? Is there, besides the effort, any thing else > against it? I’m asking because I believe once released it might > never be changed — for backward compatibility. > It is easy to add the function that accepts text, but when you have overloaded functions, then varchar or text is the preferred type, and then the arguments will be casted to text by default instead of json. You can have one function with argument of type "any", but then the execution is a little bit slower (outer cast is faster than cast inside function), and again the Postgres cannot deduce used argument types from function's argument types. Probably this can be solved if we introduce a new kind of type, where the preferred type will be json, or jsonb. 
So the problem lies in the implementation details of the type-deduction mechanism (whether you use a string literal or a string expression). Today, if you write json_table(x1 || x2) and x1 and x2 are of unknown type, Postgres can infer that x1 and x2 should be jsonb; but if there were a second function json_table(text), Postgres would detect an ambiguity and fall back to the preferred type (which is text). Generally, Postgres supports function overloading and it works well between text, int and numeric, where constants have different syntax; but when the constant literal has the same syntax there can be problems with hidden casts to an unwanted type, so a non-overloaded function is ideal. These issues could be solved in the analysis stage by special code, but again it increases code complexity. > > [1] > https://www.postgresql.org/message-id/CAKFQuwb50BAaj83Np%2B1O6xe3_T6DO8w2mxtFbgSbbUng%2BabrqA%40mail.gmail.com > > > > > >>> 2. EMPTY [ARRAY|OBJECT] ON ERROR implies ERROR ON EMPTY > >>> > >>> 17beta1=# SELECT JSON_QUERY('[]', '$[*]' EMPTY ARRAY ON ERROR) a; > >>>a > >>> > >>>[] > >>> (1 row) > >>> > >>> As NULL ON EMPTY is implied, it should give the same result as > >>> explicitly adding NULL ON EMPTY: > >>> > >>> 17beta1=# SELECT JSON_QUERY('[]', '$[*]' NULL ON EMPTY EMPTY ARRAY > ON ERROR) a; > >>>a > >>> --- > >>> > >>> (1 row) > >>> >
Re: Using LibPq in TAP tests via FFI
On 2024-06-16 Su 6:38 PM, Andres Freund wrote: Hi, On 2024-06-16 17:43:05 -0400, Andrew Dunstan wrote: On Fri, Jun 14, 2024 at 12:25 PM Andres Freund wrote: I guess it's a question of how widely available FFI::Platypus is. I know it's available pretty much out of the box on Strawberry Perl and Msys2' ucrt perl. FWIW I hacked a bit on CI, trying to make it work. Took a bit, partially because CI uses an older strawberry perl without FFI::Platypus. And FFI::Platypus didn't build with that. Updating that to 5.38 causes some complaints about LANG that I haven't hunted down, just avoided by unsetting LANG. As-is your patch didn't work, because it has "systempath => []", which caused libpq to not load, because it depended on things in the system path... What's the reason for that? Not sure, that code was written months ago. I just checked the FFI::CheckLib code and libpath is searched before systempath, so there shouldn't be any reason not to use the default load path. After commenting that out, all but one tests passed: [20:21:31.137] - 8< - [20:21:31.137] stderr: [20:21:31.137] # Failed test 'psql connect success' [20:21:31.137] # at C:/cirrus/src/test/recovery/t/041_checkpoint_at_promote.pl line 161. [20:21:31.137] # got: '2' [20:21:31.137] # expected: '0' [20:21:31.137] # Failed test 'psql select 1' [20:21:31.137] # at C:/cirrus/src/test/recovery/t/041_checkpoint_at_promote.pl line 162. [20:21:31.137] # got: '' [20:21:31.137] # expected: '1' [20:21:31.137] # Looks like you failed 2 tests of 6. [20:21:31.137] [20:21:31.137] (test program exited with status code 2) [20:21:31.137] -- [20:21:31.137] Yeah, the recovery tests were using poll_query_until in a rather funky way. That's fixed in this latest version. I agree with you that falling back on BackgroundPsql is not a terribly satisfactory solution. I'm somewhat doubtful we'll just agree on making FFI::Platypus a hard dependency, but if we agree to do so... Maybe not. If so your other suggestion of a small XS wrapper might make sense. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl index 9de3148277..c259d2ccfd 100644 --- a/contrib/amcheck/t/001_verify_heapam.pl +++ b/contrib/amcheck/t/001_verify_heapam.pl @@ -5,6 +5,7 @@ use strict; use warnings FATAL => 'all'; use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Session; use PostgreSQL::Test::Utils; use Test::More; @@ -18,7 +19,9 @@ $node = PostgreSQL::Test::Cluster->new('test'); $node->init; $node->append_conf('postgresql.conf', 'autovacuum=off'); $node->start; -$node->safe_psql('postgres', q(CREATE EXTENSION amcheck)); +my $session = PostgreSQL::Test::Session->new(node => $node); + +$session->do(q(CREATE EXTENSION amcheck)); # # Check a table with data loaded but no corruption, freezing, etc. 
@@ -49,7 +52,7 @@ detects_heap_corruption( # Check a corrupt table with all-frozen data # fresh_test_table('test'); -$node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test)); +$session->do(q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test)); detects_no_corruption("verify_heapam('test')", "all-frozen not corrupted table"); corrupt_first_page('test'); @@ -81,7 +84,7 @@ sub relation_filepath my ($relname) = @_; my $pgdata = $node->data_dir; - my $rel = $node->safe_psql('postgres', + my $rel = $session->query_oneval( qq(SELECT pg_relation_filepath('$relname'))); die "path not found for relation $relname" unless defined $rel; return "$pgdata/$rel"; @@ -92,8 +95,8 @@ sub get_toast_for { my ($relname) = @_; - return $node->safe_psql( - 'postgres', qq( + return $session->query_oneval( + qq( SELECT 'pg_toast.' || t.relname FROM pg_catalog.pg_class c, pg_catalog.pg_class t WHERE c.relname = '$relname' @@ -105,8 +108,8 @@ sub fresh_test_table { my ($relname) = @_; - return $node->safe_psql( - 'postgres', qq( + return $session->do( + qq( DROP TABLE IF EXISTS $relname CASCADE; CREATE TABLE $relname (a integer, b text); ALTER TABLE $relname SET (autovacuum_enabled=false); @@ -130,8 +133,8 @@ sub fresh_test_sequence { my ($seqname) = @_; - return $node->safe_psql( - 'postgres', qq( + return $session->do( + qq( DROP SEQUENCE IF EXISTS $seqname CASCADE; CREATE SEQUENCE $seqname INCREMENT BY 13 @@ -147,8 +150,8 @@ sub advance_test_sequence { my ($seqname) = @_; - return $node->safe_psql( - 'postgres', qq( + return $session->query_oneval( + qq( SELECT nextval('$seqname'); )); } @@ -158,7 +161,7 @@ sub set_test_sequence { my ($seqname) = @_; - return $node->safe_psql( + return $session->query_oneval( 'postgres', qq( SELECT setval('$seqname', 102); )); @@ -169,8 +172,8 @@ su
Re: Direct SSL connection and ALPN loose ends
On Mon, Apr 29, 2024 at 8:24 AM Heikki Linnakangas wrote: > I was mostly worried about the refactoring of the > retry logic in libpq (and about the pre-existing logic too to be honest, > it was complicated before these changes already). Some changes in the v17 negotiation fallback order caught my eye: 1. For sslmode=prefer, a modern v3 error during negotiation now results in a fallback to plaintext. For v16 this resulted in an immediate failure. (v2 errors retain the v16 behavior.) 2. For gssencmode=prefer, a legacy v2 error during negotiation now results in an immediate failure. In v16 it allowed fallback to SSL or plaintext depending on sslmode. Are both these changes intentional/desirable? Change #1 seems to partially undo the decision made in a49fbaaf: > Don't assume that "E" response to NEGOTIATE_SSL_CODE means pre-7.0 server. > > These days, such a response is far more likely to signify a server-side > problem, such as fork failure. [...] > > Hence, it seems best to just eliminate the assumption that backing off > to non-SSL/2.0 protocol is the way to recover from an "E" response, and > instead treat the server error the same as we would in non-SSL cases. Thanks, --Jacob
Inconsistency between try_mergejoin_path and create_mergejoin_plan
Hi. There's the following inconsistency between try_mergejoin_path() and create_mergejoin_plan(). When clause operator has no commutator, we can end up with mergejoin path. Later create_mergejoin_plan() will call get_switched_clauses(). This function can error out with ERROR: could not find commutator for operator XXX The similar behavior seems to be present also for hash join. Attaching a test case (in patch) and a possible fix. -- Best regards, Alexander Pyhalov, Postgres ProfessionalFrom ea9497c9f62b3613482d5b267c7907070ba8fcd4 Mon Sep 17 00:00:00 2001 From: Alexander Pyhalov Date: Mon, 17 Jun 2024 10:33:10 +0300 Subject: [PATCH] Merge and hash joins need more checks with custom operators --- src/backend/optimizer/path/joinpath.c| 46 +++- src/test/regress/expected/equivclass.out | 45 +++ src/test/regress/sql/equivclass.sql | 20 +++ 3 files changed, 110 insertions(+), 1 deletion(-) diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c index 5be8da9e095..dae06117569 100644 --- a/src/backend/optimizer/path/joinpath.c +++ b/src/backend/optimizer/path/joinpath.c @@ -95,7 +95,7 @@ static void generate_mergejoin_paths(PlannerInfo *root, Path *inner_cheapest_total, List *merge_pathkeys, bool is_partial); - +static bool all_clauses_compatible(Relids outerrelids, List *clauses); /* * add_paths_to_joinrel @@ -972,6 +972,10 @@ try_mergejoin_path(PlannerInfo *root, return; } + /* Sometimes can't create merjejoin plan if we have non-commutable clauses */ + if (!all_clauses_compatible(outer_path->parent->relids, mergeclauses)) + return; + /* * If the given paths are already well enough ordered, we can skip doing * an explicit sort. @@ -1036,6 +1040,10 @@ try_partial_mergejoin_path(PlannerInfo *root, { JoinCostWorkspace workspace; + /* Sometimes can't create merjejoin plan if we have non-commutable clauses */ + if (!all_clauses_compatible(outer_path->parent->relids, mergeclauses)) + return; + /* * See comments in try_partial_hashjoin_path(). */ @@ -1129,6 +1137,10 @@ try_hashjoin_path(PlannerInfo *root, return; } + /* Sometimes can't create hashjoin plan if we have non-commutable clauses */ + if (!all_clauses_compatible(outer_path->parent->relids, hashclauses)) + return; + /* * See comments in try_nestloop_path(). Also note that hashjoin paths * never have any output pathkeys, per comments in create_hashjoin_path. @@ -2431,3 +2443,35 @@ select_mergejoin_clauses(PlannerInfo *root, return result_list; } + +/* + * all_clauses_compatible + * + * Determine that all clauses' operators have commutator + * when necessary. While constructing merge join plan or + * hash join plan, optimizer can call get_switched_clauses(), + * which can error out if clauses are not commutable. + * The logic here should match one in get_switched_clauses(). 
+ */ +static bool +all_clauses_compatible(Relids outer_relids, List *clauses) +{ + ListCell *l; + + foreach(l, clauses) + { + RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(l); + OpExpr *clause = (OpExpr *) restrictinfo->clause; + Oid opoid; + + if (bms_is_subset(restrictinfo->right_relids, outer_relids)) + { + opoid = get_commutator(clause->opno); + + if (!OidIsValid(opoid)) +return false; + } + } + + return true; +} diff --git a/src/test/regress/expected/equivclass.out b/src/test/regress/expected/equivclass.out index 126f7047fed..c8c85fd6c2d 100644 --- a/src/test/regress/expected/equivclass.out +++ b/src/test/regress/expected/equivclass.out @@ -323,6 +323,51 @@ explain (costs off) Index Cond: (ff = '42'::bigint) (19 rows) +-- merge join should check if conditions are commutable +explain (costs off) +with ss1 as materialized ( + select ff + 1 as x from + (select ff + 2 as ff from ec1 +union all +select ff + 3 as ff from ec1) ss0 + union all + select ff + 4 as x from ec1 +) + select * from ec1, +(select ff + 1 as x from + (select ff + 2 as ff from ec1 +union all +select ff + 3 as ff from ec1) ss0 + union all + select ff + 4 as x from ec1) as ss2, + ss1 + where ss1.x = ec1.f1 and ss1.x = ss2.x and ec1.ff = 42::int8; +QUERY PLAN +--- + Merge Join + Merge Cond: ec1_1.ff + 2) + 1)) = ss1.x) + CTE ss1 + -> Append + -> Seq Scan on ec1 ec1_4 + -> Seq Scan on ec1 ec1_5 + -> Seq Scan on ec1 ec1_6 + -> Merge Append + Sort Key: (((ec1_1.ff + 2) + 1)) + -> Index Scan using ec1_expr2 on ec1 ec1_1 + -> Index Scan using ec1_expr3 on ec1 ec1_2 + -> Index Scan using ec1_expr4 on ec1 ec1_3 + -> Materialize + -> Merge Join + Merge Cond: (ss1.x = ec1.f1) + -> Sort +
Re: Show WAL write and fsync stats in pg_stat_io
On Thu, Jun 13, 2024 at 5:24 AM Nazir Bilal Yavuz wrote: > > On Sun, 9 Jun 2024 at 18:05, Nitin Jadhav > wrote: > > > > > If possible, let's have all the I/O stats (even for WAL) in > > > pg_stat_io. Can't we show the WAL data we get from buffers in the hits > > > column and then have read_bytes or something like that to know the > > > amount of data read? > > > > The ‘hits’ column in ‘pg_stat_io’ is a vital indicator for adjusting a > > database. It signifies the count of cache hits, or in other words, the > > instances where data was located in the ‘shared_buffers’. As a result, > > keeping an eye on the ‘hits’ column in ‘pg_stat_io’ can offer useful > > knowledge about the buffer cache’s efficiency and assist users in > > making educated choices when fine-tuning their database. However, if > > we include the hit count of WAL buffers in this, it may lead to > > misleading interpretations for database tuning. If there’s something > > I’ve overlooked that’s already been discussed, please feel free to > > correct me. > > I think counting them as a hit makes sense. We read data from WAL > buffers instead of reading them from disk. And, WAL buffers are stored > in shared memory so I believe they can be counted as hits in the > shared buffers. Could you please explain how this change can 'lead to > misleading interpretations for database tuning' a bit more? Perhaps Nitin was thinking of a scenario in which WAL hits are counted as hits on the same IOObject as shared buffer hits. Since this thread has been going on for awhile and we haven't recently had a schema overview, I could understand if there was some confusion. For clarity, I will restate that the current proposal is to count WAL buffer hits for IOObject WAL, which means they will not be mixed in with shared buffer hits. And I think it makes sense to count WAL IOObject hits since increasing wal_buffers can lead to more hits, right? - Melanie
Re: DROP OWNED BY fails to clean out pg_init_privs grants
Daniel Gustafsson writes: > On 15 Jun 2024, at 01:46, Tom Lane wrote: >> The core change is to install SHARED_DEPENDENCY_INITACL entries in >> pg_shdepend for all references to non-pinned roles in pg_init_privs, >> whether they are the object's owner or not. Doing that ensures that >> we can't drop a role that is still referenced there, and it provides >> a basis for shdepDropOwned and shdepReassignOwned to take some kind >> of action for such references. > I wonder if this will break any tools/scripts in prod which relies on the > previous (faulty) behaviour. It will be interesting to see if anything shows > up on -bugs. Off the cuff it seems like a good idea judging by where we are > and what we can fix with it. Considering that SHARED_DEPENDENCY_INITACL has existed for less than two months, it's hard to believe that any outside code has grown any dependencies on it, much less that it couldn't be adjusted readily. > I wonder if it's worth reverting passing the owner ID for v17 and revisiting > that in 18 if we work on recording the ID. Shaving a few catalog lookups is > generally worthwhile, doing them without needing the result for the next five > years might bite us. Yeah, that was the direction I was leaning in, too. I'll commit the revert of that separately, so that un-reverting it shouldn't be too painful if we eventually decide to do so. > Re-reading 534287403 I wonder about this hunk in RemoveRoleFromInitPriv: > + if (!isNull) > + old_acl = DatumGetAclPCopy(oldAclDatum); > + else > + old_acl = NULL; /* this case shouldn't > happen, probably */ > I wonder if we should Assert() on old_acl being NULL? I can't imagine a case > where it should legitimately be that and catching such in development might be > useful for catching stray bugs? Hmm, yeah. I was trying to be agnostic about whether it's okay for a pg_init_privs ACL to be NULL ... but I can't imagine a real use for that either, and the new patch does add some code that's effectively assuming it isn't. Agreed, let's be uniform about insisting !isNull. > +1 on going ahead with this patch. There is more work to do but I agree that > this about all that makes sense in v17 at this point. Thanks for reviewing! regards, tom lane
Re: Separate HEAP WAL replay logic into its own file
On Mon, Jun 17, 2024 at 2:20 AM Li, Yong wrote: > > Hi PostgreSQL hackers, > > For most access methods in PostgreSQL, the implementation of the access > method itself and the implementation of its WAL replay logic are organized in > separate source files. However, the HEAP access method is an exception. > Both the access method and the WAL replay logic are collocated in the same > heapam.c. To follow the pattern established by other access methods and to > improve maintainability, I made the enclosed patch to separate HEAP’s replay > logic into its own file. The changes are straightforward. Move the replay > related functions into the new heapam_xlog.c file, push the common > heap_execute_freeze_tuple() helper function into the heapam.h header, and > adjust the build files. I'm not against this change, but I am curious at what inspired this. Were you looking at Postgres code and simply noticed that there isn't a heapam_xlog.c (like there is a nbtxlog.c etc) and thought that you wanted to change that? Or is there some specific reason this would help you as a Postgres developer, user, or ecosystem member? - Melanie
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Mon, Jun 17, 2024 at 05:55:04PM +0530, Bharath Rupireddy wrote: > Here are my thoughts on when to do the XID age invalidation. In all > the patches sent so far, the XID age invalidation happens in two > places - one during the slot acquisition, and another during the > checkpoint. As the suggestion is to do it during the vacuum (manual > and auto), so that even if the checkpoint isn't happening in the > database for whatever reasons, a vacuum command or autovacuum can > invalidate the slots whose XID is aged. +1. IMHO this is a principled choice. The similar max_slot_wal_keep_size parameter is considered where it arguably matters most: when we are trying to remove/recycle WAL segments. Since this parameter is intended to prevent the server from running out of space, it makes sense that we'd apply it at the point where we are trying to free up space. The proposed max_slot_xid_age parameter is intended to prevent the server from running out of transaction IDs, so it follows that we'd apply it at the point where we reclaim them, which happens to be vacuum. > An idea is to check for XID age based invalidation for all the slots > in ComputeXidHorizons() before it reads replication_slot_xmin and > replication_slot_catalog_xmin, and obviously before the proc array > lock is acquired. A potential problem with this approach is that the > invalidation check can become too aggressive as XID horizons are > computed from many places. > > Another idea is to check for XID age based invalidation for all the > slots in higher levels than ComputeXidHorizons(), for example in > vacuum() which is an entry point for both vacuum command and > autovacuum. This approach seems similar to vacuum_failsafe_age GUC > which checks each relation for the failsafe age before vacuum gets > triggered on it. I don't presently have any strong opinion on where this logic should go, but in general, I think we should only invalidate slots if invalidating them would allow us to advance the vacuum cutoff. If the cutoff is held back by something else, I don't see a point in invalidating slots because we'll just be breaking replication in return for no additional reclaimed transaction IDs. -- nathan
Re: Allow logical failover slots to wait on synchronous replication
Hi, On Tue, Jun 11, 2024 at 10:00:46AM +, Bertrand Drouvot wrote: > Hi, > > On Mon, Jun 10, 2024 at 09:25:10PM -0500, Nathan Bossart wrote: > > On Mon, Jun 10, 2024 at 03:51:05PM -0700, John H wrote: > > > The existing 'standby_slot_names' isn't great for users who are running > > > clusters with quorum-based synchronous replicas. For instance, if > > > the user has synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' it's a > > > bit tedious to have to reconfigure the standby_slot_names to set it to > > > the most updated 3 sync replicas whenever different sync replicas start > > > lagging. In the event that both GUCs are set, 'standby_slot_names' takes > > > precedence. > > > > Hm. IIUC you'd essentially need to set standby_slot_names to "A,B,C,D,E" > > to get the desired behavior today. That might ordinarily be okay, but it > > could cause logical replication to be held back unnecessarily if one of the > > replicas falls behind for whatever reason. A way to tie standby_slot_names > > to synchronous replication instead does seem like it would be useful in > > this case. > > FWIW, I have the same understanding and also think your proposal would be > useful in this case. A few random comments about v1: 1 + int mode = SyncRepWaitMode; It's set to SyncRepWaitMode and then never change. Worth to get rid of "mode"? 2 + static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE] = {InvalidXLogRecPtr}; I did some testing and saw that the lsn[] values were not always set to InvalidXLogRecPtr right after. It looks like that, in that case, we should avoid setting the lsn[] values at compile time. Then, what about? 1. remove the "static". or 2. keep the static but set the lsn[] values after its declaration. 3 - /* -* Return false if not all the standbys have caught up to the specified -* WAL location. -*/ - if (caught_up_slot_num != standby_slot_names_config->nslotnames) - return false; + if (!XLogRecPtrIsInvalid(lsn[mode]) && lsn[mode] >= wait_for_lsn) + return true; lsn[] values are/(should have been, see 2 above) just been initialized to InvalidXLogRecPtr so that XLogRecPtrIsInvalid(lsn[mode]) will always return true. I think this check is not needed then. 4 > > > I did some very brief pgbench runs to compare the latency. Client instance > > > was running pgbench and 10 logical clients while the Postgres box hosted > > > the writer and 5 synchronous replicas. > > > There's a hit to TPS Out of curiosity, did you compare with standby_slot_names_from_syncrep set to off and standby_slot_names not empty? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
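For point 2 above, option 1 (dropping the static qualifier) might look roughly like this; the array and constant names are taken from the patch under review, and this is only a sketch:

    /* A plain local array, explicitly initialized on every call. */
    XLogRecPtr  lsn[NUM_SYNC_REP_WAIT_MODE];

    for (int i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
        lsn[i] = InvalidXLogRecPtr;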
Re: POC, WIP: OR-clause support for indexes
On 17.06.2024 15:11, Alexander Korotkov wrote: On Mon, Jun 17, 2024 at 1:33 PM Alena Rybakina wrote: I noticed that 7 libraries have been added to src/backend/optimizer/plan/initsplan.c, and as far as I remember, Tom Lane has already expressed doubts about the approach that requires adding a large number of libraries [0], but I'm afraid I'm out of ideas about an alternative approach. Thank you for pointing this out. Right, the number of extra headers included was one of the points of criticism on this patch. I'll look to move this functionality elsewhere, while the stage of transformation could probably be the same. Yes, I think so. In addition, I checked the fix in the previous cases that you wrote earlier [1] and noticed that a SeqScan continues to be generated, unfortunately, without converting expressions: I've rechecked and see I made a wrong conclusion about this. The plan regression is still here. But I'm still looking to work around this without an extra GUC. I think we need to additionally do something like [1], but take further steps to avoid planning overhead when not necessary. I agree with you that we should reconsider this place in detail once again, because otherwise it looks like we're likely to run into a performance issue. In particular, I think we should only consider splitting SAOP for bitmap OR in the following cases: 1. There are partial indexes with predicates over the target column. Frankly, I see that we will need to split the SAOP anyway to check it, right? 2. There are multiple indexes covering the target column and different subsets of other columns presented in restrictions. I see two cases in one. First, we need to check whether there is an index for the columns specified in the restrictlist, and secondly, the index ranges for which the conditions fall into the "OR" expressions. 3. There are indexes covering the target column without support for SAOP (amsearcharray == false). Hopefully this should skip generation of useless bitmap paths in the majority of cases. Thoughts? I'm not sure I fully understand how useful this can be. Could you explain it to me in more detail? Links. 1. https://www.postgresql.org/message-id/67bd918d-285e-44d2-a207-f52d9a4c35e6%40postgrespro.ru -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 1:42 AM Amit Kapila wrote: > The difference w.r.t the existing mechanisms for holding deleted data > is that we don't know whether we need to hold off the vacuum from > cleaning up the rows because we can't say with any certainty whether > other nodes will perform any conflicting operations in the future. > Using the example we discussed, > Node A: > T1: INSERT INTO t (id, value) VALUES (1,1); > T2: DELETE FROM t WHERE id = 1; > > Node B: > T3: UPDATE t SET value = 2 WHERE id = 1; > > Say the order of receiving the commands is T1-T2-T3. We can't predict > whether we will ever get T-3, so on what basis shall we try to prevent > vacuum from removing the deleted row? The problem arises because T2 and T3 might be applied out of order on some nodes. Once either one of them has been applied on every node, no further conflicts are possible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Direct SSL connection and ALPN loose ends
On 17/06/2024 17:11, Jacob Champion wrote: On Mon, Apr 29, 2024 at 8:24 AM Heikki Linnakangas wrote: I was mostly worried about the refactoring of the retry logic in libpq (and about the pre-existing logic too to be honest, it was complicated before these changes already). Some changes in the v17 negotiation fallback order caught my eye: 1. For sslmode=prefer, a modern v3 error during negotiation now results in a fallback to plaintext. For v16 this resulted in an immediate failure. (v2 errors retain the v16 behavior.) 2. For gssencmode=prefer, a legacy v2 error during negotiation now results in an immediate failure. In v16 it allowed fallback to SSL or plaintext depending on sslmode. Are both these changes intentional/desirable? Change #1 seems to partially undo the decision made in a49fbaaf: Don't assume that "E" response to NEGOTIATE_SSL_CODE means pre-7.0 server. These days, such a response is far more likely to signify a server-side problem, such as fork failure. [...] Hence, it seems best to just eliminate the assumption that backing off to non-SSL/2.0 protocol is the way to recover from an "E" response, and instead treat the server error the same as we would in non-SSL cases. They were not intentional. Let me think about the desirable part :-). By "negotiation", which part of the protocol are we talking about exactly? In the middle of the TLS handshake? After sending the startup packet? I think the behavior with v2 and v3 errors should be the same. And I think an immediate failure is appropriate on any v2/v3 error during negotiation, assuming we don't use those errors for things like "TLS not supported", which would warrant a fallback. -- Heikki Linnakangas Neon (https://neon.tech)
FYI: LLVM Runtime Crash
Hackers, FYI, I wanted to try using PostgreSQL with LLVM on my Mac, but the backend repeatedly crashes during `make check`. I found the same issue in master and REL_16_STABLE. The crash message is: FATAL: fatal llvm error: Unsupported stack probing method server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. connection to server was lost Same thing appears in the log output. I poked around and found a similar bug[1] was fixed in LLVM in December, but as I'm getting it with the latest release from 2 weeks ago, something else might be going on. I've opened a new LLVM bug report[2] for the issue. I don't *think* it's something that can be fixed in Postgres core. This is mostly an FYI in case anyone else runs into this issue or knows something more about it. In the meantime I'll be building without --with-llvm. Best, David [1]: https://github.com/llvm/llvm-project/issues/57171 [2]: https://github.com/llvm/llvm-project/issues/95804
[PATCH] Improve error message when trying to lock virtual tuple.
Hello, When currently trying to lock a virtual tuple the returned error will be a misleading `could not read block 0`. This patch adds a check for the tuple table slot being virtual to produce a clearer error. This can be triggered by extensions returning virtual tuples. While this is of course an error in those extensions the resulting error is very misleading. -- Regards, Sven Klemm From c5ccbf2e9cb401a7e63e93cd643b4dfec8d92ab8 Mon Sep 17 00:00:00 2001 From: Sven Klemm Date: Mon, 17 Jun 2024 17:38:27 +0200 Subject: [PATCH] Improve error message when trying to lock virtual tuple. When currently trying to lock a virtual tuple the returned error will be a misleading `could not read block 0`. This patch adds a check for the tuple table slot being virtual to produce a clearer error. --- src/backend/executor/nodeLockRows.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index 41754ddfea..cb3abbd348 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -65,6 +65,13 @@ lnext: return NULL; } + /* + * If the slot is virtual, we can't lock it. This should never happen, but + * this will lead to a misleading could not read block error later otherwise. + */ + if (TTS_IS_VIRTUAL(slot)) + elog(ERROR, "cannot lock virtual tuple"); + /* We don't need EvalPlanQual unless we get updated tuple version(s) */ epq_needed = false; -- 2.45.2
Re: [PATCH] Improve error message when trying to lock virtual tuple.
Hi, > When currently trying to lock a virtual tuple the returned error > will be a misleading `could not read block 0`. This patch adds a > check for the tuple table slot being virtual to produce a clearer > error. > > This can be triggered by extensions returning virtual tuples. > While this is of course an error in those extensions the resulting > error is very misleading. ``` +/* + * If the slot is virtual, we can't lock it. This should never happen, but + * this will lead to a misleading could not read block error later otherwise. + */ ``` I suggest dropping or rephrasing the "this should never happen" part. If this never happened we didn't need this check. Maybe "If the slot is virtual, we can't lock it. Fail early in order to provide an appropriate error message", or just "If the slot is virtual, we can't lock it". ``` elog(ERROR, "cannot lock virtual tuple"); ``` For some reason I thought that ereport() is the preferred way of throwing errors, but I see elog() used many times in ExecLockRows() so this is probably fine. -- Best regards, Aleksander Alekseev
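Putting both review comments together, the check might end up looking something like this (a sketch of the suggested wording, not a committed form):

    /*
     * If the slot is virtual, we can't lock it.  Fail early here to give a
     * clearer error than the "could not read block" failure that would
     * otherwise surface later.
     */
    if (TTS_IS_VIRTUAL(slot))
        elog(ERROR, "cannot lock virtual tuple");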
Re: RFC: adding pytest as a supported test framework
On 2024-06-17 Mo 4:27 AM, Matthias van de Meent wrote: Hi Greg, Jelte, On Sat, 15 Jun 2024 at 23:53, Greg Sabino Mullane wrote: > Those young-uns are also the same group who hold their nose when coding in C, and are always clamoring for rewriting Postgres in Rust. Could you point me to one occasion I have 'always' clamored for this, or any of "those young-uns" in the community? I may not be a huge fan of C, but rewriting PostgreSQL in [other language] is not on the list of things I'm clamoring for. I may have given off-hand mentions that [other language] would've helped in certain cases, sure, but I'd hardly call that clamoring. Greg was being a bit jocular here. I didn't take him seriously. But there's maybe a better case to make the point he was making. Back in the dark ages we used a source code control system called CVS. It's quite unlike git and has a great many limitations and uglinesses, and there was some pressure for us to move off it. If we had done so when it was first suggested, we would probably have moved to using Subversion, which is rather like CVS with many of the warts knocked off. Before long, some distributed systems like Mercurial and git came along, and we, like most of the world, chose git. Thus by waiting and not immediately doing what was suggested we got a better solution. Moving twice would have been ... painful. I have written Python in the past. Not a huge amount, but it doesn't feel like a foreign country to me, just the next town over instead of my immediate neighbourhood. We even have a python script in the buildfarm server code (not written by me). I'm sure if we started writing tests in Python I would adjust. But I think we need to know what the advantages are, beyond simple language preference. And getting to an equivalent place for Python to where we are with Perl will involve some work. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Direct SSL connection and ALPN loose ends
On Mon, Jun 17, 2024 at 8:24 AM Heikki Linnakangas wrote: > By "negotiation", which part of the protocol are we talking about > exactly? In the middle of the TLS handshake? After sending the startup > packet? By "negotiation" I mean the server's response to the startup packet. I.e. "supported"/"not supported"/"error". > I think the behavior with v2 and v3 errors should be the same. And I > think an immediate failure is appropriate on any v2/v3 error during > negotiation, assuming we don't use those errors for things like "TLS not > supported", which would warrant a fallback. For GSS encryption, it was my vague understanding that older servers respond with an error rather than the "not supported" indication. For TLS, though, the decision in a49fbaaf (immediate failure) seemed reasonable. Thanks, --Jacob
Re: Avoid orphaned objects dependencies, take 3
On Fri, Jun 14, 2024 at 3:54 AM Bertrand Drouvot wrote: > > Ah, right. So, I was assuming that, with either this version of your > > patch or the earlier version, we'd end up locking the constraint > > itself. Was I wrong about that? > > The child contraint itself is not locked when going through > ConstraintSetParentConstraint(). > > While at it, let's look at a full example and focus on your concern. I'm not at the point of having a concern yet, honestly. I'm trying to understand the design ideas. The commit message just says that we take a conflicting lock, but it doesn't mention which object types that principle does or doesn't apply to. I think the idea of skipping it for cases where it's redundant with the relation lock could be the right idea, but if that's what we're doing, don't we need to explain the principle somewhere? And shouldn't we also apply it across all object types that have the same property? Along the same lines: + /* + * Those don't rely on LockDatabaseObject() when being dropped (see + * AcquireDeletionLock()). Also it looks like they can not produce + * orphaned dependent objects when being dropped. + */ + if (object->classId == RelationRelationId || object->classId == AuthMemRelationId) + return; "It looks like X cannot happen" is not confidence-inspiring. At the very least, a better comment is needed here. But also, that relation has no exception for AuthMemRelationId, only for RelationRelationId. And also, the exception for RelationRelationId doesn't imply that we don't need a conflicting lock here: the special case for RelationRelationId in AcquireDeletionLock() is necessary because the lock tag format is different for relations than for other database objects, not because we don't need a lock at all. If the handling here were really symmetric with what AcquireDeletionLock(), the coding would be to call either LockRelationOid() or LockDatabaseObject() depending on whether classid == RelationRelationId. Now, that isn't actually necessary, because we already have relation-locking calls elsewhere, but my point is that the rationale this commit gives for WHY it isn't necessary seems to me to be wrong in multiple ways. So to try to sum up here: I'm not sure I agree with this design. But I also feel like the design is not as clear and consistently implemented as it could be. So even if I just ignored the question of whether it's the right design, it feels like we're a ways from having something potentially committable here, because of issues like the ones I mentioned in the last paragraph. -- Robert Haas EDB: http://www.enterprisedb.com
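For reference, the symmetric coding described in the previous paragraph — mirroring AcquireDeletionLock()'s special case for relations rather than returning early — would look roughly like this (a sketch only; the lock mode shown is illustrative, not taken from the patch):

    /* Mirror AcquireDeletionLock(): relations use a different lock tag. */
    if (object->classId == RelationRelationId)
        LockRelationOid(object->objectId, AccessExclusiveLock);
    else
        LockDatabaseObject(object->classId, object->objectId,
                           object->objectSubId, AccessExclusiveLock);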
Re: DROP OWNED BY fails to clean out pg_init_privs grants
> On 17 Jun 2024, at 16:56, Tom Lane wrote: > Daniel Gustafsson writes: >> I wonder if this will break any tools/scripts in prod which relies on the >> previous (faulty) behaviour. It will be interesting to see if anything shows >> up on -bugs. Off the cuff it seems like a good idea judging by where we are >> and what we can fix with it. > > Considering that SHARED_DEPENDENCY_INITACL has existed for less than > two months, it's hard to believe that any outside code has grown any > dependencies on it, much less that it couldn't be adjusted readily. Doh, I was thinking about it backwards, clearly not a worry =) >> I wonder if it's worth reverting passing the owner ID for v17 and revisiting >> that in 18 if we work on recording the ID. Shaving a few catalog lookups is >> generally worthwhile, doing them without needing the result for the next five >> years might bite us. > > Yeah, that was the direction I was leaning in, too. I'll commit the > revert of that separately, so that un-reverting it shouldn't be too > painful if we eventually decide to do so. Sounds good. -- Daniel Gustafsson
Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY
Hello, everyone.

> I think it is even possible to see a !alive index in the same situation (which is the worse case), but I was unable to reproduce it so far.

Fortunately, it is not possible. So, it seems like I have found the source of the problem:

1) infer_arbiter_indexes calls RelationGetIndexList to get the list of candidates. It does not lock the selected indexes in any additional way that would prevent index_concurrently_swap from changing them (setting and clearing validity).

RelationGetIndexList relcache.c:4857
infer_arbiter_indexes plancat.c:780
make_modifytable createplan.c:7097 -- node->arbiterIndexes = infer_arbiter_indexes(root);
create_modifytable_plan createplan.c:2826
create_plan_recurse createplan.c:532
create_plan createplan.c:349
standard_planner planner.c:421
planner planner.c:282
pg_plan_query postgres.c:904
pg_plan_queries postgres.c:996
exec_simple_query postgres.c:1193

2) another backend marks some index as invalid and commits

index_concurrently_swap index.c:1600
ReindexRelationConcurrently indexcmds.c:4115
ReindexIndex indexcmds.c:2814
ExecReindex indexcmds.c:2743
ProcessUtilitySlow utility.c:1567
standard_ProcessUtility utility.c:1067
ProcessUtility utility.c:523
PortalRunUtility pquery.c:1158
PortalRunMulti pquery.c:1315
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

3) the first backend invalidates its catalog snapshot because a transactional snapshot is taken

InvalidateCatalogSnapshot snapmgr.c:426
GetTransactionSnapshot snapmgr.c:278
PortalRunMulti pquery.c:1244
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

4) the first backend copies the indexes selected using the previous catalog snapshot

ExecInitModifyTable nodeModifyTable.c:4499 resultRelInfo->ri_onConflictArbiterIndexes = node->arbiterIndexes;
ExecInitNode execProcnode.c:177
InitPlan execMain.c:966
standard_ExecutorStart execMain.c:261
ExecutorStart execMain.c:137
ProcessQuery pquery.c:155
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

5) then it reads the indexes using a new, fresh snapshot

RelationGetIndexList relcache.c:4816
ExecOpenIndices execIndexing.c:175
ExecInsert nodeModifyTable.c:792 - ExecOpenIndices(resultRelInfo, onconflict != ONCONFLICT_NONE);
ExecModifyTable nodeModifyTable.c:4059
ExecProcNodeFirst execProcnode.c:464
ExecProcNode executor.h:274
ExecutePlan execMain.c:1646
standard_ExecutorRun execMain.c:363
ExecutorRun execMain.c:304
ProcessQuery pquery.c:160
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

6) and uses the arbiter selected with the stale snapshot together with the new index state (marked as invalid)

ExecInsert nodeModifyTable.c:1016 -- arbiterIndexes = resultRelInfo->ri_onConflictArbiterIndexes;
ExecInsert nodeModifyTable.c:1048 --- if (!ExecCheckIndexConstraints(resultRelInfo, slot, estate, conflictTid, arbiterIndexes))
ExecModifyTable nodeModifyTable.c:4059
ExecProcNodeFirst execProcnode.c:464
ExecProcNode executor.h:274
ExecutePlan execMain.c:1646
standard_ExecutorRun execMain.c:363
ExecutorRun execMain.c:304
ProcessQuery pquery.c:160
PortalRunMulti pquery.c:1277
PortalRun pquery.c:791
exec_simple_query postgres.c:1274

I have attached an updated test for the issue (it fails on an assert quickly and uses only 2 backends). The same issue may happen in the case of CREATE/DROP INDEX CONCURRENTLY as well. The simplest possible fix is to use ShareLock instead of ShareUpdateExclusiveLock in index_concurrently_swap:

oldClassRel = relation_open(oldIndexId, ShareLock);
newClassRel = relation_open(newIndexId, ShareLock);

But this is not a "concurrent" way.
But such an update should be fast enough, as far as I understand. Best regards, Mikhail. Subject: [PATCH] better test error report in case of invalid index used as arbiter test for issue with upsert fail --- Index: src/test/modules/test_misc/t/006_concurrently_unique_fail.pl IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP <+>UTF-8 === diff --git a/src/test/mo
Re: ecdh support causes unnecessary roundtrips
Hi, On 2024-06-17 12:00:30 +0200, Daniel Gustafsson wrote: > > On 17 Jun 2024, at 01:46, Andres Freund wrote: > > > When connecting with a libpq based client, the TLS establishment ends up > > like > > this in many configurations; > > > > C->S: TLSv1 393 Client Hello > > S->C: TLSv1.3 167 Hello Retry Request, Change Cipher Spec > > C->S: TLSv1.3 432 Change Cipher Spec, Client Hello > > S->C: TLSv1.3 1407 Server Hello, Application Data, Application Data, > > Application Data, Application Data > > I wonder if this is the result of us still using TLS 1.2 as the default > minimum > protocol version. In 1.2 the ClientHello doesn't contain the extensions for > cipher and curves, so the server must reply with a HRR (Hello Retry Request) > message asking the client to set protocol, curve and cipher. The client will > respond with a new Client Hello using 1.3 with the extensions. I'm pretty sure it's not that, given that a) removing the server side SSL_CTX_set_tmp_ecdh() call b) adding a client side SSL_CTX_set_tmp_ecdh() call avoid the roundtrip. I've now also confirmed that ssl_min_protocol_version=TLSv1.3 doesn't change anything (relevant, of course the indicated version support changes). > > This appears to be caused by ECDH support. The difference between the two > > ClientHellos is > > -Extension: key_share (len=38) x25519 > > +Extension: key_share (len=71) secp256r1 > > > > I.e. the clients wanted to use x25519, but the server insists on secp256r1. > > Somewhat related, Erica Zhang has an open patch to make the server-side curves > configuration take a list rather than a single curve [0], and modernizing the > API used as a side effect (SSL_CTX_set_tmp_ecdh is documented as obsoleted by > OpenSSL, but not deprecated with an API level). That does seem nicer. Fun coincidence in timing. > > I don't know if it's good that we're calling SSL_CTX_set_tmp_ecdh at all, > > To set the specified curve in ssl_ecdh_curve we have to don't we? Sure, but it's not obvious to me why we actually want to override openssl's defaults here. There's not even a parameter to opt out of forcing a specific choice on the server side. > > but if it is, shouldn't we at least do the same in libpq, so we don't > > introduce > > unnecessary roundtrips? > > If we don't set the curve in the client I believe OpenSSL will pass the set of > supported curves the client has, which then should allow the server to pick > the > one it wants based on ssl_ecdh_curve, so ideally we shouldn't have to I think. Afaict the client sends exactly one: Transport Layer Security TLSv1.3 Record Layer: Handshake Protocol: Client Hello Content Type: Handshake (22) Version: TLS 1.0 (0x0301) Length: 320 Handshake Protocol: Client Hello ... Extension: supported_versions (len=5) TLS 1.3, TLS 1.2 Type: supported_versions (43) Length: 5 Supported Versions length: 4 Supported Version: TLS 1.3 (0x0304) Supported Version: TLS 1.2 (0x0303) Extension: psk_key_exchange_modes (len=2) Type: psk_key_exchange_modes (45) Length: 2 PSK Key Exchange Modes Length: 1 PSK Key Exchange Mode: PSK with (EC)DHE key establishment (psk_dhe_ke) (1) Extension: key_share (len=38) x25519 Type: key_share (51) Length: 38 Key Share extension Extension: compress_certificate (len=5) Type: compress_certificate (27) ... Note key_share being set to x25519. The HRR says: Extension: key_share (len=2) secp256r1 Type: key_share (51) Length: 2 Key Share extension Selected Group: secp256r1 (23) > > I did confirm that doing the same thing on the client side removes the > > additional roundtrip. 
> > The roundtrip went away because the client was set to use secp256r1? Yes. Or if I change the server to not set the ecdh curve. > I wonder if that made OpenSSL override the min protocol version and switch > to a TLS1.3 ClientHello since it otherwise couldn't announce the curve. The client seems to announce the curve in the initial ClientHello even with 1.3 as the minimum version. What *does* make the HRR go away is setting ssl_max_protocol_version=TLSv1.2 on the client side. https://wiki.openssl.org/index.php/TLS1.3 says: > In practice most clients will use X25519 or P-256 for their initial > key_share. For maximum performance it is recommended that servers are > configured to support at least those two groups and clients use one of those > two for its initial key_share. This is the default case (OpenSSL clients > will use X25519). We're not allowing both groups and the client defaults to X25519, hence the HRR. > If you force the client min protocol to 1.3 in an unpatched client, do you > see the same speedup? Nope. Greetings, Andres Freund
Re: ecdh support causes unnecessary roundtrips
On Mon, Jun 17, 2024 at 10:01 AM Andres Freund wrote: > On 2024-06-17 12:00:30 +0200, Daniel Gustafsson wrote: > > To set the specified curve in ssl_ecdh_curve we have to don't we? > > Sure, but it's not obvious to me why we actually want to override openssl's > defaults here. There's not even a parameter to opt out of forcing a specific > choice on the server side. I had exactly the same question in the context of the other thread, and found https://www.openssl.org/blog/blog/2022/10/21/tls-groups-configuration/index.html My initial takeaway was that our default is more restrictive than it should be, but the OpenSSL default is more permissive than what they recommend in practice, due to denial of service concerns: > A general recommendation is to limit the groups to those that meet the > required security level and that all the potential TLS clients support. --Jacob
Re: ecdh support causes unnecessary roundtrips
> On 17 Jun 2024, at 19:01, Andres Freund wrote: > On 2024-06-17 12:00:30 +0200, Daniel Gustafsson wrote: >>> On 17 Jun 2024, at 01:46, Andres Freund wrote: >>> I don't know if it's good that we're calling SSL_CTX_set_tmp_ecdh at all, >> >> To set the specified curve in ssl_ecdh_curve we have to don't we? > > Sure, but it's not obvious to me why we actually want to override openssl's > defaults here. There's not even a parameter to opt out of forcing a specific > choice on the server side. I agree that the GUC is a bit rough around the edges, maybe leaving it blank or something should be defined as "OpenSSL defaults". Let's bring that to Erica's patch for allowing a list of curves. >>> I did confirm that doing the same thing on the client side removes the >>> additional roundtrip. >> >> The roundtrip went away because the client was set to use secp256r1? > > Yes. Or if I change the server to not set the ecdh curve. Configuring the server to use x25519 instead of secp256r1 should achieve the same thing. >> I wonder if that made OpenSSL override the min protocol version and switch >> to a TLS1.3 ClientHello since it otherwise couldn't announce the curve. > > The client seems to announce the curve in the initial ClientHello even with > 1.3 as the minimum version. With 1.3 it should announce it in ClientHello, do you mean that it's announced when 1.2 is the minimum version as well? It does make sense since a 1.2 server is defined to disregard all extensions. > What *does* make the HRR go away is setting ssl_max_protocol_version=TLSv1.2 > on the client side. Makes sense, that would remove the curve and there is no change required. > https://wiki.openssl.org/index.php/TLS1.3 says: > >> In practice most clients will use X25519 or P-256 for their initial >> key_share. For maximum performance it is recommended that servers are >> configured to support at least those two groups and clients use one of those >> two for its initial key_share. This is the default case (OpenSSL clients >> will use X25519). > > We're not allowing both groups and the client defaults to X25519, hence > the HRR. So this would be solved by the curve-list patch referenced above, especially if we allow it to have an opt-out to use OpenSSL defaults. -- Daniel Gustafsson
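As a rough illustration of where such a patch could land, allowing more than one group on the server side with the newer OpenSSL API might look like this (a sketch only; the variable name and error handling mimic be-secure-openssl.c, and the group list is just an example):

    /*
     * Accept both of the groups modern clients commonly offer first, so the
     * client's initial key_share can be used without a HelloRetryRequest.
     */
    if (SSL_CTX_set1_groups_list(context, "X25519:P-256") != 1)
        ereport(FATAL,
                (errmsg("failed to set ECDH/ECDHE group list")));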
tls 1.3: sending multiple tickets
Hi, To investigate an unrelated issue, I set up key logging in the backend (we probably should build that in) and looked at the decrypted data. And noticed that just after TLS setup finished the server sends three packets in a row: C->S: TLSv1.3: finished C->S: TLSv1.3: application data (startup message) S->C: TLSv1.3: new session ticket S->C: TLSv1.3: new session ticket S->C: TLSv1.3: application data (authentication ok, parameter status+) We try to turn off session resumption, but not completely enough for 1.3: SSL_OP_NO_TICKET SSL/TLS supports two mechanisms for resuming sessions: session ids and stateless session tickets. When using session ids a copy of the session information is cached on the server and a unique id is sent to the client. When the client wishes to resume it provides the unique id so that the server can retrieve the session information from its cache. When using stateless session tickets the server uses a session ticket encryption key to encrypt the session information. This encrypted data is sent to the client as a "ticket". When the client wishes to resume it sends the encrypted data back to the server. The server uses its key to decrypt the data and resume the session. In this way the server can operate statelessly - no session information needs to be cached locally. The TLSv1.3 protocol only supports tickets and does not directly support session ids. However, OpenSSL allows two modes of ticket operation in TLSv1.3: stateful and stateless. Stateless tickets work the same way as in TLSv1.2 and below. Stateful tickets mimic the session id behaviour available in TLSv1.2 and below. The session information is cached on the server and the session id is wrapped up in a ticket and sent back to the client. When the client wishes to resume, it presents a ticket in the same way as for stateless tickets. The server can then extract the session id from the ticket and retrieve the session information from its cache. By default OpenSSL will use stateless tickets. The SSL_OP_NO_TICKET option will cause stateless tickets to not be issued. In TLSv1.2 and below this means no ticket gets sent to the client at all. In TLSv1.3 a stateful ticket will be sent. This is a server-side option only. In TLSv1.3 it is possible to suppress all tickets (stateful and stateless) from being sent by calling SSL_CTX_set_num_tickets(3) or SSL_set_num_tickets(3). Note the second to last paragraph: Because we use SSL_OP_NO_TICKET we trigger use of stateful tickets. Which afaict are never going to be useful, because we don't share the necessary state. I guess openssl really could have inferred this from the fact that we *do* call SSL_CTX_set_session_cache_mode(SSL_SESS_CACHE_OFF), but it doesn't. Seems we ought to use SSL_CTX_set_num_tickets() to prevent issuing the useless tickets? It seems like a buglet in openssl that it forces each session ticket to be sent in its own packet (it does an explicit BIO_flush(), so even if we buffered between openssl and OS, as I think we should, we'd still send it separately), but I don't really understand most of this stuff. Greetings, Andres Freund
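For concreteness, suppressing the tickets per that suggestion could look roughly like the following sketch, assuming an OpenSSL 1.1.1+ server context; the helper name and its placement are illustrative rather than actual be-secure-openssl.c code:

    #include <openssl/ssl.h>

    /*
     * Sketch: turn off session resumption completely, including the
     * TLSv1.3 stateful tickets that SSL_OP_NO_TICKET alone leaves enabled.
     */
    static void
    disable_session_resumption(SSL_CTX *context)
    {
        /* what is done today: no stateless tickets, no session cache */
        SSL_CTX_set_options(context, SSL_OP_NO_TICKET);
        SSL_CTX_set_session_cache_mode(context, SSL_SESS_CACHE_OFF);

        /* additionally request zero tickets, so TLSv1.3 sends none at all */
        SSL_CTX_set_num_tickets(context, 0);
    }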
Re: ecdh support causes unnecessary roundtrips
Hi, On 2024-06-17 19:29:47 +0200, Daniel Gustafsson wrote: > >> I wonder if that made OpenSSL override the min protocol version and switch > >> to a TLS1.3 ClientHello since it otherwise couldn't announce the curve. > > > > The client seems to announce the curve in the initial ClientHello even with > > 1.3 as the minimum version. > > With 1.3 it should announce it in ClientHello, do you mean that it's announced > when 1.2 is the minimum version as well? It does make sense since a 1.2 > server > is defined to disregard all extensions. Yes, it's announced even when 1.2 is the minimum: Extension: supported_versions (len=5) TLS 1.3, TLS 1.2 Type: supported_versions (43) Length: 5 Supported Versions length: 4 Supported Version: TLS 1.3 (0x0304) Supported Version: TLS 1.2 (0x0303) ... Extension: key_share (len=38) x25519 Type: key_share (51) Length: 38 Key Share extension > Let's bring that to Erica's patch for allowing a list of curves. I'm kinda wondering if we ought to do something about this in the backbranches. Forcing unnecessary roundtrips onto everyone for the next five years due to an oversight on our part isn't great. Once you're not local, the roundtrip does measurably increase the "time to first query". Greetings, Andres Freund
Re: ecdh support causes unnecessary roundtrips
> On 17 Jun 2024, at 19:44, Andres Freund wrote: >> Let's bring that to Erica's patch for allowing a list of curves. > > I'm kinda wondering if we ought to do something about this in the > backbranches. Forcing unnecessary roundtrips onto everyone for the next five > years due to an oversight on our part isn't great. Once you're not local, the > roundtrip does measurably increase the "time to first query". I don't disagree, but wouldn't it be the type of behavioural change which we typically try to avoid in backbranches? Changing the default of the ecdh GUC would perhaps be doable? (assuming that's a working solution to avoid the roundtrip). Amending the documentation is the one thing we certainly can do but 99.9% of affected users won't know they are affected so won't look for that section. -- Daniel Gustafsson
Re: ecdh support causes unnecessary roundtrips
Hi, On 2024-06-17 19:51:45 +0200, Daniel Gustafsson wrote: > > On 17 Jun 2024, at 19:44, Andres Freund wrote: > > >> Let's bring that to Erica's patch for allowing a list of curves. > > > > I'm kinda wondering if we ought to do something about this in the > > backbranches. Forcing unnecessary roundtrips onto everyone for the next five > > years due to an oversight on our part isn't great. Once you're not local, > > the > > roundtrip does measurably increase the "time to first query". > > I don't disagree, but wouldn't it be the type of behavioural change which we > typically try to avoid in backbranches? Yea, it's not great. Not sure what the right thing is here. > Changing the default of the ecdh GUC would perhaps be doable? I was wondering whether we could change the default so that it accepts both x25519 and secp256r1. Unfortunately that seems to require changing what we use to set the parameter... > (assuming that's a working solution to avoid the roundtrip). It is. > Amending the documentation is the one thing we certainly can do but 99.9% of > affected users won't know they are affected so won't look for that section. Yea. It's also possible that some other bindings changed their default to match ours... Greetings, Andres Freund
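For illustration, accepting both common client defaults would be a one-liner with the group-list API; the helper name and the hard-coded group string are assumptions, not the eventual patch:

    #include <openssl/ssl.h>

    /*
     * Sketch: allow both initial key_share groups that clients commonly
     * send, so neither an X25519-default nor a P-256-default client needs
     * a HelloRetryRequest round trip.
     */
    static bool
    set_default_groups(SSL_CTX *context)
    {
        return SSL_CTX_set1_groups_list(context, "X25519:prime256v1") == 1;
    }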
Re: IPC::Run accepts bug reports
On Sat, Jun 15, 2024 at 7:48 PM Noah Misch wrote: > Separating this from the pytest thread: > > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote: > > The one > > thing I know about that *I* think is a pretty big problem about Perl > > is that IPC::Run is not really maintained. > > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything > affecting PostgreSQL. If you know of IPC::Run defects, please report them. > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on > it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175 > NetBSD-10-specific behavior coping. I'm not concerned about any specific open issue; my concern is about the health of that project. https://metacpan.org/pod/IPC::Run says that this module is seeking new maintainers, and it looks like the people listed as current maintainers are mostly inactive. Instead, you're fixing stuff. That's great, but we ideally want PostgreSQL's dependencies to be things that are used widely enough that we don't end up maintaining them ourselves. I apologize if my comment came across as disparaging your efforts; that was not my intent. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Avoid orphaned objects dependencies, take 3
Hi, On Mon, Jun 17, 2024 at 12:24:46PM -0400, Robert Haas wrote: > On Fri, Jun 14, 2024 at 3:54 AM Bertrand Drouvot > wrote: > > > Ah, right. So, I was assuming that, with either this version of your > > > patch or the earlier version, we'd end up locking the constraint > > > itself. Was I wrong about that? > > > > The child constraint itself is not locked when going through > > ConstraintSetParentConstraint(). > > > > While at it, let's look at a full example and focus on your concern. > > I'm not at the point of having a concern yet, honestly. I'm trying to > understand the design ideas. The commit message just says that we take > a conflicting lock, but it doesn't mention which object types that > principle does or doesn't apply to. I think the idea of skipping it > for cases where it's redundant with the relation lock could be the > right idea, but if that's what we're doing, don't we need to explain > the principle somewhere? And shouldn't we also apply it across all > object types that have the same property? Yeah, I still need to deeply study this area and document it. > Along the same lines: > > + /* > + * Those don't rely on LockDatabaseObject() when being dropped (see > + * AcquireDeletionLock()). Also it looks like they can not produce > + * orphaned dependent objects when being dropped. > + */ > + if (object->classId == RelationRelationId || object->classId == > AuthMemRelationId) > + return; > > "It looks like X cannot happen" is not confidence-inspiring. Yeah, it is not. It is just a "feeling" that I need to work on to remove any ambiguity and/or adjust the code as needed. > At the > very least, a better comment is needed here. But also, that function > has no exception for AuthMemRelationId, only for RelationRelationId. > And also, the exception for RelationRelationId doesn't imply that we > don't need a conflicting lock here: the special case for > RelationRelationId in AcquireDeletionLock() is necessary because the > lock tag format is different for relations than for other database > objects, not because we don't need a lock at all. If the handling here > were really symmetric with what AcquireDeletionLock() does, the coding > would be to call either LockRelationOid() or LockDatabaseObject() > depending on whether classid == RelationRelationId. Agree. > Now, that isn't > actually necessary, because we already have relation-locking calls > elsewhere, but my point is that the rationale this commit gives for > WHY it isn't necessary seems to me to be wrong in multiple ways. Agree. I'm not done with that part yet (should have made it more clear). > So to try to sum up here: I'm not sure I agree with this design. But I > also feel like the design is not as clear and consistently implemented > as it could be. So even if I just ignored the question of whether it's > the right design, it feels like we're a ways from having something > potentially committable here, because of issues like the ones I > mentioned in the last paragraph. > Agree. I'll now move on with the "XXX Do we need a lock for RelationRelationId?" comments that I put in v10 (see [1]) and study all the cases around them. Once done, I think that it will be easier to 1. remove ambiguity, 2. document, and 3. do the "right" thing regarding the RelationRelationId object class. [1]: https://www.postgresql.org/message-id/ZnAVEBhlGvpDDVOD%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
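For reference, the "symmetric" coding Robert describes would presumably look something like this sketch, mirroring AcquireDeletionLock(); the function name is invented and this is not the patch under discussion:

    #include "postgres.h"
    #include "catalog/objectaddress.h"
    #include "catalog/pg_class.h"        /* RelationRelationId */
    #include "storage/lmgr.h"

    /*
     * Relations need the relation lock tag; every other object class uses
     * the generic database-object tag, as AcquireDeletionLock() does.
     */
    static void
    lock_dependent_object(const ObjectAddress *object, LOCKMODE lockmode)
    {
        if (object->classId == RelationRelationId)
            LockRelationOid(object->objectId, lockmode);
        else
            LockDatabaseObject(object->classId, object->objectId, 0, lockmode);
    }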
Re: improve predefined roles documentation
On Thu, Jun 13, 2024 at 3:48 PM Nathan Bossart wrote: > I think we could improve matters by abandoning the table and instead > documenting these roles more like we document GUCs, i.e., each one has a > section below it where we can document it in as much detail as we want. > Some of these roles should probably be documented together (e.g., > pg_read_all_data and pg_write_all_data), so the ordering is unlikely to be > perfect, but I'm hoping it would still be a net improvement. +1. I'm not sure about all of the details, but I like the general idea. -- Robert Haas EDB: http://www.enterprisedb.com
Re: IPC::Run accepts bug reports
Hi, On 2024-06-15 16:48:24 -0700, Noah Misch wrote: > Separating this from the pytest thread: > > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote: > > The one > > thing I know about that *I* think is a pretty big problem about Perl > > is that IPC::Run is not really maintained. > > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything > affecting PostgreSQL. If you know of IPC::Run defects, please report them. > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on > it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175 > NetBSD-10-specific behavior coping. 1) Sometimes hangs hard on windows if started processes have not been shut down before script exits. I've mostly encountered this via the buildfarm / CI, so I never had a good way of narrowing this down. It's very painful because things seem to often just get stuck once that happens. 2) If a subprocess dies at an inopportune moment, IPC::Run dies with "ack Broken pipe:" (in _do_filters()). There are plenty of reports of this on the list, and I've hit this several times personally. It seems to be timing dependent; I've encountered it after seemingly irrelevant ordering changes. I suspect I could create a reproducer with a bit of time. 3) It's very slow on windows (in addition to the windows process slowness). That's got to be fixable to some degree. Greetings, Andres Freund
Re: Document NULL
On Sat, May 11, 2024 at 11:00 AM David G. Johnston < david.g.johns...@gmail.com> wrote: > Though I haven’t settled on a phrasing I really like. But I’m trying to > avoid a parenthetical. > > Settled on this: The cardinal rule, a null value is neither equal nor unequal to any value, including other null values. I've been tempted to just say, "to any value.", but cannot quite bring myself to do it... David J.
Re: Direct SSL connection and ALPN loose ends
Hi, On 2024-04-29 18:24:04 +0300, Heikki Linnakangas wrote: > All reported bugs have now been fixed. We now enforce ALPN in all the right > places. Please let me know if I missed something. Very minor and not really your responsibility: If provided with the necessary key information, wireshark can decode TLS exchanges when using sslnegotiation=postgres but not with direct. Presumably it needs to be taught postgres' ALPN id or something. Example with direct: 476 6513.310308457 192.168.0.113 → 192.168.0.200 48978 5432 142 TLSv1.3 Finished 477 6513.310341492 192.168.0.113 → 192.168.0.200 48978 5432 151 TLSv1.3 Application Data 478 6513.320730295 192.168.0.200 → 192.168.0.113 5432 48978 147 TLSv1.3 New Session Ticket 479 6513.320745684 192.168.0.200 → 192.168.0.113 5432 48978 147 TLSv1.3 New Session Ticket 480 6513.321175713 192.168.0.113 → 192.168.0.200 48978 5432 68 TCP 48978 → 5432 [ACK] Seq=915 Ack=1665 Win=62848 Len=0 TSval=3779915421 TSecr=3469016093 481 6513.323161553 192.168.0.200 → 192.168.0.113 5432 48978 518 TLSv1.3 Application Data 482 6513.323626180 192.168.0.113 → 192.168.0.200 48978 5432 125 TLSv1.3 Application Data 483 6513.333977769 192.168.0.200 → 192.168.0.113 5432 48978 273 TLSv1.3 Application Data 484 6513.334581920 192.168.0.113 → 192.168.0.200 48978 5432 95 TLSv1.3 Application Data 485 6513.334666116 192.168.0.113 → 192.168.0.200 48978 5432 92 TLSv1.3 Alert (Level: Warning, Description: Close Notify) Example with postgres: 502 6544.752799560 192.168.0.113 → 192.168.0.200 46300 5432 142 TLSv1.3 Finished 503 6544.752842863 192.168.0.113 → 192.168.0.200 46300 5432 151 PGSQL >? 504 6544.76315 192.168.0.200 → 192.168.0.113 5432 46300 147 TLSv1.3 New Session Ticket 505 6544.763163155 192.168.0.200 → 192.168.0.113 5432 46300 147 TLSv1.3 New Session Ticket 506 6544.763587595 192.168.0.113 → 192.168.0.200 46300 5432 68 TCP 46300 → 5432 [ACK] Seq=923 Ack=1666 Win=62848 Len=0 TSval=3779946864 TSecr=3469047536 507 6544.765024827 192.168.0.200 → 192.168.0.113 5432 46300 518 PGSQL Q 509 6544.776974164 192.168.0.200 → 192.168.0.113 5432 46300 273 PGSQL X 511 6544.777631520 192.168.0.113 → 192.168.0.200 46300 5432 92 TLSv1.3 Alert (Level: Warning, Description: Close Notify) Note that in the second one it knows what's inside the "Application Data" messages and decodes them (S: startup, authentication ok, parameters, cancel key, ready for query, C: simple query, S: description, 10 rows, command complete, ready for query). In the GUI you can obviously go into the "postgres messages" in more detail than I know how to do on the console. A second aspect is that I'm not super happy about the hack of stashing data into Port. I think medium term we'd be better off separating out the buffering for unencrypted and encrypted data properly. It turns out that not having any buffering *below* openssl (i.e. the encrypted data) hurts both for the send and receive side, due to a) increased number of syscalls b) too many small packets being sent, as we use TCP_NODELAY c) kernel memory copies being slower due to the small increments. Greetings, Andres Freund
Re: Refactoring backend fork+exec code
While looking into [0], I noticed that main() still only checks for the --fork prefix, but IIUC commit aafc05d removed all --fork* options except for --forkchild. I've attached a patch to strengthen the check in main(). This is definitely just a nitpick. [0] https://postgr.es/m/CAKAnmmJkZtZAiSryho%3DgYpbvC7H-HNjEDAh16F3SoC9LPu8rqQ%40mail.gmail.com -- nathan diff --git a/src/backend/main/main.c b/src/backend/main/main.c index bfd0c5ed65..4672aab837 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -185,7 +185,7 @@ main(int argc, char *argv[]) else if (argc > 1 && strcmp(argv[1], "--boot") == 0) BootstrapModeMain(argc, argv, false); #ifdef EXEC_BACKEND - else if (argc > 1 && strncmp(argv[1], "--fork", 6) == 0) + else if (argc > 1 && strncmp(argv[1], "--forkchild", 11) == 0) SubPostmasterMain(argc, argv); #endif else if (argc > 1 && strcmp(argv[1], "--describe-config") == 0)
Re: Add support to TLS 1.3 cipher suites and curves lists
Hi, This thread was referenced by https://www.postgresql.org/message-id/48F0A1F8-E0B4-41F8-990F-41E6BA2A6185%40yesql.se On 2024-06-13 14:34:27 +0800, Erica Zhang wrote: > diff --git a/src/backend/libpq/be-secure-openssl.c > b/src/backend/libpq/be-secure-openssl.c > index 39b1a66236..d097e81407 100644 > --- a/src/backend/libpq/be-secure-openssl.c > +++ b/src/backend/libpq/be-secure-openssl.c > @@ -1402,30 +1402,30 @@ static bool > initialize_ecdh(SSL_CTX *context, bool isServerStart) > { > #ifndef OPENSSL_NO_ECDH > - EC_KEY *ecdh; > - int nid; > + char*curve_list = strdup(SSLECDHCurve); ISTM we'd want to eventually rename the GUC variable to indicate it's a list? I think the "ecdh" portion is actually not accurate anymore either, it's used outside of ecdh if I understand correctly (probably I am not)? > + char*saveptr; > + char*token = strtok_r(curve_list, ":", &saveptr); > + int nid; > > - nid = OBJ_sn2nid(SSLECDHCurve); > - if (!nid) > + while (token != NULL) It'd be good to have a comment explaining why we're parsing the list ourselves instead of using just the builtin SSL_CTX_set1_groups_list(). > { > - ereport(isServerStart ? FATAL : LOG, > + nid = OBJ_sn2nid(token); > + if (!nid) > + {ereport(isServerStart ? FATAL : LOG, > (errcode(ERRCODE_CONFIG_FILE_ERROR), > - errmsg("ECDH: unrecognized curve name: %s", > SSLECDHCurve))); > + errmsg("ECDH: unrecognized curve name: %s", > token))); > return false; > + } > + token = strtok_r(NULL, ":", &saveptr); > } > > - ecdh = EC_KEY_new_by_curve_name(nid); > - if (!ecdh) > + if(SSL_CTX_set1_groups_list(context, SSLECDHCurve) !=1) > { > ereport(isServerStart ? FATAL : LOG, > (errcode(ERRCODE_CONFIG_FILE_ERROR), > - errmsg("ECDH: could not create key"))); > + errmsg("ECDH: failed to set curve names"))); Probably worth including the value of the GUC here? This also needs to change the documentation for the GUC. Once we have this parameter we probably should add at least x25519 to the allowed list, as that's the client side default these days. But that can be done in a separate patch. Greetings, Andres Freund
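A rough sketch of the simpler shape these review comments point at -- letting OpenSSL parse and validate the list itself and reporting the GUC value on failure -- might be (illustrative only, not the proposed patch):

    /*
     * Sketch of initialize_ecdh() using the builtin list parser; error
     * code and GUC plumbing follow the existing code.
     */
    static bool
    initialize_ecdh(SSL_CTX *context, bool isServerStart)
    {
    #ifndef OPENSSL_NO_ECDH
        if (SSL_CTX_set1_groups_list(context, SSLECDHCurve) != 1)
        {
            ereport(isServerStart ? FATAL : LOG,
                    (errcode(ERRCODE_CONFIG_FILE_ERROR),
                     errmsg("ECDH: failed to set curve names: %s",
                            SSLECDHCurve)));
            return false;
        }
    #endif
        return true;
    }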
Re: Virtual generated columns
On Wed, 2024-05-22 at 19:22 +0200, Peter Eisentraut wrote: > On 29.04.24 10:23, Peter Eisentraut wrote: > > Here is a patch set to implement virtual generated columns. > > > The main feature patch (0005 here) generally works but has a number > > of > > open corner cases that need to be thought about and/or fixed, many > > of > > which are marked in the code or the tests. I'll continue working > > on > > that. But I wanted to see if I can get some feedback on the test > > structure, so I don't have to keep changing it around later. > > Here is an updated patch set. It needed some rebasing, especially > around the reverting of the catalogued not-null constraints. I have > also fixed up the various incomplete or "fixme" pieces of code > mentioned > above. I have in most cases added "not supported yet" error messages > for now, with the idea that some of these things can be added in > later, > as incremental features. > This is not (yet) a full review. Patches applied cleanly on 76618097a6c027ec603a3dd143f61098e3fb9794 from 2024-06-14. I've run ./configure && make world-bin && make check && make check-world on 0001, then 0001+0002, then 0001+0002+0003, up to applying all 5 patches. All cases passed on Debian unstable on aarch64 (ARM64) on gcc (Debian 13.2.0-25) 13.2.0. v1-0001-Rename-regress-test-generated-to-generated_stored.patch: no objections here, makes sense as preparation for future changes v1-0002-Put-generated_stored-test-objects-in-a-schema.patch: also no objections. OTOH other tests (like publication.out, rowsecurity.out, stats_ext.out, tablespace.out) are creating schemas and later dropping them - so here it might also make sense to drop the schema at the end of testing. v1-0003-Remove-useless-initializations.patch: All other cases (I checked directory src/backend/utils/cache) calling MemoryContextAllocZero only initialize fields when they are non-zero, so removing partial initialization with false brings consistency to the code. v1-0004-Remove-useless-code.patch: Patch removes filling in of constraints from function BuildDescForRelation. This function is only called from file view.c and tablecmds.c (twice). In none of those cases result->constr is used, so the proposed change makes sense. While I do not know the code well, so I might be wrong here, I would apply this patch. I haven't looked at the most important (and biggest) file yet, v1-0005-Virtual-generated-columns.patch; hope to look at it this week. At the same time I believe 0001-0004 can be applied - even backported if it'll make maintenance of future changes easier. But that should be the committer's decision. Best regards -- Tomasz Rybak, Debian Developer GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C signature.asc Description: This is a digitally signed message part
cost delay brainstorming
Hi, As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest problem with autovacuum as it exists today is that the cost delay is sometimes too low to keep up with the amount of vacuuming that needs to be done. I sketched a solution during the talk, but it was very complicated, so I started to try to think of simpler ideas that might still solve the problem, or at least be better than what we have today. I think we might be able to get fairly far by observing that if the number of running autovacuum workers is equal to the maximum allowable number of running autovacuum workers, that may be a sign of trouble, and the longer that situation persists, the more likely it is that we're in trouble. So, a very simple algorithm would be: If the maximum number of workers have been running continuously for more than, say, 10 minutes, assume we're falling behind and exempt all workers from the cost limit for as long as the situation persists. One could criticize this approach on the grounds that it causes a very sudden behavior change instead of, say, allowing the rate of vacuuming to gradually increase. I'm curious to know whether other people think that would be a problem. I think it might be OK, for a couple of reasons: 1. I'm unconvinced that the vacuum_cost_delay system actually prevents very many problems. I've fixed a lot of problems by telling users to raise the cost limit, but virtually never by lowering it. When we lowered the delay by an order of magnitude a few releases ago - equivalent to increasing the cost limit by an order of magnitude - I didn't personally hear any complaints about that causing problems. So disabling the delay completely some of the time might just be fine. 1a. Incidentally, when I have seen problems because of vacuum running "too fast", it's not been because it was using up too much I/O bandwidth, but because it's pushed too much data out of cache too quickly. A long overnight vacuum can evict a lot of pages from the system page cache by morning - the ring buffer only protects our shared_buffers, not the OS cache. I don't think this can be fixed by rate-limiting vacuum, though: to keep the cache eviction at a level low enough that you could be certain of not causing trouble, you'd have to limit it to an extremely low rate which would just cause vacuuming not to keep up. The cure would be worse than the disease at that point. 2. If we decided to gradually increase the rate of vacuuming instead of just removing the throttling all at once, what formula would we use and why would that be the right idea? We'd need a lot of state to really do a calculation of how fast we would need to go in order to keep up, and that starts to rapidly turn into a very complicated project along the lines of what I mooted in Vancouver. Absent that, the only other idea I have is to gradually ramp up the cost limit higher and higher, which we could do, but we would have no idea how fast to ramp it up, so anything we do here feels like it's just picking random numbers and calling them an algorithm. If you like this idea, I'd like to know that, and hear any further thoughts you have about how to improve or refine it. If you don't, I'd like to know that, too, and any alternatives you can propose, especially alternatives that don't require crazy amounts of new infrastructure to implement. -- Robert Haas EDB: http://www.enterprisedb.com
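To make the proposal a bit more concrete, the trigger could be as small as the following sketch; the function, the caller-maintained "all workers busy since" timestamp, and the threshold constant are invented names, not existing autovacuum.c code:

    #include "postgres.h"
    #include "utils/timestamp.h"

    #define AV_FALLING_BEHIND_MS    (10 * 60 * 1000)    /* 10 minutes */

    /*
     * Sketch: once every worker slot has been continuously busy for longer
     * than the threshold, stop applying the cost limit.
     */
    static bool
    av_should_ignore_cost_limit(int nworkers_running, int max_workers,
                                TimestampTz all_workers_busy_since)
    {
        if (nworkers_running < max_workers)
            return false;       /* not saturated: keep throttling */

        return TimestampDifferenceExceeds(all_workers_busy_since,
                                          GetCurrentTimestamp(),
                                          AV_FALLING_BEHIND_MS);
    }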
Re: FYI: LLVM Runtime Crash
On Jun 17, 2024, at 11:52, David E. Wheeler wrote: > I don’t *think* it’s something that can be fixed in Postgres core. This is > mostly in FYI in case anyone else runs into this issue or knows something > more about it. Okay, a response to the issue[1] says the bug is in Postgres: > The error message is LLVM reporting the backend can't handle the particular > form of "probe-stack" attribute in the input LLVM IR. So this is likely a bug > in the way postgres is generating LLVM IR: please file a bug against > Postgres. (Feel free to reopen if you have some reason to believe the issue > is on the LLVM side.) Would it be best for me to send a report to pgsql-bugs? Best, David [1]: https://github.com/llvm/llvm-project/issues/95804#issuecomment-2174310977
Re: [PATCH] Improve error message when trying to lock virtual tuple.
(now send a copy to -hackers, too) On Mon, 17 Jun 2024 at 17:55, Sven Klemm wrote: > > Hello, > > When currently trying to lock a virtual tuple the returned error > will be a misleading `could not read block 0`. This patch adds a > check for the tuple table slot being virtual to produce a clearer > error. > > This can be triggered by extensions returning virtual tuples. > While this is of course an error in those extensions the resulting > error is very misleading. I think you're solving the wrong problem here, as I can't think of a place where both virtual tuple slots and tuple locking are allowed at the same time in core code. I mean, in which kind of situation could we get a Relation's table slot which is not lockable by said relation's AM? Assuming the "could not read block 0" error comes from the heap code, why does the assertion in heapam_tuple_lock that checks for a BufferHeapTupleTableSlot not fire before this `block 0` error? If the error is not in the heapam code, could you show an example of the code that breaks with that error code? Kind regards, Matthias van de Meent Neon (https://neon.tech)
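For readers following along, the kind of check being debated presumably has roughly this shape (the actual patch is not reproduced here; the function wrapper is purely illustrative):

    #include "postgres.h"
    #include "executor/tuptable.h"

    /* Sketch: fail with a clear message instead of "could not read block 0". */
    static void
    reject_virtual_slot(TupleTableSlot *slot)
    {
        if (TTS_IS_VIRTUAL(slot))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot lock a virtual tuple")));
    }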
Xact end leaves CurrentMemoryContext = TopMemoryContext
AtCommit_Memory and friends have done $SUBJECT for at least a couple of decades, but in the wake of analyzing bug #18512 [1], I'm feeling like that's a really bad idea. There is too much code running around the system that assumes that it's fine to leak stuff in CurrentMemoryContext. If we execute any such thing between AtCommit_Memory and the next AtStart_Memory, presto: we have a session-lifespan memory leak. I'm almost feeling that we should have a policy that CurrentMemoryContext should never point at TopMemoryContext. As to what to do about it: I'm imagining that instead of resetting CurrentMemoryContext to TopMemoryContext, we set it to some special context that we expect we can reset every so often, like at the start of the next transaction. The existing TransactionAbortContext is a very similar thing, and maybe could be repurposed/shared with this usage. Thoughts? regards, tom lane [1] https://www.postgresql.org/message-id/18512-6e89f654d7da884d%40postgresql.org
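A rough sketch of the idea, with an invented context name (the real thing might instead reuse TransactionAbortContext, as suggested above):

    #include "postgres.h"
    #include "utils/memutils.h"

    static MemoryContext InterXactContext = NULL;

    /*
     * Sketch: at commit, park CurrentMemoryContext in a resettable context
     * instead of TopMemoryContext; AtStart_Memory() would reset it.
     */
    static void
    AtCommit_Memory_sketch(void)
    {
        /* ... existing TopTransactionContext teardown elided ... */

        if (InterXactContext == NULL)
            InterXactContext = AllocSetContextCreate(TopMemoryContext,
                                                     "InterXactContext",
                                                     ALLOCSET_DEFAULT_SIZES);

        MemoryContextSwitchTo(InterXactContext);
    }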
Re: FYI: LLVM Runtime Crash
Hi, On 2024-06-17 16:07:49 -0400, David E. Wheeler wrote: > On Jun 17, 2024, at 11:52, David E. Wheeler wrote: > > > I don’t *think* it’s something that can be fixed in Postgres core. This is > > mostly in FYI in case anyone else runs into this issue or knows something > > more about it. > > Okay, a response to the issue[1] says the bug is in Postgres: > > > The error message is LLVM reporting the backend can't handle the particular > > form of "probe-stack" attribute in the input LLVM IR. So this is likely a > > bug in the way postgres is generating LLVM IR: please file a bug against > > Postgres. (Feel free to reopen if you have some reason to believe the issue > > is on the LLVM side.) I suspect the issue might be that the version of clang and LLVM are diverging too far. Does it work if you pass CLANG=/opt/homebrew/opt/llvm/bin/clang to configure? Greetings, Andres Freund
Re: FYI: LLVM Runtime Crash
On Jun 17, 2024, at 16:37, Andres Freund wrote: > I suspect the issue might be that the version of clang and LLVM are diverging > too far. Does it work if you pass CLANG=/opt/homebrew/opt/llvm/bin/clang to > configure? It does! It didn’t occur to me this would be the issue, but I presume /usr/bin/clang is not compatible with the latest LLVM installed from Homebrew. Interesting! I’ll update that issue. Thanks, David
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
Hi Ashutosh, Thinking about this more, could you clarify the problem/issue at hand? I think it's still not clear to me. Yes, CREATE EXTENSION can create functions that lead to unexpected privilege escalation, regardless of whether they are SECURITY DEFINER or SECURITY INVOKER (if the function is inadvertently executed by superuser). But that's also true for a general CREATE FUNCTION call outside of extensions. If I read the commit message I see: > This flag controls PostgreSQL's behavior in setting the implicit > search_path within the proconfig for functions created by an extension > that does not have the search_path explicitly set in proconfig If that's the "general" issue you're trying to solve, I would wonder why we wouldn't for instance be extending the functionality to normal CREATE FUNCTION calls (i.e., schema-qualify based off search_path). Thanks, John H On Thu, Jun 13, 2024 at 1:09 AM Ashutosh Sharma wrote: > > Hi, > > Please find the attached patch addressing all the comments. I kindly > request your review and feedback. Your thoughts are greatly > appreciated. > > -- > With Regards, > Ashutosh Sharma. -- John Hsu - Amazon Web Services
Re: cost delay brainstorming
Hi, On 2024-06-17 15:39:27 -0400, Robert Haas wrote: > As I mentioned in my talk at 2024.pgconf.dev, I think that the biggest > problem with autovacuum as it exists today is that the cost delay is > sometimes too low to keep up with the amount of vacuuming that needs > to be done. I agree it's a big problem, not sure it's *the* problem. But I'm happy to see it improved anyway, so it doesn't really matter. One issue around all of this is that we pretty much don't have the tools to analyze autovacuum behaviour across a larger number of systems in a realistic way :/. I find my own view of what precisely the problem is being heavily swayed by the last few problematic cases I've looked at. > I think we might be able to get fairly far by observing that if the > number of running autovacuum workers is equal to the maximum allowable > number of running autovacuum workers, that may be a sign of trouble, > and the longer that situation persists, the more likely it is that > we're in trouble. So, a very simple algorithm would be: If the maximum > number of workers have been running continuously for more than, say, > 10 minutes, assume we're falling behind and exempt all workers from > the cost limit for as long as the situation persists. One could > criticize this approach on the grounds that it causes a very sudden > behavior change instead of, say, allowing the rate of vacuuming to > gradually increase. I'm curious to know whether other people think > that would be a problem. Another issue is that it's easy to fall behind due to cost limits on systems where autovacuum_max_workers is smaller than the number of busy tables. IME one common situation is to have a single table that's being vacuumed too slowly due to cost limits, with everything else keeping up easily. > I think it might be OK, for a couple of reasons: > > 1. I'm unconvinced that the vacuum_cost_delay system actually prevents > very many problems. I've fixed a lot of problems by telling users to > raise the cost limit, but virtually never by lowering it. When we > lowered the delay by an order of magnitude a few releases ago - > equivalent to increasing the cost limit by an order of magnitude - I > didn't personally hear any complaints about that causing problems. So > disabling the delay completely some of the time might just be fine. I have seen disabling cost limits cause replication setups to fall over because the amount of WAL increases beyond what can be replicated/archived/replayed. It's very easy to reach the issue when syncrep is enabled. > 1a. Incidentally, when I have seen problems because of vacuum running > "too fast", it's not been because it was using up too much I/O > bandwidth, but because it's pushed too much data out of cache too > quickly. A long overnight vacuum can evict a lot of pages from the > system page cache by morning - the ring buffer only protects our > shared_buffers, not the OS cache. I don't think this can be fixed by > rate-limiting vacuum, though: to keep the cache eviction at a level > low enough that you could be certain of not causing trouble, you'd > have to limit it to an extremely low rate which would just cause > vacuuming not to keep up. The cure would be worse than the disease at > that point. I've seen versions of this too. Ironically it's often made way worse by ringbuffers, because even if there is space in shared buffers, we'll not move buffers there, instead putting a lot of pressure on the OS page cache.
> If you like this idea, I'd like to know that, and hear any further > thoughts you have about how to improve or refine it. If you don't, I'd > like to know that, too, and any alternatives you can propose, > especially alternatives that don't require crazy amounts of new > infrastructure to implement. I unfortunately don't know what to do about all of this with just the existing set of metrics :/. One reason that cost limit can be important is that we often schedule autovacuums on tables in useless ways, over and over. Without cost limits providing *some* protection against using up all IO bandwidth / WAL volume, we could end up doing even worse. Common causes of such useless vacuums I've seen: - Longrunning transaction prevents increasing relfrozenxid, we run autovacuum over and over on the same relation, using up the whole cost budget. This is particularly bad because often we'll not autovacuum anything else, building up a larger and larger backlog of actual work. - Tables, where on-access pruning works very well, end up being vacuumed far too frequently, because our autovacuum scheduling doesn't know about tuples having been cleaned up by on-access pruning. - Larger tables with occasional lock conflicts cause autovacuum to be cancelled and restarting from scratch over and over. If that happens before the second table scan, this can easily eat up the whole cost budget without making forward progress. Greetings, Andres Freund
Re: Reducing the log spam
On Thu, May 02, 2024 at 12:47:45PM +0200, Laurenz Albe wrote: > On Mon, 2024-03-11 at 09:33 +0100, Jelte Fennema-Nio wrote: > > - the subscriber's server log. > > + the subscriber's server log if you remove 23505 from > > + . > > > > This seems like a pretty big regression. Being able to know why your > > replication got closed seems pretty critical. > > Yes. But I'd argue that that is a shortcoming of logical replication: > there should be a way to get this information via SQL. Having to look into > the log file is not a very useful option. > > The feature will become much less useful if unique violations keep getting > logged. Uh, to be clear, your patch is changing the *defaults*, which I found surprising, even after reading the thread. Evidently, the current behavior is not what you want, and you want to change it, but I'm *sure* that whatever default you want to use at your site/with your application is going to make someone else unhappy. I surely want unique violations to be logged, for example. > @@ -6892,6 +6892,41 @@ local0.*/var/log/postgresql > > > > + xreflabel="log_suppress_errcodes"> > + log_suppress_errcodes (string) > + > + log_suppress_errcodes configuration > parameter > + > + > + > + > +Causes ERROR and FATAL messages > +from client backend processes with certain error codes to be excluded > +from the log. > +The value is a comma-separated list of five-character error codes as > +listed in . An error code that > +represents a class of errors (ends with three zeros) suppresses > logging > +of all error codes within that class. For example, the entry > +08000 (connection_exception) > +would suppress an error with code 08P01 > +(protocol_violation). The default setting is > +23505,3D000,3F000,42601,42704,42883,42P01,57P03. > +Only superusers and users with the appropriate SET > +privilege can change this setting. > + > + > + > +This setting is useful to exclude error messages from the log that > are > +frequent but irrelevant. I think you should phrase the feature as ".. *allow* skipping error logging for messages ... that are frequent but irrelevant for a given site/role/DB/etc." I suggest that this patch should not change the default behavior at all: its default should be empty. That you, personally, would plan to exclude this or that error code is pretty uninteresting. I think the idea of changing the default behavior will kill the patch, and even if you want to propose to do that, it should be a separate discussion. Maybe you should make it an 002 patch. > + {"log_suppress_errcodes", PGC_SUSET, LOGGING_WHEN, > + gettext_noop("ERROR and FATAL messages with these error > codes don't get logged."), > + NULL, > + GUC_LIST_INPUT > + }, > + &log_suppress_errcodes, > + DEFAULT_LOG_SUPPRESS_ERRCODES, > + check_log_suppress_errcodes, assign_log_suppress_errcodes, NULL > +/* > + * Default value for log_suppress_errcodes. ERROR or FATAL messages with > + * these error codes are never logged. Error classes (error codes ending > with > + * three zeros) match all error codes in the class. The idea is to suppress > + * messages that usually don't indicate a serious problem but tend to pollute > + * the log file.
> + */ > + > +#define DEFAULT_LOG_SUPPRESS_ERRCODES > "23505,3D000,3F000,42601,42704,42883,42P01,57P03" > +
../src/backend/utils/errcodes.txt:23505    E    ERRCODE_UNIQUE_VIOLATION    unique_violation
../src/backend/utils/errcodes.txt:3D000    E    ERRCODE_INVALID_CATALOG_NAME    invalid_catalog_name
../src/backend/utils/errcodes.txt:3F000    E    ERRCODE_INVALID_SCHEMA_NAME    invalid_schema_name
../src/backend/utils/errcodes.txt:42601    E    ERRCODE_SYNTAX_ERROR    syntax_error
../src/backend/utils/errcodes.txt:3D000    E    ERRCODE_UNDEFINED_DATABASE
../src/backend/utils/errcodes.txt:42883    E    ERRCODE_UNDEFINED_FUNCTION    undefined_function
../src/backend/utils/errcodes.txt:3F000    E    ERRCODE_UNDEFINED_SCHEMA
../src/backend/utils/errcodes.txt:42P01    E    ERRCODE_UNDEFINED_TABLE    undefined_table
../src/backend/utils/errcodes.txt:42704    E    ERRCODE_UNDEFINED_OBJECT    undefined_object
../src/backend/utils/errcodes.txt:57P03    E    ERRCODE_CANNOT_CONNECT_NOW    cannot_connect_now
-- Justin
Re: Xact end leaves CurrentMemoryContext = TopMemoryContext
Hi, On 2024-06-17 16:37:05 -0400, Tom Lane wrote: > As to what to do about it: I'm imagining that instead of resetting > CurrentMemoryContext to TopMemoryContext, we set it to some special > context that we expect we can reset every so often, like at the start > of the next transaction. The existing TransactionAbortContext is > a very similar thing, and maybe could be repurposed/shared with this > usage. One issue is that that could lead to hard to find use-after-free issues in currently working code. Right now allocations made "between transactions" live forever, if we just use a different context it won't anymore. Particularly if the reset is only occasional, we'd make it hard to find buggy allocations. I wonder if we ought to set CurrentMemoryContext to NULL in that timeframe, forcing code to explicitly choose what lifetime is needed, rather than just defaulting such code into changed semantics. Greetings, Andres Freund
Re: cost delay brainstorming
On Mon, Jun 17, 2024 at 3:39 PM Robert Haas wrote: > So, a very simple algorithm would be: If the maximum number of workers > have been running continuously for more than, say, > 10 minutes, assume we're falling behind Hmm, I don't know about the validity of this. I've seen plenty of cases where we hit the max workers but all is just fine. On the other hand, I don't have an alternative trigger point yet. But I do overall like the idea of dynamically changing the delay. And agree it is pretty conservative. > 2. If we decided to gradually increase the rate of vacuuming instead of > just removing the throttling all at once, what formula would we use > and why would that be the right idea? Well, since the idea of disabling the delay is on the table, we could raise the cost every minute by X% until we effectively reach an infinite cost / zero delay situation. I presume this would only affect currently running vacs, and future ones would get the default cost until things get triggered again? Cheers, Greg
Re: may be a buffer overflow problem
> On 14 Jun 2024, at 17:18, Tom Lane wrote: > > I wrote: >> Seeing that this code is exercised thousands of times a day in the >> regression tests and has had a failure rate of exactly zero (and >> yes, the tests do check the output), there must be some reason >> why it's okay. > > After looking a little closer, I think the reason why it works in > practice is that there's always a few bytes of zero padding at the > end of struct sqlca_t. > > I don't see any value in changing individual code sites that are > depending on that, because there are surely many more, both in > our own code and end users' code. What I suggest we ought to do > is formalize the existence of that zero pad. Perhaps like this: > > char sqlstate[5]; > + char sqlstatepad; /* nul terminator for sqlstate */ > }; > > Another way could be to change > > - char sqlstate[5]; > + char sqlstate[6]; > > but I fear that could have unforeseen consequences in code that > is paying attention to sizeof(sqlstate). Since sqlca is, according to our docs, present in other database systems we should probably keep it a 5-char array for portability reasons. Adding a padding character should be fine though. The attached adds padding and adjusts the tests and documentation to match. I kept the fprintf using %.*s to match other callers. I don't know ECPG well enough to have strong feelings wrt this being the right fix or not, and the age of incorrect assumptions around that fprintf hints at this not being a huge problem in reality (should still be fixed of course). -- Daniel Gustafsson ecgp_sqlstate-v2.diff Description: Binary data
Re: jsonpath: Missing regex_like && starts with Errors?
On Jun 16, 2024, at 11:52, David E. Wheeler wrote: > I think that’s how it should be; I prefer that it raises errors by default > but you can silence them: > > david=# select jsonb_path_query(target => '{"x": "hi"}', path => > '$.integer()', silent => false); > ERROR: jsonpath item method .integer() can only be applied to a string or > numeric value > > david=# select jsonb_path_query(target => '{"x": "hi"}', path => > '$.integer()', silent => true); > jsonb_path_query > -- > (0 rows) > > I suggest that the same behavior be adopted for `like_regex` and `starts > with`. Okay, I think I’ve figured this out, and the key is that I am, once again, comparing predicate path queries to SQL standard queries. If I update the first example to use a comparison I no longer get an error: david=# select jsonb_path_query('{"x": "hi"}', '$.integer() == 1'); jsonb_path_query -- null So I think that’s the key: There’s not a difference between the behavior of `like_regex` and `starts with` vs other predicate expressions. This dichotomy continues to annoy. I would very much like some way to have jsonb_path_query() raise an error (or even a warning!) if passed a predicate expression, and for jsonb_path_match() to raise an error or warning if its path is not a predicate expression. Because I keep confusing TF out of myself. Best, David
Re: Proposal: Document ABI Compatibility
On Jun 12, 2024, at 11:20, Andres Freund wrote: >> The PostgreSQL core team maintains two application binary interface (ABI) >> guarantees: one for major releases and one for minor releases. > > I.e. for major versions it's "there is none"? Is it? ISTM that there is the intention not to break things that don’t need to be broken, though that doesn’t rule out interface improvements. >> In such cases the incompatible changes will be listed in the Release Notes. > > I don't think we actually exhaustively list all of them. Should they be? I can maybe see the argument not to for major releases. But I’ve also had the experience of a new failure on a major release and having to go find the reason for it and the fix, often requiring the attention of someone on a mailing list who might rather tap the “compatibility changes” sign in the latest change log. :-) > s/every/a reasonable/ or just s/every/an/ ✅ > The padding case doesn't affect sizeof() fwiw. ✅ > I think there's too often not an alternative to using sizeof(), potentially > indirectly (via makeNode() or such. So this sounds a bit too general. Is there some other way to avoid the issue? Or would constructor APIs need to be added to core? Updated with your suggestions: ``` md ABI Policy == The PostgreSQL core team maintains two application binary interface (ABI) guarantees: one for major releases and one for minor releases. Major Releases -- Applications that use the PostgreSQL APIs must be compiled for each major release supported by the application. The inclusion of `PG_MODULE_MAGIC` ensures that code compiled for one major version will be rejected by other major versions. Furthermore, new releases may make API changes that require code changes. Use the `PG_VERSION_NUM` constant to adjust code in a backwards compatible way: ``` c #if PG_VERSION_NUM >= 160000 #include "varatt.h" #endif ``` PostgreSQL avoids unnecessary API changes in major releases, but usually ships a few necessary API changes, including deprecation, renaming, and argument variation. In such cases the incompatible changes will be listed in the Release Notes. Minor Releases -- PostgreSQL makes an effort to avoid both API and ABI breaks in minor releases. In general, an application compiled against any minor release will work with any other minor release, past or future. When a change *is* required, PostgreSQL will choose the least invasive way possible, for example by squeezing a new field into padding space or appending it to the end of a struct. This sort of change should not impact dependent applications unless they use `sizeof(the struct)` on a struct with an appended field, or create their own instances of such structs --- patterns best avoided. In rare cases, however, even such non-invasive changes may be impractical or impossible. In such an event, the change will be documented in the Release Notes, and details on the issue will also be posted to [TBD; mail list? Blog post? News item?]. The project strongly recommends that developers adopt continuous integration testing at least for the latest minor release of all major versions of Postgres they support. ``` Best, David
Re: Proposal: Document ABI Compatibility
On Jun 12, 2024, at 11:30, Peter Geoghegan wrote: > I'm a little surprised that we don't seem to have all that many > problems with ABI breakage, though. Although we theoretically have a > huge number of APIs that extension authors might choose to use, that > isn't really true in practical terms. The universe of theoretically > possible problems is vastly larger than the areas where we see > problems in practice. You have to be pragmatic about it. Things go wrong far less often than one might fear! Given this relative stability, I think it’s reasonable to document what heretofore assumed the policy is so that the fears can largely be put to rest by clear expectations. Best, David
Re: Proposal: Document ABI Compatibility
On Jun 12, 2024, at 11:57, Peter Geoghegan wrote: > That having been said, it would be useful if there was a community web > resource for this -- something akin to coverage.postgresql.org, but > with differential ABI breakage reports. You can see an example report > here: > > https://postgr.es/m/CAH2-Wzm-W6hSn71sUkz0Rem=qdeu7tnfmc7_jg2djrlfef_...@mail.gmail.com > > Theoretically anybody can do this themselves. In practice they don't. > So something as simple as providing automated reports about ABI > changes might well move the needle here. What would be required to make such a thing? Maybe it’d make a good Summer of Code project. Best, David
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.
> This is extremely workload dependent, it's not hard to find workloads with > lots of very small records and very few big ones... What you observed might > have "just" been the warmup behaviour where more full page writes have to > be written. Can you tell me how to avoid capturing this "warm-up" so that the numbers are more accurate? > There's a very frequent call computing COMP_CRC32C over just 20 bytes, while > holding a crucial lock. If we were to introduce something like this > AVX-512 algorithm, it'd probably be worth dispatching differently in case of > compile-time known small lengths. So are you suggesting that we be able to call directly into the 64/32-bit based algorithm from these known small-byte cases in the code? I think that we can do that with a separate API being exposed. > How does the latency of the AVX-512 algorithm compare to just using the > CRC32C instruction? I think I need more information on this one as I am not sure I understand the use case? The same function pointer indirect methods are used with or without the AVX-512 algorithm? Paul
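One possible shape of the compile-time dispatch mentioned in the quoted text, purely as a sketch: route small, constant-length inputs (such as the ~20-byte case) to an inlined CRC32C-instruction helper and everything else through the existing function-pointer path. COMP_CRC32C_FAST and pg_comp_crc32c_small are invented names, not an existing API:

    #include "postgres.h"
    #include "port/pg_crc32c.h"

    /* hypothetical inlined helper for short, constant-length inputs */
    extern pg_crc32c pg_comp_crc32c_small(pg_crc32c crc, const void *data, size_t len);

    #define COMP_CRC32C_FAST(crc, data, len) \
        do { \
            if (__builtin_constant_p(len) && (len) <= 64) \
                (crc) = pg_comp_crc32c_small((crc), (data), (len)); \
            else \
                COMP_CRC32C((crc), (data), (len)); \
        } while (0)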
Re: cost delay brainstorming
On Tue, 18 Jun 2024 at 07:39, Robert Haas wrote: > I think we might able to get fairly far by observing that if the > number of running autovacuum workers is equal to the maximum allowable > number of running autovacuum workers, that may be a sign of trouble, > and the longer that situation persists, the more likely it is that > we're in trouble. So, a very simple algorithm would be: If the maximum > number of workers have been running continuously for more than, say, > 10 minutes, assume we're falling behind and exempt all workers from > the cost limit for as long as the situation persists. One could > criticize this approach on the grounds that it causes a very sudden > behavior change instead of, say, allowing the rate of vacuuming to > gradually increase. I'm curious to know whether other people think > that would be a problem. I think a nicer solution would implement some sort of unified "urgency level" and slowly ramp up the vacuum_cost_limit according to that urgency rather than effectively switching to an infinite vacuum_cost_limit when all workers have been going for N mins. If there is nothing else that requires a vacuum while all 3 workers have been busy for an hour or two, it seems strange to hurry them up so they can more quickly start their next task -- being idle. An additional feature that having this unified "urgency level" will provide is the ability to prioritise auto-vacuum so that it works on the most urgent tables first. I outlined some ideas in [1] of how this might be done. David [1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7yt...@mail.gmail.com
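As a toy sketch of the ramp-up, the effective limit could be interpolated from an urgency factor in [0.0, 1.0]; the names and the linear formula here are invented for illustration, not taken from any patch:

    #include "postgres.h"

    /* urgency 0.0 => configured limit, urgency 1.0 => max_limit */
    static int
    av_effective_cost_limit(int base_limit, int max_limit, double urgency)
    {
        urgency = Max(0.0, Min(1.0, urgency));
        return base_limit + (int) ((max_limit - base_limit) * urgency);
    }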
Re: jsonpath: Missing regex_like && starts with Errors?
On 06/17/24 18:14, David E. Wheeler wrote: > So I think that’s the key: There’s not a difference between the behavior of > `like_regex` and `starts with` vs other predicate expressions. The current implementation seems to have made each of our <JSON path predicate>s responsible for swallowing its own errors, which is one perfectly cromulent way to satisfy the SQL standard behavior saying all errors within a <JSON filter expression> should be swallowed. The standard says nothing on how they should behave outside of a <JSON filter expression>, because as far as the standard's concerned, they can't appear there. Ours currently behave the same way, and swallow their errors. It would have been possible to write them in such a way as to raise errors, but not when inside a <JSON filter expression>, and that would also satisfy the standard, but it would also give us the errors you would like from our nonstandard "predicate check expressions". And then you could easily use silent => true if you wanted them silent. I'd be leery of changing that, though, as we've already documented that a "predicate check expression" returns true, false, or unknown, so having it throw by default seems like a change of documented behavior. The current situation can't make much use of 'silent', since it's already false by default; you can't make it any falser to make predicate-check errors show up. Would it be a thinkable thought to change the 'silent' default to null? That could have the same effect as false for SQL standard expressions, and the same effect seen now for "predicate check expressions", and you could pass it explicitly false if you wanted errors from the predicate checks. If that's no good, I don't see an obvious solution other than adding another nonstandard construct to what's nonstandard already, and allowing something like nonsilent(predicate check expression). Regards, -Chap
Re: jsonpath: Missing regex_like && starts with Errors?
On Jun 17, 2024, at 6:44 PM, Chapman Flack wrote: > The current implementation seems to have made each of our > <JSON path predicate>s responsible for swallowing its own errors, which > is one perfectly cromulent way to satisfy the SQL standard behavior saying > all errors within a <JSON filter expression> should be swallowed. Naw, executePredicate does it for all of them, as for the left operand here[1]. > The standard says nothing on how they should behave outside of a > <JSON filter expression>, because as far as the standard's concerned, > they can't appear there. > > Ours currently behave the same way, and swallow their errors. Yes, and they’re handled consistently, at least. > It would have been possible to write them in such a way as to raise errors, > but not when inside a <JSON filter expression>, and that would also satisfy > the standard, but it would also give us the errors you would like from our > nonstandard "predicate check expressions". And then you could easily use > silent => true if you wanted them silent. I’m okay without the errors, as long as the behaviors are consistent. I mean it might be cool to have a way to get them, but the inconsistency I thought I saw was the bit that seemed like a bug. > I'd be leery of changing that, though, as we've already documented that > a "predicate check expression" returns true, false, or unknown, so having > it throw by default seems like a change of documented behavior. Right, same for using jsonb_path_match(). > The current situation can't make much use of 'silent', since it's already > false by default; you can't make it any falser to make predicate-check > errors show up. EXTREMELY FALSE! 😂 > Would it be a thinkable thought to change the 'silent' default to null? > That could have the same effect as false for SQL standard expressions, and > the same effect seen now for "predicate check expressions", and you could > pass it explicitly false if you wanted errors from the predicate checks. That seems like it’d be confusing TBH. > If that's no good, I don't see an obvious solution other than adding > another nonstandard construct to what's nonstandard already, and allowing > something like nonsilent(predicate check expression). The only options I can think of are a GUC to turn on SUPER STRICT mode or something (yuck, action at a distance) or introduce new functions with the new behavior. I advocate for neither (at this point). Best, David [1]: https://github.com/postgres/postgres/blob/82ed67a/src/backend/utils/adt/jsonpath_exec.c#L2058-L2059
Re: Inval reliability, especially for inplace updates
On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote: > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote: > > https://postgr.es/m/20240512232923.aa.nmi...@google.com wrote: > > > Separable, nontrivial things not fixed in the attached patch stack: > > > > > > - Inplace update uses transactional CacheInvalidateHeapTuple(). ROLLBACK > > > of > > > CREATE INDEX wrongly discards the inval, leading to the relhasindex=t > > > loss > > > still seen in inplace-inval.spec. CacheInvalidateRelmap() does this > > > right. > > > > I plan to fix that like CacheInvalidateRelmap(): send the inval immediately, > > inside the critical section. Send it in heap_xlog_inplace(), too. > > > a. Within logical decoding, cease processing invalidations for inplace > > I'm attaching the implementation. This applies atop the v3 patch stack from > https://postgr.es/m/20240614003549.c2.nmi...@google.com, but the threads are > mostly orthogonal and intended for independent review. Translating a tuple > into inval messages uses more infrastructure than relmapper, which needs just > a database ID. Hence, this ended up more like a miniature of inval.c's > participation in the transaction commit sequence. > > I waffled on whether to back-patch inplace150-inval-durability-atcommit That inplace150 patch turned out to be unnecessary. Contrary to the "noncritical resource releasing" comment some lines above AtEOXact_Inval(true), the actual behavior is already to promote ERROR to PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot abort transaction %u, it was already committed". Since inplace130-AtEOXact_RelationCache-comments existed to clear the way for inplace150, inplace130 also becomes unnecessary. I've removed both from the attached v2 patch stack. > The patch changes xl_heap_inplace of XLOG_HEAP_INPLACE. For back branches, we > could choose between: > > - Same change, no WAL version bump. Standby must update before primary. This > is best long-term, but the transition is more disruptive. I'm leaning > toward this one, but the second option isn't bad: > > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on > every backend. This is more wasteful, but inplace updates might be rare > enough (~once per VACUUM) to make it tolerable. > > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't > correct if one ends recovery between the two records, but you'd need to be > unlucky to notice. Noticing would need a procedure like the following. A > hot standby backend populates a relcache entry, then does DDL on the rel > after recovery ends. That still holds. Author: Noah Misch Commit: Noah Misch Remove comment about xl_heap_inplace "AT END OF STRUCT". Commit 2c03216d831160bedd72d45f712601b6f7d03f1c moved the tuple data from there to the buffer-0 data. Back-patch to v12 (all supported versions), the plan for the next change to this struct. Reviewed by FIXME. 
Discussion: https://postgr.es/m/20240523000548.58.nmi...@google.com diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index 22a1747..42736f3 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -425,7 +425,6 @@ typedef struct xl_heap_confirm typedef struct xl_heap_inplace { OffsetNumber offnum;/* updated tuple's offset on page */ - /* TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_inplace; #define SizeOfHeapInplace (offsetof(xl_heap_inplace, offnum) + sizeof(OffsetNumber)) Author: Noah Misch Commit: Noah Misch For inplace update, send nontransactional invalidations. The inplace update survives ROLLBACK. The inval didn't, so another backend's DDL could then update the row without incorporating the inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER TABLE resulted in a table with an index, yet relhasindex=f. That is a source of index corruption. Core code no longer needs XLOG_INVALIDATIONS, but this leaves removing it for future work. Extensions could be relying on that mechanism, so that removal would not be back-patch material. Back-patch to v12 (all supported versions). Reviewed by FIXME. Discussion: https://postgr.es/m/20240523000548.58.nmi...@google.com diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 797bddf..d7e417f 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -6305,6 +6305,9 @@ heap_inplace_update_finish(void *state, HeapTuple tuple) Relationrelation = scan->heap_rel; uint32 oldlen; uint32 newlen; + int nmsgs = 0; + SharedInvalidationMessage *invalMessages = NULL; + boolRelcacheInitFileInval = false; Assert(ItemPointer
Re: Separate HEAP WAL replay logic into its own file
> On Jun 17, 2024, at 23:01, Melanie Plageman wrote: > > On Mon, Jun 17, 2024 at 2:20 AM Li, Yong wrote: >> >> Hi PostgreSQL hackers, >> >> For most access methods in PostgreSQL, the implementation of the access >> method itself and the implementation of its WAL replay logic are organized >> in separate source files. However, the HEAP access method is an exception. >> Both the access method and the WAL replay logic are collocated in the same >> heapam.c. To follow the pattern established by other access methods and to >> improve maintainability, I made the enclosed patch to separate HEAP’s replay >> logic into its own file. The changes are straightforward. Move the replay >> related functions into the new heapam_xlog.c file, push the common >> heap_execute_freeze_tuple() helper function into the heapam.h header, and >> adjust the build files. > > I'm not against this change, but I am curious at what inspired this. > Were you looking at Postgres code and simply noticed that there isn't > a heapam_xlog.c (like there is a nbtxlog.c etc) and thought that you > wanted to change that? Or is there some specific reason this would > help you as a Postgres developer, user, or ecosystem member? > > - Melanie As a newcomer, when I was walking through the code looking for WAL replay related code, it was relatively easy for me to find them for the B-Tree access method because of the “xlog” hint in the file names. It took me a while to find the same for the heap access method. When I finally found them (via text search), it was a small surprise. Having different file organizations for different access methods gives me this urge to make everything consistent. I think it will make it easier for newcomers, and it will reduce the mental load for everyone to remember that heap replay is inside heapam.c, not some “???xlog.c”. Yong
Re: Check lateral references within PHVs for memoize cache keys
On Mon, Mar 18, 2024 at 4:36 PM Richard Guo wrote: > Here is another rebase over master so it applies again. I also added a > commit message to help review. Nothing else has changed. AFAIU currently we do not add Memoize nodes on top of join relation paths. This is because the ParamPathInfos for join relation paths do not maintain ppi_clauses, as the set of relevant clauses varies depending on how the join is formed. In addition, joinrels do not maintain lateral_vars. So we do not have a way to extract cache keys from joinrels. (Besides, there are places where the code doesn't cope with Memoize path on top of a joinrel path, such as in get_param_path_clause_serials.) Therefore, when extracting lateral references within PlaceHolderVars, there is no need to consider those that are due to be evaluated at joinrels. Hence, here is v7 patch for that. In passing, this patch also includes a comment explaining that Memoize nodes are currently not added on top of join relation paths (maybe we should have a separate patch for this?). Thanks Richard v7-0001-Check-lateral-refs-within-PHVs-for-memoize-cache-keys.patch Description: Binary data
Re: Inval reliability, especially for inplace updates
Hi, On 2024-06-17 16:58:54 -0700, Noah Misch wrote: > On Sat, Jun 15, 2024 at 03:37:18PM -0700, Noah Misch wrote: > > On Wed, May 22, 2024 at 05:05:48PM -0700, Noah Misch wrote: > > > https://postgr.es/m/20240512232923.aa.nmi...@google.com wrote: > > > > Separable, nontrivial things not fixed in the attached patch stack: > > > > > > > > - Inplace update uses transactional CacheInvalidateHeapTuple(). > > > > ROLLBACK of > > > > CREATE INDEX wrongly discards the inval, leading to the relhasindex=t > > > > loss > > > > still seen in inplace-inval.spec. CacheInvalidateRelmap() does this > > > > right. > > > > > > I plan to fix that like CacheInvalidateRelmap(): send the inval > > > immediately, > > > inside the critical section. Send it in heap_xlog_inplace(), too. I'm worried this might cause its own set of bugs, e.g. if there are any places that, possibly accidentally, rely on the invalidation from the inplace update to also cover separate changes. Have you considered instead submitting these invalidations during abort as well? > > > a. Within logical decoding, cease processing invalidations for inplace > > > > I'm attaching the implementation. This applies atop the v3 patch stack from > > https://postgr.es/m/20240614003549.c2.nmi...@google.com, but the threads are > > mostly orthogonal and intended for independent review. Translating a tuple > > into inval messages uses more infrastructure than relmapper, which needs > > just > > a database ID. Hence, this ended up more like a miniature of inval.c's > > participation in the transaction commit sequence. > > > > I waffled on whether to back-patch inplace150-inval-durability-atcommit > > That inplace150 patch turned out to be unnecessary. Contrary to the > "noncritical resource releasing" comment some lines above > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to > PANIC. An ERROR just before or after sending invals becomes PANIC, "cannot > abort transaction %u, it was already committed". Relying on that, instead of explicit critical sections, seems fragile to me. IIRC some of the behaviour around errors around transaction commit/abort has changed a bunch of times. Tying correctness into something that could be changed for unrelated reasons doesn't seem great. I'm not sure it holds true even today - what if the transaction didn't have an xid? Then RecordTransactionAbort() wouldn't trigger "cannot abort transaction %u, it was already committed" I think? > > - Same change, no WAL version bump. Standby must update before primary. > > This > > is best long-term, but the transition is more disruptive. I'm leaning > > toward this one, but the second option isn't bad: Hm. The inplace record doesn't use the length of the "main data" record segment for anything, from what I can tell. If records by an updated primary were replayed by an old standby, it'd just ignore the additional data, afaict? I think with the code as-is, the situation with an updated standby replaying an old primary's record would actually be worse - it'd afaict just assume the now-longer record contained valid fields, despite those just pointing into uninitialized memory. I think the replay routine would have to check the length of the main data and execute the invalidation conditionally. > > - heap_xlog_inplace() could set the shared-inval-queue overflow signal on > > every backend. This is more wasteful, but inplace updates might be rare > > enough (~once per VACUUM) to make it tolerable. 
We already set that surprisingly frequently, as a) The size of the sinval queue is small b) If a backend is busy, it does not process catchup interrupts (i.e. executing queries, waiting for a lock prevents processing) c) There's no deduplication of invals, we often end up sending the same inval over and over. So I suspect this might not be too bad, compared to the current badness. At least for core code. I guess there could be extension code triggering inplace updates more frequently? But I'd hope they'd do it not on catalog tables... Except that we wouldn't know that that's the case during replay, it's not contained in the record. > > - Use LogStandbyInvalidations() just after XLOG_HEAP_INPLACE. This isn't > > correct if one ends recovery between the two records, but you'd need to be > > unlucky to notice. Noticing would need a procedure like the following. A > > hot standby backend populates a relcache entry, then does DDL on the rel > > after recovery ends. Hm. The problematic cases presumably involves an access exclusive lock? If so, could we do LogStandbyInvalidations() *before* logging the WAL record for the inplace update? The invalidations can't be processed by other backends until the exclusive lock has been released, which should avoid the race? Greetings, Andres Freund
Re: Logical Replication of sequences
On Fri, Jun 14, 2024 at 4:04 PM Amit Kapila wrote: > > On Thu, Jun 13, 2024 at 6:14 PM Masahiko Sawada wrote: > > > > On Thu, Jun 13, 2024 at 7:06 PM Amit Kapila wrote: > > > > > > On Thu, Jun 13, 2024 at 1:09 PM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Jun 12, 2024 at 6:59 PM Amit Kapila > > > > wrote: > > > > > > > > > > > > > > > Yeah, starting with a single worker sounds good for now. Do you think > > > > > we should sync all the sequences in a single transaction or have some > > > > > threshold value above which a different transaction would be required > > > > > or maybe a different sequence sync worker altogether? Now, having > > > > > multiple sequence-sync workers requires some synchronization so that > > > > > only a single worker is allocated for one sequence. > > > > > > > > > > The simplest thing is to use a single sequence sync worker that syncs > > > > > all sequences in one transaction but with a large number of sequences, > > > > > it could be inefficient. OTOH, I am not sure if it would be a problem > > > > > in reality. > > > > > > > > I think that we can start with using a single worker and one > > > > transaction, and measure the performance with a large number of > > > > sequences. > > > > > > > > > > Fair enough. However, this raises the question Dilip and Vignesh are > > > discussing whether we need a new relfilenode for sequence update even > > > during initial sync? As per my understanding, the idea is that similar > > > to tables, the CREATE SUBSCRIPTION command (with copy_data = true) > > > will create the new sequence entries in pg_subscription_rel with the > > > state as 'i'. Then the sequence-sync worker would start a transaction > > > and one-by-one copy the latest sequence values for each sequence (that > > > has state as 'i' in pg_subscription_rel) and mark its state as ready > > > 'r' and commit the transaction. Now if there is an error during this > > > operation it will restart the entire operation. The idea of creating a > > > new relfilenode is to handle the error so that if there is a rollback, > > > the sequence state will be rolled back to 'i' and the sequence value > > > will also be rolled back. The other option could be that we update the > > > sequence value without a new relfilenode and if the transaction rolled > > > back then only the sequence's state will be rolled back to 'i'. This > > > would work with a minor inconsistency that sequence values will be > > > up-to-date even when the sequence state is 'i' in pg_subscription_rel. > > > I am not sure if that matters because anyway, they can quickly be > > > out-of-sync with the publisher again. > > > > I think it would be fine in many cases even if the sequence value is > > up-to-date even when the sequence state is 'i' in pg_subscription_rel. > > But the case we would like to avoid is where suppose the sequence-sync > > worker does both synchronizing sequence values and updating the > > sequence states for all sequences in one transaction, and if there is > > an error we end up retrying the synchronization for all sequences. > > > > The one idea to avoid this is to update sequences in chunks (say 100 > or some threshold number of sequences in one transaction). Then we > would only redo the sync for the last and pending set of sequences. That could be one idea. > > > > > > > Now, say we don't want to maintain the state of sequences for initial > > > sync at all then after the error how will we detect if there are any > > > pending sequences to be synced? 
One possibility is that we maintain a > > > subscription level flag 'subsequencesync' in 'pg_subscription' to > > > indicate whether sequences need sync. This flag would indicate whether > > > to sync all the sequences in pg_susbcription_rel. This would mean that > > > if there is an error while syncing the sequences we will resync all > > > the sequences again. This could be acceptable considering the chances > > > of error during sequence sync are low. The benefit is that both the > > > REFRESH PUBLICATION SEQUENCES and CREATE SUBSCRIPTION can use the same > > > idea and sync all sequences without needing a new relfilenode. Users > > > can always refer 'subsequencesync' flag in 'pg_subscription' to see if > > > all the sequences are synced after executing the command. > > > > I think that REFRESH PUBLICATION {SEQUENCES} can be executed even > > while the sequence-sync worker is synchronizing sequences. In this > > case, the worker might not see new sequences added by the concurrent > > REFRESH PUBLICATION {SEQUENCES} command since it's already running. > > The worker could end up marking the subsequencesync as completed while > > not synchronizing these new sequences. > > > > This is possible but we could avoid REFRESH PUBLICATION {SEQUENCES} by > not allowing to change the subsequencestate during the time > sequence-worker is syncing the sequences. This could be restrictive > but there doesn't seem to be cases where user would like to
RE: Conflict Detection and Resolution
On Thursday, June 13, 2024 2:11 PM Masahiko Sawada wrote: Hi, > On Wed, Jun 5, 2024 at 3:32 PM Zhijie Hou (Fujitsu) > wrote: > > > > This time at PGconf.dev[1], we had some discussions regarding this > > project. The proposed approach is to split the work into two main > > components. The first part focuses on conflict detection, which aims > > to identify and report conflicts in logical replication. This feature > > will enable users to monitor the unexpected conflicts that may occur. > > The second part involves the actual conflict resolution. Here, we will > > provide built-in resolutions for each conflict and allow user to > > choose which resolution will be used for which conflict(as described > > in the initial email of this thread). > > I agree with this direction that we focus on conflict detection (and > logging) first and then develop conflict resolution on top of that. Thanks for your reply ! > > > > > Of course, we are open to alternative ideas and suggestions, and the > > strategy above can be changed based on ongoing discussions and > > feedback received. > > > > Here is the patch of the first part work, which adds a new parameter > > detect_conflict for CREATE and ALTER subscription commands. This new > > parameter will decide if subscription will go for conflict detection. > > By default, conflict detection will be off for a subscription. > > > > When conflict detection is enabled, additional logging is triggered in > > the following conflict scenarios: > > > > * updating a row that was previously modified by another origin. > > * The tuple to be updated is not found. > > * The tuple to be deleted is not found. > > > > While there exist other conflict types in logical replication, such as > > an incoming insert conflicting with an existing row due to a primary > > key or unique index, these cases already result in constraint violation > > errors. > > What does detect_conflict being true actually mean to users? I understand that > detect_conflict being true could introduce some overhead to detect conflicts. > But in terms of conflict detection, even if detect_confict is false, we detect > some conflicts such as concurrent inserts with the same key. Once we > introduce the complete conflict detection feature, I'm not sure there is a > case > where a user wants to detect only some particular types of conflict. > > > Therefore, additional conflict detection for these cases is currently > > omitted to minimize potential overhead. However, the pre-detection for > > conflict in these error cases is still essential to support automatic > > conflict resolution in the future. > > I feel that we should log all types of conflict in an uniform way. For > example, > with detect_conflict being true, the update_differ conflict is reported as > "conflict %s detected on relation "%s"", whereas concurrent inserts with the > same key is reported as "duplicate key value violates unique constraint "%s"", > which could confuse users. Do you mean it's ok to add a pre-check before applying the INSERT, which will verify if the remote tuple violates any unique constraints, and if it violates then we log a conflict message ? I thought about this but was slightly worried about the extra cost it would bring. OTOH, if we think it's acceptable, we could do that since the cost is there only when detect_conflict is enabled. I also thought of logging such a conflict message in pg_catch(), but I think we lack some necessary info(relation, index name, column name) at the catch block. Best Regards, Hou zj
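For readers following along, usage of the proposed option would presumably look something like the following; detect_conflict comes from the in-progress patch discussed here and may be renamed or reworked before commit, and the connection string and object names are placeholders:

    CREATE SUBSCRIPTION sub_conflict
        CONNECTION 'host=publisher dbname=postgres'
        PUBLICATION pub_all
        WITH (detect_conflict = true);

    ALTER SUBSCRIPTION sub_conflict SET (detect_conflict = false);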
Re: jsonpath: Missing regex_like && starts with Errors?
On 06/17/24 19:17, David E. Wheeler wrote: > [1]: > https://github.com/postgres/postgres/blob/82ed67a/src/backend/utils/adt/jsonpath_exec.c#L2058-L2059 Huh, I just saw something peculiar, skimming through the code: https://github.com/postgres/postgres/blob/82ed67a/src/backend/utils/adt/jsonpath_exec.c#L1385 We allow .boolean() applied to a jbvBool, a jbvString (those are the only two possibilities allowed by the standard), or to a jbvNumeric (!), but only if it can be serialized and then parsed as an int4, otherwise we say ERRCODE_NON_NUMERIC_SQL_JSON_ITEM, or if it survived all that we call it true if it isn't zero. I wonder what that alternative is doing there. It also looks like the expected errcode (in the standard, if the item was not boolean or string) would be 2202V "non-boolean SQL/JSON item" ... which isn't in our errcodes.txt. Regards, -Chap
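A quick way to see the numeric branch being described, assuming a current master (17devel) build with the new jsonpath methods; the commented results are inferred from the cited code rather than tested:

    SELECT jsonb_path_query('[1, 0, "true"]', '$[*].boolean()');
    -- presumably: true, false, true (numerics go through the int4 round-trip)

    SELECT jsonb_path_query('[2.5]', '$[*].boolean()');
    -- presumably fails the int4 parse and raises the "non-numeric" error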
Re: assertion failure at cost_memoize_rescan()
2024年6月17日(月) 8:27 David Rowley : > > On Mon, 17 Jun 2024 at 10:23, Tomas Vondra > wrote: > > Interesting. Seems like a bug due to the two places clamping the values > > inconsistently. It probably does not matter in other contexts because we > > don't subtract the values like this, but here it triggers the assert. > > > > I guess the simplest fix would be to clamp "calls" the same way before > > calculating hit_ratio. That makes the ">= 0" part of the assert somewhat > > pointless, though. > > "calls" comes from the value passed as the final parameter in > create_memoize_path(). > > There's really only one call to that function and that's in > get_memoize_path(). > > return (Path *) create_memoize_path(root, > innerrel, > inner_path, > param_exprs, > hash_operators, > extra->inner_unique, > binary_mode, > outer_path->rows); > > It would be good to know what type of Path outer_path is. Normally > we'll clamp_row_est() on that field. I suspect we must have some Path > type that isn't doing that. > > KaiGai-san, what type of Path is outer_path? > It is CustomPath with rows = 3251872.91667. (I'm not certain whether the non-integer value in the estimated rows is legal or not.) According to the crash dump, try_partial_nestloop_path() takes this two paths, (gdb) up #7 0x00860fc5 in try_partial_nestloop_path (root=0x133a968, joinrel=0x15258d8, outer_path=0x1513c98, inner_path=0x15264c8, pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffc494cddf0) at joinpath.c:887 /home/kaigai/source/pgsql-16/src/backend/optimizer/path/joinpath.c:887:30713:beg:0x860fc5 (gdb) p *(CustomPath *)outer_path $13 = {path = {type = T_CustomPath, pathtype = T_CustomScan, parent = 0x1513058, pathtarget = 0x1513268, param_info = 0x0, parallel_aware = true, parallel_safe = true, parallel_workers = 2, rows = 3251872.91667, startup_cost = 41886.75250002, total_cost = 12348693.48861, pathkeys = 0x0}, flags = 4, custom_paths = 0x1514788, custom_private = 0x1514ee8, methods = 0x7f45211feca0 } (gdb) p *(MemoizePath *)inner_path $14 = {path = {type = T_MemoizePath, pathtype = T_Memoize, parent = 0x14dc800, pathtarget = 0x14dca10, param_info = 0x150d8a8, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 1, startup_cost = 0.44501, total_cost = 8.4284913207446568, pathkeys = 0x150d5c8}, subpath = 0x150d148, hash_operators = 0x1526428, param_exprs = 0x1526478, singlerow = true, binary_mode = false, calls = 3251872.91667, est_entries = 3251873} Best regards, -- HeteroDB, Inc / The PG-Strom Project KaiGai Kohei
Missing docs for new enable_group_by_reordering GUC
This commit added enable_group_by_reordering: commit 0452b461bc4 Author: Alexander Korotkov Date: Sun Jan 21 22:21:36 2024 +0200 Explore alternative orderings of group-by pathkeys during optimization. When evaluating a query with a multi-column GROUP BY clause, we can minimize sort operations or avoid them if we synchronize the order of GROUP BY clauses with the ORDER BY sort clause or sort order, which comes from the underlying query tree. Grouping does not imply any ordering, so we can compare the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg, we simply compared keys in the order specified in the query. This commit explores alternative ordering of the keys, trying to find a cheaper one. The ordering of group keys may interact with other parts of the query, some of which may not be known while planning the grouping. For example, there may be an explicit ORDER BY clause or some other ordering-dependent operation higher up in the query, and using the same ordering may allow using either incremental sort or even eliminating the sort entirely. The patch always keeps the ordering specified in the query, assuming the user might have additional insights. This introduces a new GUC enable_group_by_reordering so that the optimization may be disabled if needed. Discussion: https://postgr.es/m/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru Author: Andrei Lepikhov, Teodor Sigaev Reviewed-by: Tomas Vondra, Claudio Freire, Gavin Flower, Dmitry Dolgov Reviewed-by: Robert Haas, Pavel Borisov, David Rowley, Zhihong Yu Reviewed-by: Tom Lane, Alexander Korotkov, Richard Guo, Alena Rybakina It mentions it was added as a GUC to postgresql.conf, but I see no SGML docs for this new GUC value. Would someone please add docs for this? Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
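For anyone who wants to experiment while the documentation is missing, the GUC itself is live; a minimal check, assuming the default from that commit is on:

    SHOW enable_group_by_reordering;       -- expected: on
    SET enable_group_by_reordering = off;  -- stop exploring alternative GROUP BY key orders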
Re: may be a buffer overflow problem
Hi, On 2024-06-17 23:52:54 +0200, Daniel Gustafsson wrote: > Since sqlca is, according to our docs, present in other database systems we > should probably keep it a 5-char array for portability reasons. Adding a > padding character should be fine though. How about, additionally, adding __attribute__((nonstring))? Wrapped in an attribute, of course. That'll trigger warnings for many unsafe uses, like strlen(). It doesn't quite detect the problematic case in ecpg_log() though, seems it doesn't understand fprintf() well enough (it does trigger in simple printf() cases, because they get reduced to puts(), which it understands). Adding nonstring would possibly allow us to re-enable -Wstringop-truncation; it currently triggers a bunch on ../../../../../home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c: In function ‘ECPGset_var’: ../../../../../home/andres/src/postgresql/src/interfaces/ecpg/ecpglib/misc.c:575:17: warning: ‘__builtin_strncpy’ output truncated before terminating nul copying 5 bytes from a string of the same length [-Wstringop-truncation] 575 | strncpy(sqlca->sqlstate, "YE001", sizeof(sqlca->sqlstate)); The only other -Wstringop-truncation warnings are in ecpg tests and at least the first one doesn't look bogus: ../../../../../home/andres/src/postgresql/src/interfaces/ecpg/test/compat_oracle/char_array.pgc: In function 'main': ../../../../../home/andres/src/postgresql/src/interfaces/ecpg/test/compat_oracle/char_array.pgc:54:5: warning: '__builtin_strncpy' output truncated before terminating nul copying 5 bytes from a string of the same length [-Wstringop-truncation] 54 | strncpy(shortstr, p, sizeof shortstr); Which seems like a valid complaint, given that shortstr is a char[5], p is "X" and that shortstr is printed: printf("\"%s\": \"%s\" %d\n", bigstr, shortstr, shstr_ind); Greetings, Andres Freund
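A self-contained sketch of what the nonstring suggestion buys, using a toy struct rather than the real ecpg sqlca_t (GCC 8 or later; the exact warning names vary by version):

    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for sqlca_t: a 5-byte code that is deliberately not
     * NUL-terminated, marked so GCC knows it is not a C string. */
    struct demo_sqlca
    {
        char sqlstate[5] __attribute__((nonstring));
    };

    int
    main(void)
    {
        struct demo_sqlca sqlca;

        /* Intended use: copy exactly 5 bytes.  With nonstring, GCC no longer
         * emits -Wstringop-truncation for dropping the terminating NUL. */
        strncpy(sqlca.sqlstate, "YE001", sizeof(sqlca.sqlstate));

        /* Unsafe use: GCC now warns, since strlen() expects a NUL-terminated
         * string and may read past the end of the array. */
        printf("%zu\n", strlen(sqlca.sqlstate));

        return 0;
    }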
Re: may be a buffer overflow problem
Andres Freund writes: > On 2024-06-17 23:52:54 +0200, Daniel Gustafsson wrote: >> Since sqlca is, according to our docs, present in other database systems we >> should probably keep it a 5-char array for portability reasons. Adding a >> padding character should be fine though. > How about, additionally, adding __attribute__((nonstring))? Wrapped in an > attribute, of course. That'll trigger warning for many unsafe uses, like > strlen(). What I was advocating for is that we make it *safe* for strlen, not that we double down on awkward, non-idiomatic, unsafe coding practices. Admittedly, I'm not sure how we could persuade compilers that a char[5] followed by a char field is a normal C string ... regards, tom lane
Re: Better error message when --single is not the first arg to postgres executable
On Wed, Jun 05, 2024 at 11:38:48PM -0400, Greg Sabino Mullane wrote: > On Wed, Jun 5, 2024 at 3:18 PM Nathan Bossart > wrote: >> Could we remove the requirement that --single must be first? I'm not >> thrilled about adding a list of "must be first" options that needs to stay >> updated, but given this list probably doesn't change too frequently, maybe >> that's still better than a more invasive patch to allow specifying these >> options in any order... > > It would be nice, and I briefly looked into removing the "first" > requirement, but src/backend/tcop/postgres.c for one assumes that --single > is always argv[1], and it seemed not worth the extra effort to make it work > for argv[N] instead of argv[1]. I don't mind it being the first argument, > but that confusing error message needs to go. I spent some time trying to remove the must-be-first requirement and came up with the attached draft-grade patch. However, there's a complication: the "database" option for single-user mode must still be listed last, at least on systems where getopt() doesn't move non-options to the end of the array. My previous research [0] indicated that this is pretty common, and I noticed it because getopt() on macOS doesn't seem to reorder non-options. I thought about changing these to getopt_long(), which we do rely on to reorder non-options, but that conflicts with our ParseLongOption() "long argument simulation" that we use to allow specifying arbitrary GUCs via the command-line. This remaining discrepancy might be okay, but I was really hoping to reduce the burden on users to figure out the correct ordering of options. The situations in which I've had to use single-user mode are precisely the situations in which I'd rather not have to spend time learning these kinds of details. [0] https://postgr.es/m/20230609232257.GA121461%40nathanxps13 -- nathan >From 44f8747a75197b3842c41f77503fa8389ba8db3e Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Mon, 17 Jun 2024 21:20:13 -0500 Subject: [PATCH 1/1] allow --single, etc. 
in any order --- src/backend/bootstrap/bootstrap.c | 12 ++-- src/backend/main/main.c | 101 -- src/backend/tcop/postgres.c | 15 ++--- 3 files changed, 77 insertions(+), 51 deletions(-) diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index 986f6f1d9c..53011b4300 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -210,13 +210,6 @@ BootstrapModeMain(int argc, char *argv[], bool check_only) /* Set defaults, to be overridden by explicit options below */ InitializeGUCOptions(); - /* an initial --boot or --check should be present */ - Assert(argc > 1 - && (strcmp(argv[1], "--boot") == 0 - || strcmp(argv[1], "--check") == 0)); - argv++; - argc--; - while ((flag = getopt(argc, argv, "B:c:d:D:Fkr:X:-:")) != -1) { switch (flag) @@ -230,6 +223,11 @@ BootstrapModeMain(int argc, char *argv[], bool check_only) char *name, *value; + /* --boot and --check were already processed in main() */ + if (strcmp(optarg, "boot") == 0 || + strcmp(optarg, "check") == 0) + break; + ParseLongOption(optarg, &name, &value); if (!value) { diff --git a/src/backend/main/main.c b/src/backend/main/main.c index bfd0c5ed65..b893a9f298 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -57,7 +57,17 @@ static void check_root(const char *progname); int main(int argc, char *argv[]) { - booldo_check_root = true; + boolcheck = false; + boolboot = false; +#ifdef EXEC_BACKEND + boolforkchild = false; +#endif + booldescribe_config = false; + boolsingle = false; + boolshow_help = false; + boolversion = false; + boolcheck_guc = false; + int modes_set = 0; reached_main = true; @@ -136,61 +146,86 @@ main(int argc, char *argv[]) */ unsetenv("LC_ALL"); + for (int i = 1; i < argc; i++) + { + modes_set++; + + if (strcmp(argv[i], "--check") == 0) + check = true; + else if (strcmp(argv[i], "--boot") == 0) + boot = true; +#ifdef EXEC_BACKEND + else if (strncmp(argv[i], "--forkchild", 11) == 0) + forkchild = true; +#endif + else if (strcmp(argv[i], "--
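For context, the ordering constraint under discussion looks like this from the user's side (data directory and database name are placeholders):

    postgres --single -D /usr/local/pgsql/data postgres    # documented form: mode first, database last
    postgres -D /usr/local/pgsql/data --single postgres    # currently rejected with the confusing error

The draft patch above aims to accept the second form for the mode switches, though the database name still has to come last on platforms whose getopt() does not reorder non-options.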
Re: assertion failure at cost_memoize_rescan()
On Tue, 18 Jun 2024 at 14:23, Kohei KaiGai wrote: > > 2024年6月17日(月) 8:27 David Rowley : > > It would be good to know what type of Path outer_path is. Normally > > we'll clamp_row_est() on that field. I suspect we must have some Path > > type that isn't doing that. > > > > KaiGai-san, what type of Path is outer_path? > > > It is CustomPath with rows = 3251872.91667. I suspected this might have been a CustomPath. > (I'm not certain whether the non-integer value in the estimated rows > is legal or not.) I guess since it's not documented that Path.rows is always clamped, it's probably bad to assume that it is. Since clamp_row_est() will ensure the value is clamped >= 1.0 && <= MAXIMUM_ROWCOUNT (which ensures non-zero), I tried looking around the codebase for anything that divides by Path.rows to see if we ever assume that we can divide without first checking if Path.rows != 0. Out of the places I saw, it seems we do tend to code things so that we don't assume the value has been clamped. E.g. adjust_limit_rows_costs() does if (*rows < 1) *rows = 1; I think the best solution is to apply the attached. I didn't test, but it should fix the issue you reported and also ensure that MemoizePath.calls is never zero, which would also cause issues in the hit_ratio calculation in cost_memoize_rescan(). David clamp_row_est_memoize_calls.patch Description: Binary data
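For readers not familiar with it, this is roughly what costsize.c's clamp_row_est() does (paraphrased, not a verbatim copy), which is why a properly clamped row count can never be zero:

    #include <math.h>

    #define MAXIMUM_ROWCOUNT 1e100

    /* Paraphrase of clamp_row_est(): estimates become finite integral values
     * in [1, MAXIMUM_ROWCOUNT], so e.g. 3251872.91667 rounds to 3251873. */
    static double
    clamp_row_est(double nrows)
    {
        if (isnan(nrows) || nrows > MAXIMUM_ROWCOUNT)
            nrows = MAXIMUM_ROWCOUNT;
        else if (nrows <= 1.0)
            nrows = 1.0;
        else
            nrows = rint(nrows);
        return nrows;
    }

The attached patch is not reproduced in this dump, but presumably it applies the same clamping to the calls estimate inside create_memoize_path().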
Re: New GUC autovacuum_max_threshold ?
I didn't see a commitfest entry for this, so I created one to make sure we don't lose track of this: https://commitfest.postgresql.org/48/5046/ -- nathan
Re: assertion failure at cost_memoize_rescan()
On Tue, Jun 18, 2024 at 10:53 AM David Rowley wrote: > Out of the places I saw, it seems we do tend to code things so that we > don't assume the value has been clamped. E.g. > adjust_limit_rows_costs() does if (*rows < 1) *rows = 1; Agreed. In costsize.c I saw a few instances where we have /* Protect some assumptions below that rowcounts aren't zero */ if (inner_path_rows <= 0) inner_path_rows = 1; > I think the best solution is to apply the attached. I didn't test, > but it should fix the issue you reported and also ensure that > MemoizePath.calls is never zero, which would also cause issues in the > hit_ratio calculation in cost_memoize_rescan(). +1. Thanks Richard
Re: State of pg_createsubscriber
On Mon, May 20, 2024 at 12:12 PM Amit Kapila wrote: > > On Sun, May 19, 2024 at 11:20 PM Euler Taveira wrote: > > > > On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote: > > > > I'm fairly disturbed about the readiness of pg_createsubscriber. > > The 040_pg_createsubscriber.pl TAP test is moderately unstable > > in the buildfarm [1], and there are various unaddressed complaints > > at the end of the patch thread (pretty much everything below [2]). > > I guess this is good enough to start beta with, but it's far from > > being good enough to ship, IMO. If there were active work going > > on to fix these things, I'd feel better, but neither the C code > > nor the test script have been touched since 1 April. > > > > > > My bad. :( I'll post patches soon to address all of the points. > > > > Just to summarize, apart from BF failures for which we had some > discussion, I could recall the following open points: > > 1. After promotion, the pre-existing replication objects should be > removed (either optionally or always), otherwise, it can lead to a new > subscriber not being able to restart or getting some unwarranted data. > [1][2]. > > 2. Retaining synced slots on new subscribers can lead to unnecessary > WAL retention and dead rows [3]. > > 3. We need to consider whether some of the messages displayed in > --dry-run mode are useful or not [4]. > The recent commits should silence BF failures and resolve point number 2. But we haven't done anything yet for 1 and 3. For 3, we have a patch in email [1] (v3-0005-Avoid*) which can be reviewed and committed but point number 1 needs discussion. Apart from that somewhat related to point 1, Kuroda-San has raised a point in an email [2] for replication slots. Shlok has presented a case in this thread [3] where the problem due to point 1 can cause ERRORs or can cause data inconsistency. Now, the easiest way out here is that we accept the issues with the pre-existing subscriptions and replication slots cases and just document them for now with the intention to work on those in the next version. OTOH, if there are no major challenges, we can try to implement a patch for them as well as see how it goes. [1] https://www.postgresql.org/message-id/CANhcyEWGfp7_AGg2zZUgJF_VYTCix01yeY8ZX9woz%2B03WCMPRg%40mail.gmail.com [2] https://www.postgresql.org/message-id/OSBPR01MB25525C17E2EF5FC81152F6C8F5F42%40OSBPR01MB2552.jpnprd01.prod.outlook.com [3] https://www.postgresql.org/message-id/CANhcyEWvimA1-f6hSrA%3D9qkfR5SonFb56b36M%2B%2BvT%3DLiFj%3D76g%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions
Hi John, On Tue, Jun 18, 2024 at 2:35 AM John H wrote: > > Hi Ashutosh, > > Thinking about this more, could you clarify the problem/issue at hand? > I think it's still not clear to me. > Yes, CREATE EXTENSION can create functions that lead to unexpected > privilege escalation, regardless > if they are SECURITY DEFINER or SECURITY INVOKER (if the function is > inadvertently executed by superuser). > But that's also true for a general CREATE FUNCTION call outside of extensions. > This specifically applies to extension functions, not standalone functions created independently. The difference is that installing extensions typically requires superuser privileges, which is not the case with standalone functions. -- With Regards, Ashutosh Sharma.
Re: Xact end leaves CurrentMemoryContext = TopMemoryContext
On Tue, 18 Jun 2024 at 08:37, Tom Lane wrote: > As to what to do about it: I'm imagining that instead of resetting > CurrentMemoryContext to TopMemoryContext, we set it to some special > context that we expect we can reset every so often, like at the start > of the next transaction. The existing TransactionAbortContext is > a very similar thing, and maybe could be repurposed/shared with this > usage. > > Thoughts? Instead, could we just not delete TopTransactionContext in AtCommit_Memory() and instead do MemoryContextReset() on it? Likewise in AtCleanup_Memory(). If we got to a stage where we didn't expect anything to allocate into that context outside of a transaction, we could check if the context is still reset in AtStart_Memory() and do something like raise a WARNING on debug builds (or Assert()) to alert us that some code broke our expectations. It might also be a very tiny amount more efficient to not delete the context, so we don't have to fetch a new context from the context freelist in AtStart_Memory(). Certainly, it wouldn't add any overhead. Adding a new special context would, and so would the logic to reset it every so often. David
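A rough, untested sketch of one way to read that suggestion against xact.c's AtCommit_Memory(), with the surrounding bookkeeping abbreviated; AtCleanup_Memory() and AtStart_Memory() would need matching adjustments:

    static void
    AtCommit_Memory(void)
    {
        /*
         * Keep TopTransactionContext around instead of deleting it, wipe its
         * contents, and leave CurrentMemoryContext pointing at it, so stray
         * allocations made between transactions are reclaimed at the next
         * reset rather than accumulating in TopMemoryContext.
         */
        Assert(TopTransactionContext != NULL);
        MemoryContextSwitchTo(TopTransactionContext);
        MemoryContextReset(TopTransactionContext);

        CurTransactionContext = NULL;
    }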
Re: Conflict Detection and Resolution
On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila wrote: > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar wrote: > > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra > > wrote: > > > > > > Yes, that's correct. However, many cases could benefit from the > > > > update_deleted conflict type if it can be implemented reliably. That's > > > > why we wanted to give it a try. But if we can't achieve predictable > > > > results with it, I'm fine to drop this approach and conflict_type. We > > > > can consider a better design in the future that doesn't depend on > > > > non-vacuumed entries and provides a more robust method for identifying > > > > deleted rows. > > > > > > > > > > I agree having a separate update_deleted conflict would be beneficial, > > > I'm not arguing against that - my point is actually that I think this > > > conflict type is required, and that it needs to be detected reliably. > > > > > > > When working with a distributed system, we must accept some form of > > eventual consistency model. However, it's essential to design a > > predictable and acceptable behavior. For example, if a change is a > > result of a previous operation (such as an update on node B triggered > > after observing an operation on node A), we can say that the operation > > on node A happened before the operation on node B. Conversely, if > > operations on nodes A and B are independent, we consider them > > concurrent. > > > > In distributed systems, clock skew is a known issue. To establish a > > consistency model, we need to ensure it guarantees the > > "happens-before" relationship. Consider a scenario with three nodes: > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and > > subsequently NodeB makes changes, and then both NodeA's and NodeB's > > changes are sent to NodeC, the clock skew might make NodeB's changes > > appear to have occurred before NodeA's changes. However, we should > > maintain data that indicates NodeB's changes were triggered after > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's > > changes happened after NodeA's changes, despite what the timestamps > > suggest. > > > > A common method to handle such cases is using vector clocks for > > conflict resolution. > > > > I think the unbounded size of the vector could be a problem to store > for each event. However, while researching previous discussions, it > came to our notice that we have discussed this topic in the past as > well in the context of standbys. For recovery_min_apply_delay, we > decided the clock skew is not a problem as the settings of this > parameter are much larger than typical time deviations between servers > as mentioned in docs. Similarly for casual reads [1], there was a > proposal to introduce max_clock_skew parameter and suggesting the user > to make sure to have NTP set up correctly. We have tried to check > other databases (like Ora and BDR) where CDR is implemented but didn't > find anything specific to clock skew. So, I propose to go with a GUC > like max_clock_skew such that if the difference of time between the > incoming transaction's commit time and the local time is more than > max_clock_skew then we raise an ERROR. It is not clear to me that > putting bigger effort into clock skew is worth especially when other > systems providing CDR feature (like Ora or BDR) for decades have not > done anything like vector clocks. It is possible that this is less of > a problem w.r.t CDR and just detecting the anomaly in clock skew is > good enough. 
I believe that if we've accepted this solution elsewhere, then we can also consider the same. Basically, we're allowing the application to set its tolerance for clock skew. And, if the skew exceeds that tolerance, it's the application's responsibility to synchronize; otherwise, an error will occur. This approach seems reasonable. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
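As a purely hypothetical sketch of what the apply-side check for such a GUC might look like: max_clock_skew and remote_commit_ts are invented names for illustration, and only TimestampDifferenceExceeds() and GetCurrentTimestamp() are existing functions:

    /* Reject remote changes whose origin commit timestamp is further ahead
     * of the local clock than the configured tolerance (in seconds). */
    if (max_clock_skew >= 0 &&
        TimestampDifferenceExceeds(GetCurrentTimestamp(),
                                   remote_commit_ts,
                                   max_clock_skew * 1000))
        ereport(ERROR,
                (errmsg("commit timestamp from remote node exceeds max_clock_skew relative to local clock")));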
Re: Xact end leaves CurrentMemoryContext = TopMemoryContext
David Rowley writes: > Instead, could we just not delete TopTransactionContext in > AtCommit_Memory() and instead do MemoryContextReset() on it? Likewise > in AtCleanup_Memory(). Hmm, that's a nice idea. Maybe reset again in AtStart_Memory, although that seems optional. My first reaction was "what about memory context callbacks attached to TopTransactionContext?" ... but those are defined to be fired on either reset or delete, so semantically this seems like it creates no issues. And you're right that not constantly deleting and recreating that context should save some microscopic amount. > If we got to a stage where we didn't expect anything to allocate into > that context outside of a transaction, we could check if the context > is still reset in AtStart_Memory() and do something like raise a > WARNING on debug builds (or Assert()) to alert us that some code that > broke our expectations. My point is exactly that I don't think that we can expect that, or at least that the cost of guaranteeing it will vastly outweigh any possible benefit. (So I wasn't excited about Andres' suggestion. But this one seems promising.) I'll poke at this tomorrow, unless you're hot to try it right now. regards, tom lane