Re: avoid multiple hard links to same WAL file after a crash
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote:
> Here is an attempt at creating something that can be back-patched. 0001
> simply replaces calls to durable_rename_excl() with durable_rename() and is
> intended to be back-patched. 0002 removes the definition of
> durable_rename_excl() and is _not_ intended for back-patching. I imagine
> 0002 will need to be held back for v16devel.

I would not mind applying 0002 on HEAD now to avoid more uses of this API, and I can get behind 0001 after thinking more about it.

> I think back-patching 0001 will encounter a couple of small obstacles. For
> example, the call in basic_archive won't exist on most of the
> back-branches, and durable_rename_excl() was named durable_link_or_rename()
> before v13. I don't mind producing a patch for each back-branch if needed.

I am not sure that we have any need to backpatch this change, based on the unlikeliness of the problem, TBH. One thing that is itching me a bit, like Robert upthread, is that we no longer check that the newfile does not exist in these code paths, because we never expect one. It is possible to use stat() for that, but access() within a simple assertion would be simpler? Say something like:

Assert(access(path, F_OK) != 0 && errno == ENOENT);

The case for basic_archive is limited, as the comment of the patch states, but that would be helpful for the two calls in timeline.c and the one in xlog.c in the long term. And this has no need to be part of fd.c; it can be added before the durable_rename() calls. What do you think?
--
Michael
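For illustration, a minimal sketch of that suggestion at a durable_rename() call site (placeholder paths and error level; not the actual timeline.c code):

    /*
     * Sketch of the suggestion above, not actual source: assert that the
     * rename target does not already exist, then do the plain durable
     * rename.  With elevel = ERROR, durable_rename() ereports on failure.
     */
    Assert(access(path, F_OK) != 0 && errno == ENOENT);
    durable_rename(tmppath, path, ERROR);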
Re: Fix NULL pointer reference in _outPathTarget()
On 22.04.22 16:18, Tom Lane wrote:
> Peter Eisentraut writes:
>> On 20.04.22 18:53, Tom Lane wrote:
>>> Yeah, that's another way to do it. I think though that the unresolved
>>> question is whether or not we want the field name to appear in the output
>>> when the field is null. I believe that I intentionally made it not appear
>>> originally, so that that case could readily be distinguished. You could
>>> argue that that would complicate life greatly for a _readPathTarget()
>>> function, which is true, but I don't foresee that we'll need one.
>>
>> We could adapt the convention to print NULL values as "<>", like ...
>
> Works for me.

done
Fix primary crash continually with invalid checkpoint after promote
A newly promoted primary may be left with an invalid checkpoint. In CreateRestartPoint(), the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint pointing at nonexistent WAL. The telltale log messages are "invalid primary checkpoint record" and "could not locate a valid checkpoint record".

The following timeline reproduces the situation:

tl1: the standby begins to create a restart point (time- or WAL-triggered).
tl2: the standby promotes, and the control file state is updated to DB_IN_PRODUCTION. The control file will not be updated (xlog.c:9690), but old WAL files are still removed (xlog.c:9719).
tl3: the standby becomes primary. The primary may crash before the next complete checkpoint (OOM in my situation), after which it crashes continually with an invalid checkpoint.

The attached patch reproduces this problem using a standard PostgreSQL Perl test; you can run it with:

./configure --enable-tap-tests; make -j; make -C src/test/recovery/ check PROVE_TESTS=t/027_invalid_checkpoint_after_promote.pl

The attached patch also fixes the problem by ensuring that old WAL files are removed only after the control file is updated.

0001-Fix-primary-crash-continually-with-invalid-checkpoint-after-promote.patch
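To make the proposed ordering concrete, here is a rough sketch (simplified, with placeholder variable names; not the attached patch itself):

    /*
     * Rough sketch of the fix described above, not the attached patch:
     * only remove old WAL segments if the control file was actually
     * updated to reference the new restart point.
     */
    bool        updated = false;

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        UpdateControlFile();
        updated = true;
    }
    LWLockRelease(ControlFileLock);

    /* With the fix, WAL removal is skipped when the update was skipped. */
    if (updated)
        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);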
Wrong rows count in EXPLAIN
When I create a new table and then evaluate the execution of a SELECT query, I see a strange row count in EXPLAIN:

CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
ANALYZE test1;
EXPLAIN SELECT * FROM test1;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
(1 row)

The table is empty but rows=2550. Seems like it was calculated from some default values. Is this normal behavior or a bug? Can it lead to a poor choice of the plan of a query in general?
Re: Fix primary crash continually with invalid checkpoint after promote
On Tue, Apr 26, 2022 at 03:16:13PM +0800, Zhao Rui wrote: > In function CreateRestartPoint, control file is updated and old wals are > removed. But in some situations, control file is not updated, old wals are > still removed. Thus produces an invalid checkpoint with nonexistent wal. > Crucial log: "invalid primary checkpoint record", "could not locate a valid > checkpoint record". I think this is the same issue tracked here: [0]. [0] https://postgr.es/m/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote:
> I searched a bit and apparently some people are using this function directly,
> opening some dll, which seems wrong.

I was wondering about this whole business, and the manifest approach is a *horrible* design for an API whose goal is to know if your run-time environment is greater than a given threshold.

>> Another idea on Windows machines would be to use the command line to execute
>> ver in a separate process and store the result in a file.
>
> That also seems hackish, I don't think that we want to rely on something like
> that.

Hmm. That depends on the dependency set, I guess. We do that to some extent on Linux for large pages in sysv_shmem.c. Perhaps this could work for Win10 if it avoids the extra loopholes with the manifests.

> Their API is entirely useless,

I agree with this.

> so I'm still of the opinion that we should unconditionally use the
> FILE_MAP_LARGE_PAGES flag if it's defined and call it a day.

Are we sure that this is not going to cause failures in environments where the flag is not supported?
--
Michael
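For reference, the "opening some dll" trick mentioned above usually looks roughly like the following; this is only an illustration of what people do, not an endorsement:

    #include <windows.h>
    #include <stdbool.h>

    typedef LONG (WINAPI * RtlGetVersionFn) (PRTL_OSVERSIONINFOW);

    /*
     * Illustration of the "load ntdll directly" approach discussed above;
     * RtlGetVersion is not subject to the manifest-based version lie, but
     * it is a kernel-mode API and dubious to use from user-mode code.
     */
    static bool
    get_real_windows_version(RTL_OSVERSIONINFOW *info)
    {
        HMODULE     ntdll = GetModuleHandle(TEXT("ntdll.dll"));
        RtlGetVersionFn fn;

        if (ntdll == NULL)
            return false;
        fn = (RtlGetVersionFn) GetProcAddress(ntdll, "RtlGetVersion");
        if (fn == NULL)
            return false;
        info->dwOSVersionInfoSize = sizeof(*info);
        return fn(info) == 0;       /* 0 == STATUS_SUCCESS */
    }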
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 8:57 AM Bharath Rupireddy wrote:
>
> On Wed, Apr 27, 2022 at 8:45 AM Tom Lane wrote:
> >
> > I wrote:
> > > Thomas Munro writes:
> > >> BTW If you had your local change from debug.patch (upthread), that'd
> > >> defeat the patch. I mean this:
> > >> +if(!*errormsg)
> > >> + *errormsg = "decode_queue_head is null";
> > > Oh! Okay, I'll retry without that.
> >
> > I've now done several runs with your patch and not seen the test failure.
> > However, I think we ought to rethink this API a bit rather than just
> > apply the patch as-is. Even if it were documented, relying on
> > errormsg = NULL to mean something doesn't seem like a great plan.
>
> Sorry for being late in the game, occupied with other stuff.
>
> How about using private_data of XLogReaderState for
> read_local_xlog_page_no_wait, something like this?
>
> typedef struct ReadLocalXLogPageNoWaitPrivate
> {
>     bool end_of_wal;
> } ReadLocalXLogPageNoWaitPrivate;
>
> In read_local_xlog_page_no_wait:
>
> /* If asked, let's not wait for future WAL. */
> if (!wait_for_wal)
> {
>     private_data->end_of_wal = true;
>     break;
> }
>
> /*
>  * Opaque data for callbacks to use. Not used by XLogReader.
>  */
> void *private_data;

I found an easy way to reproduce this consistently (I think on any server). I basically generated a huge WAL record (I used a fun extension that I wrote - https://github.com/BRupireddy/pg_synthesize_wal - but one can use pg_logical_emit_message as well):

session 1:
select * from pg_synthesize_wal_record(1*1024*1024); --> generate a 1 MB WAL record first and make a note of the output lsn (lsn1)

session 2:
select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
\watch 1

session 1:
select * from pg_synthesize_wal_record(1000*1024*1024); --> generate a ~1 GB WAL record, and we see ERROR: could not read WAL at X in session 2.

Delay the checkpoint (set checkpoint_timeout to 1hr) so that the WAL files are not recycled while we run the pg_walinspect functions; no other changes are required from the default initdb settings on the server.

And Thomas's patch fixes the issue.

Regards,
Bharath Rupireddy.
Re: Wrong rows count in EXPLAIN
Hi!

> On 26 Apr 2022, at 13:45, Пантюшин Александр Иванович wrote:
>
> When I create a new table, and then I evaluate the execution of the SELECT
> query, I see a strange rows count in EXPLAIN
> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
> ANALYZE test1;
> EXPLAIN SELECT * FROM test1;
>                        QUERY PLAN
> ---------------------------------------------------------
>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
> (1 row)
>
> Table is empty but rows=2550. Seem like it was calculated from some default
> values.
> Is this normal behavior or a bug? Can it lead to a poor choice of the plan
> of a query in general?

Which Postgres version do you use? I observe:

postgres=# CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
CREATE TABLE
postgres=# ANALYZE test1;
ANALYZE
postgres=# EXPLAIN SELECT * FROM test1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using test1_pkey on test1  (cost=0.12..8.14 rows=1 width=4)
(1 row)

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 15devel on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
(1 row)

Without "ANALYZE test1;", table_block_relation_estimate_size() assumes the relation size is 10 blocks.

Best regards, Andrey Borodin.
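To make the arithmetic concrete, here is a small standalone sketch; the overhead constants are assumptions for illustration, not the server's exact values:

    #include <math.h>
    #include <stdio.h>

    /*
     * Standalone sketch, not PostgreSQL source: a never-analyzed,
     * zero-page relation is assumed to span 10 blocks, and the row
     * estimate is derived from the tuple width.
     */
    static double
    default_row_estimate(int data_width)
    {
        const double blcksz = 8192.0;       /* default block size */
        const double page_overhead = 24.0;  /* assumed page header size */
        const double tuple_overhead = 28.0; /* assumed per-tuple overhead */
        double      density = (blcksz - page_overhead) /
                              (tuple_overhead + data_width);

        return rint(density * 10.0);        /* 10 assumed pages */
    }

    int
    main(void)
    {
        /* For a 4-byte column this lands near the rows=2550 seen above. */
        printf("%.0f\n", default_row_estimate(4));
        return 0;
    }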
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-26, Tomas Vondra wrote:

> I'm not quite sure which of the two behaviors is more "desirable". In a
> way, it's somewhat similar to publish_as_relid, which is also calculated
> not considering which of the row filters match?

I grepped doc/src/sgml for `publish_as_relid` and found no hits, so I suppose it's not a user-visible feature as such.

> But maybe you're right and it should behave the way you propose ... the
> example I have in mind is a use case replicating table with two types of
> rows - sensitive and non-sensitive. For sensitive, we replicate only
> some of the columns, for non-sensitive we replicate everything.

Exactly. If we blindly publish row/column values that aren't in *any* publications, this may lead to leaking protected values.

> Changing this to behave the way you expect would be quite difficult,
> because at the moment we build a single OR expression from all the row
> filters. We'd have to keep the individual expressions, so that we can
> build a column list for each of them (in order to ignore those that
> don't match).

I think we should do that, yeah.

> I can take a stab at it, but it seems strange to not apply the same
> logic to evaluation of publish_as_relid. I wonder what Amit thinks about
> this, as he wrote the row filter stuff.

By grepping publicationcmds.c, it seems that publish_as_relid refers to the ancestor partitioned table that is used for column list and rowfilter determination, when a partition is being published as part of it. I don't think these things are exactly parallel.

... In fact I think they are quite orthogonal: probably you should be able to publish a partitioned table in two publications, with different rowfilters and different column lists (which can come from the topmost partitioned table), and each partition should still work in the way I describe above.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere." (Lamar Owen)
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-27, Amit Kapila wrote: > On Tue, Apr 26, 2022 at 4:00 AM Tomas Vondra > wrote: > > I can take a stab at it, but it seems strange to not apply the same > > logic to evaluation of publish_as_relid. > > Yeah, the current behavior seems to be consistent with what we already > do. Sorry, this argument makes no sense to me. The combination of both features is not consistent, and both features are new. 'publish_as_relid' is an implementation detail. If the implementation fails to follow the feature design, then the implementation must be fixed ... not the design! IMO, we should first determine how we want row filters and column lists to work when used in conjunction -- for relations (sets of rows) in a general sense. After we have done that, then we can use that design to drive how we want partitioned tables to be handled for it. Keep in mind that when users see a partitioned table, what they first see is a table. They want all their tables to work in pretty much the same way -- partitioned or not partitioned. The fact that a table is partitioned should affect as little as possible the way it interacts with other features. Now, another possibility is to say "naah, this is too hard", or even "naah, there's no time to write all that for this release". That might be okay, but in that case let's add an implementation restriction to ensure that we don't paint ourselves in a corner regarding what is reasonable behavior. For example, an easy restriction might be: if a table is in multiple publications with mismatching row filters/column lists, then a subscriber is not allowed to subscribe to both publications. (Maybe this restriction isn't exactly what we need so that it actually implements what we need, not sure). Then, if/when in the future we implement this correctly, we can lift the restriction. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022, 05:55 Julien Rouhaud wrote:
> Hi,
>
> Please keep the list in copy, especially if that's about Windows specifics, as
> I'm definitely not very knowledgeable about it.
>
> On Fri, Apr 01, 2022 at 09:18:03AM +0000, Wilm Hoyer wrote:
> >
> > If you don't wanna go the manifest way, maybe the RtlGetVersion function
> > is the one you need:
> > https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlgetversion?redirectedfrom=MSDN
>
> Thanks for the info! I tried to use the function, but trying to include either
> wdm.h or Ntddk.h errors out. Unfortunately I don't know how to look for a file
> in Windows, so I don't even know if those files are present.

That's a kernel API for use in drivers, not in applications. You need the device driver development kit to get to the headers. It's not supposed to be used from a user-land application. But note the documentation comment that says: "RtlGetVersion is the kernel-mode equivalent of the user-mode GetVersionEx function in the Windows SDK."

TL;DR: user-mode applications are supposed to use GetVersionEx().

/Magnus
RE: bogus: logical replication rows/cols combinations
On Wednesday, April 27, 2022 12:56 PM, Amit Kapila wrote:
> On Tue, Apr 26, 2022 at 4:00 AM Tomas Vondra wrote:
> >
> > On 4/25/22 17:48, Alvaro Herrera wrote:
> > >
> > > The desired result on subscriber is:
> > >
> > > table uno;
> > >   a │ b │ c
> > > ────┼───┼───
> > >   1 │ 2 │
> > >  -1 │   │ 4
> > >
> > > Thoughts?
> >
> > I'm not quite sure which of the two behaviors is more "desirable". In a
> > way, it's somewhat similar to publish_as_relid, which is also calculated
> > not considering which of the row filters match?
>
> Right, or in other words, we check all publications to decide it and
> similar is the case for publication actions which are also computed
> independently for all publications.
>
> > But maybe you're right and it should behave the way you propose ... the
> > example I have in mind is a use case replicating table with two types of
> > rows - sensitive and non-sensitive. For sensitive, we replicate only
> > some of the columns, for non-sensitive we replicate everything. Which
> > could be implemented as two publications
> >
> > create publication sensitive_rows
> >    for table t (a, b) where (is_sensitive);
> >
> > create publication non_sensitive_rows
> >    for table t where (not is_sensitive);
> >
> > But the way it's implemented now, we'll always replicate all columns,
> > because the second publication has no column list.
> >
> > Changing this to behave the way you expect would be quite difficult,
> > because at the moment we build a single OR expression from all the row
> > filters. We'd have to keep the individual expressions, so that we can
> > build a column list for each of them (in order to ignore those that
> > don't match).
> >
> > We'd have to remove various other optimizations - for example we can't
> > just discard row filters if we found "no_filter" publication.
>
> I don't think that is the right way. We need some way to combine
> expressions and I feel the current behavior is sane. I mean to say
> that even if there is one publication that has no filter (column/row),
> we should publish all rows with all columns. Now, as mentioned above,
> combining row filters or column lists for all publications appears to
> be consistent with what we already do and seems correct behavior to me.
>
> To me, it appears that the method used to decide whether a particular
> table is published or not is also similar to what we do for row
> filters or column lists. Even if there is one publication that
> publishes all tables, we consider the current table to be published
> irrespective of whether other publications have published that table
> or not.
>
> > Or more
> > precisely, we'd have to consider column lists too.
> >
> > In other words, we'd have to merge pgoutput_column_list_init into
> > pgoutput_row_filter_init, and then modify pgoutput_row_filter to
> > evaluate the row filters one by one, and build the column list.
>
> Hmm, I think even if we want to do something here, we also need to
> think about how to achieve similar behavior for initial tablesync
> which will be more tricky.

I think it could be difficult to make the initial tablesync behave the same. Currently, we use a "COPY" command to do the table sync, and I am not sure how to change the "COPY" query to achieve the expected behavior here.

BTW, for the use case mentioned here:

"""
replicating table with two types of rows - sensitive and non-sensitive. For
sensitive, we replicate only some of the columns, for non-sensitive we
replicate everything.
"""

One approach to do this is to create two subscriptions and two publications, which seems like a workaround.

---
create publication uno for table uno (a, b) where (a > 0);
create publication dos for table uno (a, c) where (a < 0);
create subscription sub_uno connection 'port=55432 dbname=alvherre' publication uno;
create subscription sub_dos connection 'port=55432 dbname=alvherre' publication dos;
---

Best regards,
Hou zj
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 1:47 PM Bharath Rupireddy wrote:
> [... repro steps quoted in full in the previous message ...]
>
> Delay the checkpoint (set checkpoint_timeout to 1hr) so that the WAL
> files are not recycled while we run the pg_walinspect functions; no
> other changes are required from the default initdb settings on the
> server.
>
> And Thomas's patch fixes the issue.

Here's a v2 patch (on top of Thomas's v1 at [1]) using private_data to set the end-of-WAL flag. Please have a look at it.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com

Regards,
Bharath Rupireddy.

v2-0001-Fix-pg_walinspect-race-against-flush-LSN.patch
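For readers following along, the private_data approach amounts to roughly this; the names are taken from the sketch upthread and may differ from the final patch:

    /*
     * Rough outline of the private_data idea from upthread; assumed names,
     * not necessarily what the committed patch does.
     */
    typedef struct ReadLocalXLogPageNoWaitPrivate
    {
        bool        end_of_wal;     /* set once we decline to wait for WAL */
    } ReadLocalXLogPageNoWaitPrivate;

    /* Caller side: */
    ReadLocalXLogPageNoWaitPrivate priv = {0};
    XLogReaderState *xlogreader =
        XLogReaderAllocate(wal_segment_size, NULL,
                           XL_ROUTINE(.page_read = &read_local_xlog_page_no_wait,
                                      .segment_open = &wal_segment_open,
                                      .segment_close = &wal_segment_close),
                           &priv);

    for (;;)
    {
        char       *errormsg;
        XLogRecord *record = XLogReadRecord(xlogreader, &errormsg);

        if (record == NULL)
        {
            if (priv.end_of_wal)
                break;          /* flushed end of WAL: stop cleanly */
            if (errormsg)
                ereport(ERROR,
                        (errmsg("could not read WAL: %s", errormsg)));
        }
        /* ... decode and emit the record ... */
    }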
Re: bogus: logical replication rows/cols combinations
On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera wrote: > > On 2022-Apr-26, Tomas Vondra wrote: > > > I'm not quite sure which of the two behaviors is more "desirable". In a > > way, it's somewhat similar to publish_as_relid, which is also calculated > > not considering which of the row filters match? > > I grepped doc/src/sgml for `publish_as_relid` and found no hits, so > I suppose it's not a user-visible feature as such. > `publish_as_relid` is computed based on 'publish_via_partition_root' setting of publication which is a user-visible feature. > > But maybe you're right and it should behave the way you propose ... the > > example I have in mind is a use case replicating table with two types of > > rows - sensitive and non-sensitive. For sensitive, we replicate only > > some of the columns, for non-sensitive we replicate everything. > > Exactly. If we blindly publish row/column values that aren't in *any* > publications, this may lead to leaking protected values. > > > Changing this to behave the way you expect would be quite difficult, > > because at the moment we build a single OR expression from all the row > > filters. We'd have to keep the individual expressions, so that we can > > build a column list for each of them (in order to ignore those that > > don't match). > > I think we should do that, yeah. > This can hit the performance as we need to evaluate each expression for each row. > > I can take a stab at it, but it seems strange to not apply the same > > logic to evaluation of publish_as_relid. I wonder what Amit thinks about > > this, as he wrote the row filter stuff. > > By grepping publicationcmds.c, it seems that publish_as_relid refers to > the ancestor partitioned table that is used for column list and > rowfilter determination, when a partition is being published as part of > it. > Yeah, this is true when the corresponding publication has set 'publish_via_partition_root' as true. > I don't think these things are exactly parallel. > Currently, when the subscription has multiple publications, we combine the objects, and actions of those publications. It happens for 'publish_via_partition_root', publication actions, tables, column lists, or row filters. I think the whole design works on this idea even the initial table sync. I think it might need a major change (which I am not sure about at this stage) if we want to make the initial sync also behave similar to what you are proposing. I feel it would be much easier to create two different subscriptions as mentioned by Hou-San [1] for the case you are talking about if the user really needs something like that. > ... In fact I think they are quite orthogonal: probably you should be > able to publish a partitioned table in two publications, with different > rowfilters and different column lists (which can come from the > topmost partitioned table), and each partition should still work in the > way I describe above. > We consider the column lists or row filters for either the partition (on which the current operation is performed) or partitioned table based on 'publish_via_partition_root' parameter of publication. [1] - https://www.postgresql.org/message-id/OS0PR01MB5716B82315A067F1D78F247E94FA9%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Wrong rows count in EXPLAIN
> On 27 Apr 2022, at 15:17, Пантюшин Александр Иванович wrote:
>
> Hi,
>
>> Which Postgres version do you use?
> I checked this on PG 11
> ...
> and on PG 13

Yes, I think before 3d351d91 [0] it was impossible to distinguish between an actually empty and a never-analyzed table. But now it is working just as you would expect. There's an interesting relevant discussion linked in the commit message.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/commit/3d351d916b20534f973eda760cde17d96545d4c4
Re: Wrong rows count in EXPLAIN
On Wed, 27 Apr 2022 at 21:08, Andrey Borodin wrote:
> Which Postgres version do you use?

3d351d91 changed things so we could tell the difference between a relation which was analyzed and is empty vs a relation that's never been analyzed. That's why you're not seeing the same behaviour as the OP.

Tom's commit message [1] also touches on the "safety measure". Here he's referring to the 2550 estimate, or more accurately, 10 pages filled with tuples of that width. This is intended so that newly created tables that are quickly loaded with data and then queried before auto-analyze gets a chance to run are not assumed to be empty. If we assumed these non-analyzed tables were empty, the planner would likely choose plans containing nodes like Seq Scans and non-parameterized Nested Loops rather than, say, Index Scans and Merge or Hash joins. The 10-page default aims to keep the planner from making that mistake. Generally, the planner underestimating the number of rows causes worse problems than overestimating them. So 10 seems much better than 0.

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3d351d916b20534f973eda760cde17d96545d4c4
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-27, Amit Kapila wrote: > On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera > wrote: > > > Changing this to behave the way you expect would be quite difficult, > > > because at the moment we build a single OR expression from all the row > > > filters. We'd have to keep the individual expressions, so that we can > > > build a column list for each of them (in order to ignore those that > > > don't match). > > > > I think we should do that, yeah. > > This can hit the performance as we need to evaluate each expression > for each row. So we do things because they are easy and fast, rather than because they work correctly? > > ... In fact I think they are quite orthogonal: probably you should be > > able to publish a partitioned table in two publications, with different > > rowfilters and different column lists (which can come from the > > topmost partitioned table), and each partition should still work in the > > way I describe above. > > We consider the column lists or row filters for either the partition > (on which the current operation is performed) or partitioned table > based on 'publish_via_partition_root' parameter of publication. OK, but this isn't relevant to what I wrote. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: bogus: logical replication rows/cols combinations
On Wed, Apr 27, 2022 at 4:27 PM Alvaro Herrera wrote:
>
> On 2022-Apr-27, Amit Kapila wrote:
>
> > On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera wrote:
> > > > Changing this to behave the way you expect would be quite difficult,
> > > > because at the moment we build a single OR expression from all the row
> > > > filters. We'd have to keep the individual expressions, so that we can
> > > > build a column list for each of them (in order to ignore those that
> > > > don't match).
> > >
> > > I think we should do that, yeah.
> >
> > This can hit the performance as we need to evaluate each expression
> > for each row.
>
> So we do things because they are easy and fast, rather than because they
> work correctly?

The point is that I am not sure whether what you are saying is better behavior than the current one, but if others feel it is better, then we can try to do something for it. In the above sentence, I just wanted to say that it will impact performance, but if that is required then sure, we should do it that way.

--
With Regards,
Amit Kapila.
[PATCH v1] remove redundant check of item pointer
In ItemPointerEquals(), ItemPointerGetBlockNumber() already checks that the ItemPointer is valid; there is no need to check it again when fetching the offset, so use ItemPointerGetOffsetNumberNoCheck() instead.

Signed-off-by: Junwang Zhao
---
 src/backend/storage/page/itemptr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/page/itemptr.c b/src/backend/storage/page/itemptr.c
index 9011337aa8..61ad727b1d 100644
--- a/src/backend/storage/page/itemptr.c
+++ b/src/backend/storage/page/itemptr.c
@@ -37,8 +37,8 @@ ItemPointerEquals(ItemPointer pointer1, ItemPointer pointer2)
 
 	if (ItemPointerGetBlockNumber(pointer1) ==
 		ItemPointerGetBlockNumber(pointer2) &&
-		ItemPointerGetOffsetNumber(pointer1) ==
-		ItemPointerGetOffsetNumber(pointer2))
+		ItemPointerGetOffsetNumberNoCheck(pointer1) ==
+		ItemPointerGetOffsetNumberNoCheck(pointer2))
 		return true;
 	else
 		return false;
-- 
2.33.0
Re: Skipping schema changes in publication
On Tue, Apr 26, 2022 at 11:32 AM osumi.takami...@fujitsu.com wrote:
>
> On Thursday, April 21, 2022 12:15 PM vignesh C wrote:
> > Updated patch by changing the syntax to use EXCEPT instead of SKIP.
> Hi
>
> These are my review comments on the v2 patch.
>
> (1) gram.y
>
> I think we can make a unified function that merges
> preprocess_alltables_pubobj_list with check_except_in_pubobj_list.
>
> With regard to preprocess_alltables_pubobj_list,
> we don't use the 2nd argument "location" in this function.

Removed location and made a unified function.

> (2) create_publication.sgml
>
> + Create a publication that publishes all changes in all the tables except for
> + the changes of users and
> + departments table;
>
> This sentence should end with ":" not ";".

Modified.

> (3) publication.out & publication.sql
>
> +-- fail - can't set except table to schema  publication
> +ALTER PUBLICATION testpub_forschema SET EXCEPT TABLE testpub_tbl1;
>
> There is one unnecessary space in the comment.
> Kindly change from "schema  publication" to "schema publication".

Modified.

> (4) pg_dump.c & describe.c
>
> In your first email of this thread, you explained this feature
> is for PG16. Don't we need an additional branch for PG16?
>
> @@ -6322,6 +6328,21 @@ describePublications(const char *pattern)
> 		}
> 	}
>
> +	if (pset.sversion >= 15)
> +	{
>
> @@ -4162,7 +4164,7 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], int numTables)
> 	/* Collect all publication membership info. */
> 	if (fout->remoteVersion >= 15)
> 		appendPQExpBufferStr(query,
> -							 "SELECT tableoid, oid, prpubid, prrelid, "
> +							 "SELECT tableoid, oid, prpubid, prrelid, prexcept,"

Modified by adding a comment saying "FIXME: 15 should be changed to 16 later for PG16."

> (5) psql-ref.sgml
>
> +If + is appended to the command name, the tables,
> +except tables and schemas associated with each publication are shown as
> +well.
>
> I'm not sure if "except tables" is a good description.
> I suggest "excluded tables". This applies to the entire patch,
> in case this is a reasonable suggestion.

Modified it in most of the places where it was applicable. I felt the usage was OK in a few places.

Thanks for the comments; the attached v3 patch has the changes for the same.

Regards,
Vignesh

From f8a8dff478638f822377f2515f69df9c6cce501e Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Wed, 20 Apr 2022 11:19:50 +0530
Subject: [PATCH v3] Skip publishing the tables specified in EXCEPT TABLE.

A new option "EXCEPT TABLE" in Create/Alter Publication allows one or more tables to be excluded; the publisher will exclude sending the data of the excluded tables to the subscriber. The new syntax allows specifying excluded tables. For example:

CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1,t2;
OR
ALTER PUBLICATION pub1 ADD EXCEPT TABLE t1,t2;

A new column prexcept is added to table "pg_publication_rel", to maintain the relations that the user wants to exclude publishing through the publication. Modified the output plugin (pgoutput) to exclude publishing the changes of the excluded tables. Updates pg_dump to identify and dump the excluded tables of the publications. Updates the \d family of commands to display excluded tables of the publications; the \dRp+ variant will now display associated except tables if any.

Bump catalog version.
---
 doc/src/sgml/catalogs.sgml                  |   9 ++
 doc/src/sgml/logical-replication.sgml       |   8 +-
 doc/src/sgml/ref/alter_publication.sgml     |  14 ++-
 doc/src/sgml/ref/create_publication.sgml    |  29 -
 doc/src/sgml/ref/psql-ref.sgml              |   5 +-
 src/backend/catalog/pg_publication.c        |  36 --
 src/backend/commands/publicationcmds.c      | 106 +++---
 src/backend/commands/tablecmds.c            |   4 +-
 src/backend/parser/gram.y                   |  78 +++--
 src/backend/replication/pgoutput/pgoutput.c |  25 ++---
 src/backend/utils/cache/relcache.c          |  17 ++-
 src/bin/pg_dump/pg_dump.c                   |  45 ++--
 src/bin/pg_dump/pg_dump.h                   |   1 +
 src/bin/pg_dump/pg_dump_sort.c              |   7 ++
 src/bin/pg_dump/t/002_pg_dump.pl            |  23
 src/bin/psql/describe.c                     |  52 +++--
 src/bin/psql/tab-complete.c                 |  15 ++-
 src/include/catalog/pg_publication.h        |   7 +-
 src/include/catalog/pg_publication_rel.h    |   1 +
 src/include/commands/publicationcmds.h      |   4 +-
 src/include/nodes/parsenodes.h              |   2 +
 src/test/regress/expected/publication.out   |  81 -
 src/test/regress/sql/publication.sql        |  40 ++-
 .../t/033_rep_changes_except_tabl
Re: Wrong rows count in EXPLAIN
Пантюшин Александр Иванович writes:
> When I create a new table, and then I evaluate the execution of the SELECT
> query, I see a strange rows count in EXPLAIN
> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
> ANALYZE test1;
> EXPLAIN SELECT * FROM test1;
>                        QUERY PLAN
> ---------------------------------------------------------
>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
> (1 row)
>
> Table is empty but rows=2550.

This is intentional, arising from the planner's unwillingness to assume that a table is empty. It assumes that such a table actually contains (from memory) 10 pages, and then backs into a rowcount estimate from that depending on the data-type-dependent width of the table rows.

Without this provision, we'd produce very bad plans for cases where a newly-populated table hasn't been analyzed yet.

regards, tom lane
Re: Wrong rows count in EXPLAIN
On Wed, Apr 27, 2022 at 09:44:21AM -0400, Tom Lane wrote:
> Пантюшин Александр Иванович writes:
>> When I create a new table, and then I evaluate the execution of the SELECT
>> query, I see a strange rows count in EXPLAIN
>> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
>> ANALYZE test1;
>> EXPLAIN SELECT * FROM test1;
>>                        QUERY PLAN
>> ---------------------------------------------------------
>>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
>> (1 row)
>
>> Table is empty but rows=2550.
>
> This is intentional, arising from the planner's unwillingness to
> assume that a table is empty. It assumes that such a table actually
> contains (from memory) 10 pages, and then backs into a rowcount
> estimate from that depending on the data-type-dependent width of
> the table rows.
>
> Without this provision, we'd produce very bad plans for cases
> where a newly-populated table hasn't been analyzed yet.

We could have a notice mode that warns when a table without statistics is used.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson
cirrus: run macos with COPY_PARSE_PLAN_TREES etc
fork: <20220325000933.vgazz7pjk2ytj...@alap3.anarazel.de>

On Thu, Mar 24, 2022 at 05:09:33PM -0700, Andres Freund wrote:
> On 2022-03-24 18:51:30 -0400, Andrew Dunstan wrote:
> > I wonder if we should add these compile flags to the cfbot's setup?
>
> Yes, I think we should. There's a bit of discussion of that in and below
> https://postgr.es/m/20220213051937.GO31460%40telsasoft.com - that veered a bit
> off course, so I haven't done anything about it yet. Perhaps one build with
> COPY_PARSE_PLAN_TREES and RAW_EXPRESSION_COVERAGE_TEST, another with
> WRITE_READ_PARSE_PLAN_TREES? We should add the slower ones to the macos build,
> that's plenty fast, and I'm intending to slow the linux test by using ubsan,
> which works better on linux.

Why would you put them on different tasks? To avoid slowing down one task too much? That doesn't seem to be an issue, at least for those three defines. What about adding RELCACHE_FORCE_RELEASE, too? Even with that, macos is only ~1min slower. https://cirrus-ci.com/task/5456727205216256

commit 53480b8db63b5cd2476142e28ed3f9fe8480f9f3
Author: Justin Pryzby
Date:   Thu Apr 14 06:27:07 2022 -0500

    cirrus/macos: enable various runtime checks

    cirrus CI can take a while to be scheduled on macos, but the instance
    always has many cores, so this is a good platform to enable options
    which will slow it down.

    See:
    https://www.postgresql.org/message-id/20211217193159.pwrelhiyx7kev...@alap3.anarazel.de
    https://www.postgresql.org/message-id/20211213211223.vkgg3wwiss2tragj%40alap3.anarazel.de
    https://www.postgresql.org/message-id/CAH2-WzmevBhKNEtqX3N-Tkb0gVBHH62C0KfeTxXzqYES_PiFiA%40mail.gmail.com
    https://www.postgresql.org/message-id/20220325000933.vgazz7pjk2ytj...@alap3.anarazel.de

    ci-os-only: macos

diff --git a/.cirrus.yml b/.cirrus.yml
index e0264929c74..4a655fc 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -337,6 +337,7 @@ task:
       CLANG="ccache ${brewpath}/llvm/bin/ccache" \
       CFLAGS="-Og -ggdb" \
       CXXFLAGS="-Og -ggdb" \
+      CPPFLAGS="-DRELCACHE_FORCE_RELEASE -DCOPY_PARSE_PLAN_TREES -DWRITE_READ_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST" \
       \
       LLVM_CONFIG=${brewpath}/llvm/bin/llvm-config \
       PYTHON=python3
AW: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote:
>> I searched a bit and apparently some people are using this function
>> directly opening some dll, which seems wrong.
> I was wondering about this whole business, and the manifest approach is a
> *horrible* design for an API where the goal is to know if your run-time
> environment is greater than a given threshold.

Agreed for the use case at hand, where you want to use one core API function or another depending on the OS version. One blog post from Microsoft, as I remember, said that one reason for the change was the increase of false installation error messages ("Install Error - Your system does not meet the minimum supported operating system and service pack level.") where the software in question was written for Windows XP and the user tried to install it on, say, Windows 8. That is just a developer error, where the developers forgot to anticipate future OS versions and, instead of checking for at least a given version, were checking for version equality against all Windows versions known at design time. Since you can develop only against OS APIs known at design time, and Microsoft claims to be pretty good at maintaining backward-compatible facades for their APIs, there is some reason in that decision (to only see the versions and APIs you told the OS about via the manifest, i.e., the ones you knew about at compile time).

The core problem at hand is that MS broke the promise of backward compatibility, since the function in question works differently depending on the Windows version, although with the above reasoning we should get the exact same behavior on Windows 10 as on Windows 8.1 (as PostgreSQL, per default, only claims to know about Windows 8.1 features). That said, I can understand the design decision. Personally, I still don't like it a bit, since developers should be allowed to make some stupid mistakes.

>>> Another idea on windows machines would be to use the commandline to
>>> execute ver in a separate process and store the result in a file.
>>
>> That also seems hackish, I don't think that we want to rely on
>> something like that.
>
> Hmm. That depends on the dependency set, I guess. We do that on Linux to
> some extent for large pages in sysv_shmem.c. Perhaps this could work for
> Win10 if this avoids the extra loopholes with the manifests.

I used the following hack to get the "real" major and minor version of Windows. It's in C# (.NET) and needs to be adjusted (you can compile as x64 and use a long long as the return value) to return the service number too, and translated into C. I share it anyway, as it might help - please be kind, as it really is a little hack.

Situation: the main application can't or is not willing to add a manifest file into its resources.

Solution: start a small executable (which has a manifest file compiled into its resources), let it read the OS version, and encode the version into its return code. CInt32 is basically an integer redefinition where one can access the lower and higher Int16 separately.
The main program eventually calls this (locate the executable, adjust the process startup to be minimal, call the executable as a separate process, and interpret the return value as a version):

private static Version ReadModernOsVersionInternal()
{
    String codeBase = Assembly.GetExecutingAssembly().CodeBase;
    Uri uri = new Uri(codeBase);
    String localPath = uri.LocalPath;
    String pathDirectory = Path.GetDirectoryName(localPath);
    if (pathDirectory != null)
    {
        String fullCombinePath = Path.Combine(pathDirectory, "Cf.Utilities.ReadModernOSVersion");
        ProcessStartInfo processInfo = new ProcessStartInfo
        {
            FileName = fullCombinePath,
            CreateNoWindow = true,
            UseShellExecute = false
        };
        Process process = new Process { StartInfo = processInfo };
        process.Start();
        // Note: the original used TimeSpan.FromSeconds(1).Milliseconds here,
        // which is 0; TotalMilliseconds gives the intended 1-second timeout.
        if (process.WaitForExit((Int32) TimeSpan.FromSeconds(1).TotalMilliseconds))
        {
            CInt32 versionInteger = process.ExitCode;
            return new Version(versionInteger.HighValue, versionInteger.LowValue);
        }
    }
    return new Version();
}

The small version-check executable:

static Int32 Main(String[] args)
{
    return OsVersionErmittler.ErmittleOsVersion();
}

and

static class OsVersionErmittler
{
    /// <summary>
    /// Determines the OS version and returns it as high and low word.
    /// </summary>
    public static CInt32 ErmittleOsVersion()
    {
        OperatingSystem version = Environment.OSVersion;
        if (version.Platform == PlatformID.Win32NT && version.Version >= new Version(6, 3))
        {
            return new CInt32((Int16) version.Version.Major, (Int16) version.Version.Minor);
        }
        return 0;
    }
}

The shortened manifest of the small executable:
remove redundant check of item pointer
In ItemPointerEquals(), ItemPointerGetBlockNumber() already checks that the ItemPointer is valid; there is no need to check it again when fetching the offset, so use ItemPointerGetOffsetNumberNoCheck() instead.

--
Regards
Junwang Zhao

v1-0001-remove-redundant-check-of-item-pointer.patch
Re: remove redundant check of item pointer
Junwang Zhao writes: > In function ItemPointerEquals, the ItemPointerGetBlockNumber > already checked the ItemPointer if valid, there is no need > to check it again in ItemPointerGetOffset, so use > ItemPointerGetOffsetNumberNoCheck instead. I do not think this change is worth making. The point of ItemPointerGetOffsetNumberNoCheck is not to save some cycles, it's to be able to fetch the offset field in cases where it might validly be zero. The assertion will be compiled out anyway in production builds --- and even in assert-enabled builds, I'd kind of expect the compiler to optimize away the duplicated tests. regards, tom lane
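For context, Assert() is defined approximately like this in c.h (paraphrased, not verbatim), which is why the duplicate check vanishes entirely in production builds:

    /* Approximate shape of the definitions in c.h: */
    #ifdef USE_ASSERT_CHECKING
    #define Assert(condition) \
        do { \
            if (!(condition)) \
                ExceptionalCondition(#condition, "FailedAssertion", \
                                     __FILE__, __LINE__); \
        } while (0)
    #else
    #define Assert(condition)   ((void) true)   /* compiled out entirely */
    #endif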
Re: Wrong rows count in EXPLAIN
Hi,

> Which Postgres version do you use?

I checked this on PG 11:

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 11.5 on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 8.1.0, 64-bit
(1 row)

and on PG 13:

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 13.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, 64-bit
(1 row)

Both versions show the same strange row counting in EXPLAIN.

From: Andrey Borodin
Sent: 27 April 2022 13:08:22
To: Пантюшин Александр Иванович
Cc: pgsql-hack...@postgresql.org; Тарасов Георгий Витальевич
Subject: Re: Wrong rows count in EXPLAIN

[quoted message trimmed; see Andrey Borodin's reply earlier in the thread]
Unstable tests for recovery conflict handling
[ starting a new thread cuz the shared-stats one is way too long ] Andres Freund writes: > Add minimal tests for recovery conflict handling. It's been kind of hidden by other buildfarm noise, but 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. Three of those failures look like [11:08:46.806](105.129s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/33EF190 on primary [12:01:49.614](3182.807s) # poll_query_until timed out executing this query: # SELECT '0/33EF190' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('standby', 'walreceiver') # expecting this output: # t # last actual query output: # f # with stderr: timed out waiting for catchup at t/031_recovery_conflict.pl line 123. In each of these examples we can see in the standby's log that it detected the expected buffer pin conflict: 2022-04-27 11:08:46.353 UTC [1961604][client backend][2/2:0] LOG: statement: BEGIN; 2022-04-27 11:08:46.416 UTC [1961604][client backend][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1; 2022-04-27 11:08:46.730 UTC [1961604][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor; 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] LOG: recovery still waiting after 13.367 ms: recovery conflict on buffer pin 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] CONTEXT: WAL redo at 0/33E6E80 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 but then nothing happens until the script times out and kills the test. I think this is showing us a real bug, ie we sometimes fail to cancel the conflicting query. The other example [3] looks different: [01:02:43.582](2.357s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/342C000 on primary done [01:02:43.747](0.165s) ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict [01:02:43.804](0.057s) not ok 3 - buffer pin conflict: stats show conflict on standby [01:02:43.805](0.000s) [01:02:43.805](0.000s) # Failed test 'buffer pin conflict: stats show conflict on standby' # at t/031_recovery_conflict.pl line 295. [01:02:43.805](0.000s) # got: '0' # expected: '1' Not sure what to make of that --- could there be a race condition in the reporting of the conflict? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-27%2007%3A16%3A51 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-04-21%2021%3A20%3A15 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-04-13%2022%3A45%3A30 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-11%2005%3A40%3A41
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Wed, Apr 27, 2022 at 05:13:12PM +0900, Michael Paquier wrote: > On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote: > > so I'm still on the opinion that we should > > unconditionally use the FILE_MAP_LARGE_PAGES flag if it's defined and call > > it a > > day. > > Are we sure that this is not going to cause failures in environments > where the flag is not supported? I don't know for sure as I have no way to test, but it would be very lame for an OS to provide a #define explicitly intended for one use case if that use case can't handle that flag yet.
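Concretely, the proposal amounts to something like the following in the Windows shared-memory code; this is a sketch under that assumption, not a tested patch, and huge_pages_active is a placeholder name:

    /*
     * Sketch of the proposal above, not a tested patch: pass the flag
     * only when the SDK we build against defines it, and only when huge
     * pages are in use.
     */
    DWORD       flags = FILE_MAP_ALL_ACCESS;

    #ifdef FILE_MAP_LARGE_PAGES
        if (huge_pages_active)      /* hypothetical flag for this sketch */
            flags |= FILE_MAP_LARGE_PAGES;
    #endif

    memAddress = MapViewOfFileEx(hmap, flags, 0, 0, 0, NULL);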
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Wed, Apr 27, 2022 at 03:04:23PM +, Wilm Hoyer wrote: > > I used the following hack to get the "real" Major and Minor Version of > Windows - it's in C# (.Net) and needs to be adjusted (you can compile as x64 > and use a long-long as return value ) to return the Service Number too and > translated it into C. > I share it anyways, as it might help - please be kind, as it really is a > little hack. > > Situation: > Main Application can't or is not willing to add a manifest file into its > resources. > > Solution: > Start a small executable (which has a manifest file compiled into its > resources), let it read the OS Version and code the Version into its return > Code. Thanks for sharing. Having to compile another tool just for that seems like a very high price to pay, especially since we don't have any C# code in the tree. I'm not even sure that compiling this wouldn't need additional requirements and/or if it would work on our oldest supported Windows versions.
Re: Unstable tests for recovery conflict handling
> On Apr 27, 2022, at 9:45 AM, Tom Lane wrote: > > [ starting a new thread cuz the shared-stats one is way too long ] > > Andres Freund writes: >> Add minimal tests for recovery conflict handling. > > It's been kind of hidden by other buildfarm noise, but > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > Three of those failures look like Interesting, I have been getting failures on REL_14_STABLE: t/012_subtransactions.pl . 11/12 # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' # at t/012_subtransactions.pl line 211. # got: '3' # expected: '0' t/012_subtransactions.pl . 12/12 # Looks like you failed 1 test of 12. t/012_subtransactions.pl . Dubious, test returned 1 (wstat 256, 0x100) Failed 1/12 subtests And the logs, tmp_check/log/regress_log_012_subtransactions, showing: ### Enabling streaming replication for node "primary" ### Starting node "primary" # Running: pg_ctl -D /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/t_012_subtransactions_primary_data/pgdata -l /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/log/012_subtransactions_primary.log -o --cluster-name=primary start waiting for server to start done server started # Postmaster PID for node "primary" is 46270 psql::1: ERROR: prepared transaction with identifier "xact_012_1" does not exist not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' # at t/012_subtransactions.pl line 211. # got: '3' # expected: '0' This is quite consistent for me, but only when I configure with --enable-coverage and --enable-dtrace. (I haven't yet tried one of those without the other.) I wasn't going to report this yet, having not yet completely narrowed this down, but I wonder if anybody else is seeing this? I'll try again on master — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Add contrib/pg_walinspect.
On Wed, 2022-04-27 at 13:47 +0530, Bharath Rupireddy wrote: > I found an easy way to reproduce this consistently (I think on any > server): > > I basically generated huge WAL record (I used a fun extension that I > wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can > use pg_logical_emit_message as well) Thank you Bharath for creating the extension and the simple test case. Thomas's patch solves the issue for me as well. Tom, the debug patch you posted[0] seems to be setting the error message if it's not already set. Thomas's patch uses the lack of a message as a signal that we've reached the end of WAL. That explains why you are still seeing the problem. Obviously, that's a sign that Thomas's patch is not the cleanest solution. But other approaches would be more invasive. I guess the question is whether that's a good enough solution for now, and hopefully we could improve the API later; or whether we need to come up with something better. When reviewing, I considered the inability to read old WAL and the inability to read flushed-in-the-middle-of-a-record WAL as similar kinds of errors that the user would need to deal with. But they are different: the former can be avoided by creating a slot; the latter can't be easily avoided, only retried. Depending on the intended use cases, forcing the user to retry might be reasonable, in which case we could consider this a test problem rather than a real problem, and we might be able to do something simpler to just stabilize the test. Regards, Jeff Davis [0] https://postgr.es/m/295868.1651024...@sss.pgh.pa.us
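In other words, with Thomas's patch a caller distinguishes the two cases roughly like this (a paraphrase of the approach, not the exact diff):

    /* Paraphrase of the approach in Thomas's patch, not the exact diff: */
    record = XLogReadRecord(xlogreader, &errormsg);
    if (record == NULL)
    {
        if (errormsg)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read WAL at %X/%X: %s",
                            LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
                            errormsg)));
        break;      /* no message: taken to mean end of available WAL */
    }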
Re: Unstable tests for recovery conflict handling
> On Apr 27, 2022, at 10:11 AM, Mark Dilger > wrote: > > I'll try again on master Still with coverage and dtrace enabled, I get the same thing, except that master formats the logs a bit differently: # Postmaster PID for node "primary" is 19797 psql::1: ERROR: prepared transaction with identifier "xact_012_1" does not exist [10:26:16.314](1.215s) not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby [10:26:16.314](0.000s) [10:26:16.314](0.000s) # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' [10:26:16.314](0.000s) # at t/012_subtransactions.pl line 208. [10:26:16.314](0.000s) # got: '3' # expected: '0' With coverage but not dtrace enabled, I still get the error, though the log leading up to the error now has a bunch of coverage noise lines like: profiling: /Users/mark.dilger/recovery_test/postgresql/src/backend/utils/sort/tuplesort.gcda: cannot merge previous GCDA file: corrupt arc tag The error itself looks the same except the timing numbers differ a little. With neither enabled, all tests pass. I'm inclined to think that either the recovery code or the test have a race condition, and that enabling coverage causes the race to come out differently. I'll keep poking — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Unstable tests for recovery conflict handling
I wrote: > It's been kind of hidden by other buildfarm noise, but > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > ... > I think this is showing us a real bug, ie we sometimes fail to cancel > the conflicting query. After digging around in the code, I think this is almost certainly some manifestation of the previously-complained-of problem [1] that RecoveryConflictInterrupt is not safe to call in a signal handler, leading the conflicting backend to sometimes decide that it's not the problem. That squares with the observation that skink is more prone to show this than other animals: you'd have to get the SIGUSR1 while the target backend isn't idle, so a very slow machine ought to show it more. We don't seem to have that issue on the open items list, but I'll go add it. Not sure if the 'buffer pin conflict: stats show conflict on standby' failure could trace to a similar cause. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
Re: Possible corruption by CreateRestartPoint at promotion
On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
> On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
>> On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
>>> +	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>>> +	ControlFile->minRecoveryPointTLI = 0;
>>> +
>>> +	/* also update local copy */
>>> +	LocalMinRecoveryPoint = InvalidXLogRecPtr;
>>> +	LocalMinRecoveryPointTLI = 0;
>>
>> Should this be handled by the code that changes the control file state to
>> DB_IN_PRODUCTION instead? It looks like this is ordinarily done in the
>> next checkpoint. It's not clear to me why it is done this way.
>
> Anyway, that would be the work of the end-of-recovery checkpoint
> requested at the end of StartupXLOG() once a promotion happens, or of
> the checkpoint requested by PerformRecoveryXLogAction() in the second
> case, no? So, I don't quite see why we need to update
> minRecoveryPoint and minRecoveryPointTLI in the control file here, as
> much as this does not have to be part of the end-of-recovery code
> that switches the control file to DB_IN_PRODUCTION.

+1. We probably don't need to reset minRecoveryPoint here.

> -	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> -		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> -	{
>
> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
> checkpointer, meaning that there is one less concurrent race to worry
> about, but I have to admit that this change, to update the control
> file's checkPoint and checkPointCopy even if we don't check that
> ControlFile->checkPointCopy.redo < lastCheckPoint.redo, would make the
> code less robust in ~14. So I am questioning whether a backpatch
> is actually worth the risk here.

IMO we should still check this before updating ControlFile to be safe.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: avoid multiple hard links to same WAL file after a crash
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote: > I am not sure that have any need to backpatch this change based on the > unlikeliness of the problem, TBH. One thing that is itching me a bit, > like Robert upthread, is that we don't check anymore that the newfile > does not exist in the code paths because we never expect one. It is > possible to use stat() for that. But access() within a simple > assertion would be simpler? Say something like: > Assert(access(path, F_OK) != 0 && errno == ENOENT); > > The case for basic_archive is limited as the comment of the patch > states, but that would be helpful for the two calls in timeline.c and > the one in xlog.c in the long-term. And this has no need to be part > of fd.c, this can be added before the durable_rename() calls. What do > you think? Here is a new patch set with these assertions added. I think at least the xlog.c change ought to be back-patched. The problem may be unlikely, but AFAICT the possible consequences include WAL corruption. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 8ffc337621f8a287350a7a55256b58b0585f7a1f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 26 Apr 2022 11:56:50 -0700 Subject: [PATCH v4 1/2] Replace calls to durable_rename_excl() with durable_rename(). durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), but it falls back to rename() on some platforms (e.g., Windows), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one. basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. My testing demonstrated that it was possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting. This seems likely to lead to WAL corruption. This change replaces all calls to durable_rename_excl() with durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it, and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed in v16 via a follow-up commit. Back-patch to all supported versions. Before v13, durable_rename_excl() was named durable_link_or_rename(). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13 --- contrib/basic_archive/basic_archive.c | 5 +++-- src/backend/access/transam/timeline.c | 16 src/backend/access/transam/xlog.c | 9 +++-- 3 files changed, 10 insertions(+), 20 deletions(-) diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index e7efbfb9c3..ed33854c57 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path) /* * Sync the temporary file to disk and move it to its final destination. - * This will fail if destination already exists. 
+ * Note that this will overwrite any existing file, but this is only + * possible if someone else created the file since the stat() above. */ - (void) durable_rename_excl(temp, destination, ERROR); + (void) durable_rename(temp, destination, ERROR); ereport(DEBUG1, (errmsg("archived \"%s\" via basic_archive", file))); diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index be21968293..f3a8e53aa4 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, newTLI); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + Assert(access(path, F_OK) != 0 && errno == ENOENT); + durable_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -519,12 +515,8 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, tli); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existi
Re: [RFC] building postgres with meson -v8
Here is a patch that adds in NLS. There are some opportunities to improve this. For example, we could move the list of languages from the meson.build files into separate LINGUAS files, which could be shared with the makefile-based build system. I need to research this a bit more. Also, this only covers the build and install phases of the NLS process. The xgettext and msgmerge aspects I haven't touched at all. There is more to research there as well. The annoying thing is that the i18n module doesn't appear to have a way to communicate with feature options or dependencies, so there isn't a way to tell it to only do its things when some option is enabled, or conversely to check whether the module found the things it needs and to enable or disable an option based on that. So right now for example if you explicitly disable the 'nls' option, the binaries are built without NLS but the .mo files are still built and installed. In any case, this works for the main use cases and gets us a step forward, so it's worth considering.From ccfbff1beed568ef3ebe2e1af0701802d96b9017 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 27 Apr 2022 17:57:04 +0200 Subject: [PATCH 7/7] meson: NLS support --- meson.build| 23 +++--- meson_options.txt | 3 +++ src/backend/meson.build| 2 ++ src/backend/po/meson.build | 3 +++ src/bin/initdb/meson.build | 2 ++ src/bin/initdb/po/meson.build | 3 +++ src/bin/pg_amcheck/meson.build | 2 ++ src/bin/pg_amcheck/po/meson.build | 3 +++ src/bin/pg_archivecleanup/meson.build | 2 ++ src/bin/pg_archivecleanup/po/meson.build | 3 +++ src/bin/pg_basebackup/meson.build | 2 ++ src/bin/pg_basebackup/po/meson.build | 3 +++ src/bin/pg_checksums/meson.build | 2 ++ src/bin/pg_checksums/po/meson.build| 3 +++ src/bin/pg_config/meson.build | 2 ++ src/bin/pg_config/po/meson.build | 3 +++ src/bin/pg_controldata/meson.build | 2 ++ src/bin/pg_controldata/po/meson.build | 3 +++ src/bin/pg_ctl/meson.build | 2 ++ src/bin/pg_ctl/po/meson.build | 3 +++ src/bin/pg_dump/meson.build| 2 ++ src/bin/pg_dump/po/meson.build | 3 +++ src/bin/pg_resetwal/meson.build| 2 ++ src/bin/pg_resetwal/po/meson.build | 3 +++ src/bin/pg_rewind/meson.build | 2 ++ src/bin/pg_rewind/po/meson.build | 3 +++ src/bin/pg_test_fsync/meson.build | 2 ++ src/bin/pg_test_fsync/po/meson.build | 3 +++ src/bin/pg_test_timing/meson.build | 2 ++ src/bin/pg_test_timing/po/meson.build | 3 +++ src/bin/pg_upgrade/meson.build | 2 ++ src/bin/pg_upgrade/po/meson.build | 3 +++ src/bin/pg_verifybackup/meson.build| 2 ++ src/bin/pg_verifybackup/po/meson.build | 3 +++ src/bin/pg_waldump/meson.build | 2 ++ src/bin/pg_waldump/po/meson.build | 3 +++ src/bin/psql/meson.build | 2 ++ src/bin/psql/po/meson.build| 3 +++ src/bin/scripts/meson.build| 2 ++ src/bin/scripts/po/meson.build | 3 +++ src/interfaces/ecpg/ecpglib/meson.build| 2 ++ src/interfaces/ecpg/ecpglib/po/meson.build | 3 +++ src/interfaces/ecpg/preproc/meson.build| 2 ++ src/interfaces/ecpg/preproc/po/meson.build | 3 +++ src/interfaces/libpq/meson.build | 5 - src/interfaces/libpq/po/meson.build| 3 +++ src/pl/plperl/meson.build | 2 ++ src/pl/plperl/po/meson.build | 3 +++ src/pl/plpgsql/src/meson.build | 2 ++ src/pl/plpgsql/src/po/meson.build | 3 +++ src/pl/plpython/meson.build| 2 ++ src/pl/plpython/po/meson.build | 3 +++ src/pl/tcl/meson.build | 2 ++ src/pl/tcl/po/meson.build | 3 +++ 54 files changed, 155 insertions(+), 4 deletions(-) create mode 100644 src/backend/po/meson.build create mode 100644 src/bin/initdb/po/meson.build create mode 100644 src/bin/pg_amcheck/po/meson.build create mode 100644 
src/bin/pg_archivecleanup/po/meson.build create mode 100644 src/bin/pg_basebackup/po/meson.build create mode 100644 src/bin/pg_checksums/po/meson.build create mode 100644 src/bin/pg_config/po/meson.build create mode 100644 src/bin/pg_controldata/po/meson.build create mode 100644 src/bin/pg_ctl/po/meson.build create mode 100644 src/bin/pg_dump/po/meson.build create mode 100644 src/bin/pg_resetwal/po/meson.build create mode 100644 src/bin/pg_rewind/po/meson.build create mode 100644 src/bin/pg_test_fsync/po/meson.build create mode 100644 src/bin/pg_test_timing/po/meson.build create mode 100644 src/bin/pg_upgrade/po/meson.build create mode 100644 src/bin/pg_verifyb
Re: bogus: logical replication rows/cols combinations
Hi, so I've been looking at tweaking the code so that the behavior matches Alvaro's expectations. It passes check-world, but I'm not claiming it's anywhere near committable - the purpose is mostly to give a better idea of how invasive the change is etc.

As described earlier, this abandons the idea of building a single OR expression from all the row filters (per action), and replaces that with a list of per-publication info (struct PublicationInfo), combining info about both row filters and column lists. This means we can't initialize the row filters and column lists separately; they have to be initialized at the same time. So pgoutput_row_filter_init was modified to initialize both, and pgoutput_column_list_init was removed. With this info, we can calculate column lists only for publications with matching row filters, which is what the modified pgoutput_row_filter does (the calculated column list is returned through a parameter).

This however does not remove the 'columns' from RelationSyncEntry entirely. We still need that "superset" column list when sending schema. Imagine two publications, one replicating (a,b) and the other (a,c), maybe depending on row filter. send_relation_and_attrs() needs to send info about all three attributes (a,b,c), i.e. about any attribute that might end up being replicated.

We might try to be smarter and send the exact schema needed by the next operation, i.e. when inserting (a,b) we'd make sure the last schema we sent was (a,b) and invalidate/resend it otherwise. But that might easily result in "thrashing" where we send the schema and the next operation invalidates it right away because it needs a different schema.

But there's another reason to do it like this - it seems desirable to actually reset columns that don't match the calculated column list. Using Alvaro's example, it seems reasonable to expect these two transactions to produce the same result on the subscriber: 1) insert (a,b) + update to (a,c) insert into uno values (1, 2, 3); update uno set a = -1 where a = 1; 2) insert (a,c) insert into uno values (-1, 2, 3); But to do this, the update actually needs to send (-1,NULL,3). So in this case we'll have the (a,b,c) column list in RelationSyncEntry, and only attributes on this list will be sent as part of schema. And for DML actions we'll calculate either (a,b) or (a,c) depending on the row filter, and missing attributes will be replicated as NULL.

I haven't done any tests of how this affects performance, but I have a couple of thoughts regarding that: a) I kinda doubt the optimizations would really matter in practice, because how likely is it that one relation is in many publications (in the same subscription)? b) Did anyone actually do some benchmarks that I could repeat, to see how much worse this is? c) AFAICS we could optimize this in at least some common cases. For example we could combine the entries with matching row filters, and/or column filters.
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index ff8513e2d2..41c5e3413f 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -33,7 +33,9 @@ static void logicalrep_write_attrs(StringInfo out, Relation rel, Bitmapset *columns); static void logicalrep_write_tuple(StringInfo out, Relation rel, TupleTableSlot *slot, - bool binary, Bitmapset *columns); + bool binary, + Bitmapset *schema_columns, + Bitmapset *columns); static void logicalrep_read_attrs(StringInfo in, LogicalRepRelation *rel); static void logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple); @@ -412,7 +414,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn) */ void logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel, - TupleTableSlot *newslot, bool binary, Bitmapset *columns) + TupleTableSlot *newslot, bool binary, + Bitmapset *schema_columns, Bitmapset *columns) { pq_sendbyte(out, LOGICAL_REP_MSG_INSERT); @@ -424,7 +427,8 @@ logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel, pq_sendint32(out, RelationGetRelid(rel)); pq_sendbyte(out, 'N'); /* new tuple follows */ - logicalrep_write_tuple(out, rel, newslot, binary, columns); + logicalrep_write_tuple(out, rel, newslot, binary, + schema_columns, columns); } /* @@ -457,7 +461,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup) void logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel, TupleTableSlot *oldslot, TupleTableSlot *newslot, - bool binary, Bitmapset *columns) + bool binary, Bitmapset *schema_columns, + Bitmapset *columns) { pq_sendbyte(out, LOGICAL_REP_MSG_UPDATE); @@ -478,11 +483,12 @@ logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel, pq_sendbyte(out, 'O')
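For readers following the (a,b) and (a,c) example above, a hypothetical pair of publications using the PostgreSQL 15 column-list and row-filter syntax on Alvaro's table uno(a, b, c) might look like this (the predicates and object names are invented for illustration):

CREATE PUBLICATION pub_ab FOR TABLE uno (a, b) WHERE (a > 0);
CREATE PUBLICATION pub_ac FOR TABLE uno (a, c) WHERE (a < 0);

-- One subscription consuming both publications:
CREATE SUBSCRIPTION sub_uno CONNECTION 'dbname=postgres'
    PUBLICATION pub_ab, pub_ac;

With the patch sketched above, the schema sent for uno would cover (a,b,c), and each replicated row would carry NULL for whichever attributes fall outside the column list of the publication whose row filter matched.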
Multi-Master Logical Replication
MULTI-MASTER LOGICAL REPLICATION

1.0 BACKGROUND Let’s assume that a user wishes to set up a multi-master environment so that a set of PostgreSQL instances (nodes) use logical replication to share tables with every other node in the set. We define this as a multi-master logical replication (MMLR) node-set.

1.1 ADVANTAGES OF MMLR - Increases write scalability (e.g., all nodes can write arbitrary data). - Allows load balancing - Allows rolling updates of nodes (e.g., logical replication works between different major versions of PostgreSQL). - Improves the availability of the system (e.g., no single point of failure) - Improves performance (e.g., lower latencies for geographically local nodes)

2.0 MMLR AND POSTGRESQL It is already possible to configure a kind of MMLR set in PostgreSQL 15 using PUB/SUB, but it is very restrictive because it can only work when no two nodes operate on the same table. This is because when two nodes try to share the same table, a circular recursive problem arises: Node1 replicates data to Node2, which is then replicated back to Node1, and so on. To prevent the circular recursive problem Vignesh is developing a patch [1] that introduces new SUBSCRIPTION options "local_only" (for publishing only data originating at the publisher node) and "copy_data=force". Using this patch, we have created a script [2] demonstrating how to set up all the above multi-node examples. An overview of the necessary steps is given in the next section.

2.1 STEPS – Adding a new node N to an existing node-set step 1. Prerequisites – Apply Vignesh’s patch [1]. All nodes in the set must be visible to each other by a known CONNECTION. All shared tables must already be defined on all nodes. step 2. On node N do CREATE PUBLICATION pub_N FOR ALL TABLES step 3. All other nodes then CREATE SUBSCRIPTION to PUBLICATION pub_N with "local_only=on, copy_data=on" (this will replicate initial data from the node N tables to every other node). step 4. On node N, temporarily ALTER PUBLICATION pub_N to prevent replication of 'truncate', then TRUNCATE all tables of node N, then re-allow replication of 'truncate'. step 5. On node N do CREATE SUBSCRIPTION to the publications of all other nodes in the set 5a. Specify "local_only=on, copy_data=force" for exactly one of the subscriptions (this will make the node N tables now have the same data as the other nodes) 5b. Specify "local_only=on, copy_data=off" for all other subscriptions. step 6. Result - Now changes to any table on any node should be replicated to every other node in the set. (A SQL sketch of these steps appears after section 3.0 below.) Note: Steps 4 and 5 need to be done within the same transaction to avoid loss of data in case of some command failure. (Because we can't perform create subscription in a transaction, we need to create the subscription in a disabled mode first and then enable it in the transaction).

2.2 DIFFICULTIES Notice that it becomes increasingly complex to configure MMLR manually as the number of nodes in the set increases. There are also some difficulties such as - dealing with initial table data - coordinating the timing to avoid concurrent updates - getting the SUBSCRIPTION options for copy_data exactly right.

3.0 PROPOSAL To make the MMLR setup simpler, we propose to create a new API that will hide all the step details and remove from the user the burden of getting them right without mistakes.
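For illustration, the procedure of section 2.1 amounts to SQL along the following lines for a new node N joining a set that already contains node A. The object names and connection strings are invented, and the "local_only" and "copy_data = force" option syntax comes from the in-development patch [1], so it may change:

-- Step 2, on node N:
CREATE PUBLICATION pub_N FOR ALL TABLES;

-- Step 3, on each existing node (node A shown):
CREATE SUBSCRIPTION sub_A_N CONNECTION 'host=node_N dbname=postgres'
    PUBLICATION pub_N WITH (local_only = on, copy_data = on);

-- Step 4, on node N: suspend 'truncate' replication while clearing data:
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete');
TRUNCATE tbl1;
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete, truncate');

-- Step 5, on node N: subscribe to every other node; exactly one
-- subscription (5a) uses copy_data = force, the rest (5b) use off:
CREATE SUBSCRIPTION sub_N_A CONNECTION 'host=node_A dbname=postgres'
    PUBLICATION pub_A WITH (local_only = on, copy_data = force);

As the note in section 2.1 says, steps 4 and 5 would really be run in one transaction, with the subscriptions created disabled and enabled afterwards.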
3.1 MOTIVATION - MMLR (sharing the same tables) is not currently possible - Vignesh's patch [1] makes MMLR possible, but the manual setup is still quite difficult - An MMLR implementation can solve the timing problems (e.g., using Database Locking)

3.2 API Preferably the API would be implemented as new SQL functions in PostgreSQL core; however, implementation using a contrib module or some new SQL syntax may also be possible. The SQL functions would be like the following: - pg_mmlr_set_create = create a new set, and give it a name - pg_mmlr_node_attach = attach the current node to a specified set - pg_mmlr_node_detach = detach a specified node from a specified set - pg_mmlr_set_delete = delete a specified set For example, internally the pg_mmlr_node_attach API function would execute the equivalent of all the CREATE PUBLICATION, CREATE SUBSCRIPTION, and TRUNCATE steps described above (see the usage sketch at the end of this post). Notice this proposal has some external API similarities with the BDR extension [3] (which also provides multi-master logical replication), although we plan to implement it entirely using PostgreSQL’s PUB/SUB.

4.0 ACKNOWLEDGEMENTS The following people have contributed to this proposal – Hayato Kuroda, Vignesh C, Peter Smith, Amit Kapila.

5.0 REFERENCES [1] https://www.postgresql.org/message-id/flat/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPvY2P%3DUL-X6maMA5QxFKdcdciRRCKDH3j%3D_hO8u2OyRYg%40mail.gmail.com [3] https://www.enterprisedb.com/docs/bdr/latest/ [END] ~~~ One of my colleagues will post more detailed information later. ---
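As a rough usage sketch of the proposal: the function names below are from section 3.2, but no signatures have been proposed yet, so the argument lists are pure guesses:

-- On the first node: create and name the set.
SELECT pg_mmlr_set_create('myset');

-- On each node that should participate (connection string guessed):
SELECT pg_mmlr_node_attach('myset', 'host=node_N dbname=postgres');

-- Later, to shrink or dismantle the set:
SELECT pg_mmlr_node_detach('myset', 'node_N');
SELECT pg_mmlr_set_delete('myset');

Internally, each call would drive the CREATE PUBLICATION / CREATE SUBSCRIPTION / TRUNCATE choreography of section 2.1 with the correct copy_data settings and timing.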
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 10:22 PM Bharath Rupireddy wrote: > Here's v2 patch (up on Thomas's v1 at [1]) using private_data to set > the end of the WAL flag. Please have a look at it. I don't have a strong view on whether it's better to use a NULL error for this private communication between pg_walinspect.c and the read_page callback it installs, or install a custom private_data to signal this condition, or to give up on all this stuff completely and just wait (see below for thoughts). I'd feel better about both no-wait options if the read_page callback in question were actually in the contrib module, and not in the core code. On the other hand, I'm not too hung up about it because I'm really hoping to see the get-rid-of-all-these-callbacks-and-make-client-code-do-the-waiting scheme proposed by Horiguchi-san and Heikki developed further, to rip much of this stuff out in a future release. If you go with the private_data approach, a couple of small comments: if (record == NULL) { +ReadLocalXLogPageNoWaitPrivate *private_data; + +/* return NULL, if end of WAL is reached */ +private_data = (ReadLocalXLogPageNoWaitPrivate *) +xlogreader->private_data; + +if (private_data->end_of_wal) +return NULL; + if (errormsg) ereport(ERROR, (errcode_for_file_access(), errmsg("could not read WAL at %X/%X: %s", LSN_FORMAT_ARGS(first_record), errormsg))); -else -ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not read WAL at %X/%X", -LSN_FORMAT_ARGS(first_record; } I think you should leave that second ereport() call in, in this variant of the patch, no? I don't know if anything else raises errors with no message, but if we're still going to treat them as silent end-of-data conditions then you might as well go with the v1 patch. Another option might be to abandon this whole no-wait concept and revert 2258e76f's changes to xlogutils.c. pg_walinspect already does preliminary checks that LSNs are in range, so you can't enter a value that will wait indefinitely, and it's interruptible (it's a 1ms sleep/check loop, not my favourite programming pattern but that's pre-existing code). If you're unlucky enough to hit the case where the LSN is judged to be valid but is in the middle of a record that hasn't been totally flushed yet, it'll just be a bit slower to return as we wait for the inevitable later flush(es) to happen. The rest of your record will *surely* be flushed pretty soon (or the flushing backend panics the whole system and time ends). I don't imagine this is performance critical work, so maybe that'd be acceptable?
Re: Possible corruption by CreateRestartPoint at promotion
On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: > On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: >> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && >> - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) >> - { >> 7ff23c6 has removed the last call to CreateCheckpoint() outside the >> checkpointer, meaning that there is one less concurrent race to worry >> about, but I have to admit that this change, to update the control >> file's checkPoint and checkPointCopy even if we don't check after >> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the >> code less robust in ~14. So I am questioning whether a backpatch >> is actually worth the risk here. > > IMO we should still check this before updating ControlFile to be safe. Sure. Fine by me to play it safe. -- Michael signature.asc Description: PGP signature
trivial comment fix
Hi, While reading worker.c, I noticed that the referred SQL command was wrong. ALTER SUBSCRIPTION ... REFRESH PUBLICATION instead of ALTER TABLE ... REFRESH PUBLICATION. Trivial fix attached. -- Euler Taveira EDB https://www.enterprisedb.com/ diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 4171371296..7da7823c35 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -93,7 +93,7 @@ * ReorderBufferFinishPrepared. * * If the subscription has no tables then a two_phase tri-state PENDING is - * left unchanged. This lets the user still do an ALTER TABLE REFRESH + * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH * PUBLICATION which might otherwise be disallowed (see below). * * If ever a user needs to be aware of the tri-state value, they can fetch it
RE: Data is copied twice when specifying both child and parent table in publication
On Sun, Apr 24, 2022 2:16 PM Wang, Wei/王 威 wrote: > > Attach the new patches.[suggestions by Amit-San] > The patch for HEAD: > 1. Add a new function to get tables info by a publications array. > The patch for REL14: > 1. Use an alias to make the statement understandable. BTW, I adjusted the > alignment. > 2. Improve the test cast about the column list and row filter to cover this > bug. > Thanks for your patches. Here's a comment on the patch for REL14. + appendStringInfo(&cmd, "SELECT DISTINCT ns.nspname, c.relname\n" +" FROM pg_catalog.pg_publication_tables t\n" +" JOIN pg_catalog.pg_namespace ns\n" +" ON ns.nspname = t.schemaname\n" +" JOIN pg_catalog.pg_class c\n" +" ON c.relname = t.tablename AND c.relnamespace = ns.oid\n" +" WHERE t.pubname IN (%s)\n" +" AND (c.relispartition IS FALSE\n" +" OR NOT EXISTS\n" +"( SELECT 1 FROM pg_partition_ancestors(c.oid) as relid\n" +" WHERE relid IN\n" +"(SELECT DISTINCT (schemaname || '.' || tablename)::regclass::oid\n" +" FROM pg_catalog.pg_publication_tables t\n" +" WHERE t.pubname IN (%s))\n" +" AND relid != c.oid))\n", +pub_names.data, pub_names.data); I think we can use an alias like 'pa' for pg_partition_ancestors, and modify the SQL as follows. + appendStringInfo(&cmd, "SELECT DISTINCT ns.nspname, c.relname\n" +" FROM pg_catalog.pg_publication_tables t\n" +" JOIN pg_catalog.pg_namespace ns\n" +" ON ns.nspname = t.schemaname\n" +" JOIN pg_catalog.pg_class c\n" +" ON c.relname = t.tablename AND c.relnamespace = ns.oid\n" +" WHERE t.pubname IN (%s)\n" +" AND (c.relispartition IS FALSE\n" +" OR NOT EXISTS\n" +"( SELECT 1 FROM pg_partition_ancestors(c.oid) pa\n" +" WHERE pa.relid IN\n" +"(SELECT DISTINCT (t.schemaname || '.' || t.tablename)::regclass::oid\n" +" FROM pg_catalog.pg_publication_tables t\n" +" WHERE t.pubname IN (%s))\n" +" AND pa.relid != c.oid))\n", +pub_names.data, pub_names.data); Regards, Shi yu
Re: Unstable tests for recovery conflict handling
Hi, On 2022-04-27 10:11:53 -0700, Mark Dilger wrote: > > > > On Apr 27, 2022, at 9:45 AM, Tom Lane wrote: > > > > [ starting a new thread cuz the shared-stats one is way too long ] > > > > Andres Freund writes: > >> Add minimal tests for recovery conflict handling. > > > > It's been kind of hidden by other buildfarm noise, but > > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > > > Three of those failures look like > > Interesting, > > I have been getting failures on REL_14_STABLE: > > t/012_subtransactions.pl . 11/12 > > # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction > on promoted standby' > # at t/012_subtransactions.pl line 211. > # got: '3' > # expected: '0' > t/012_subtransactions.pl . 12/12 # Looks like you failed 1 test > of 12. > t/012_subtransactions.pl . Dubious, test returned 1 (wstat 256, > 0x100) > Failed 1/12 subtests > > And the logs, tmp_check/log/regress_log_012_subtransactions, showing: I'm a bit confused - what's the relation of that failure to this thread / the tests / this commit? Greetings, Andres Freund
Re: Unstable tests for recovery conflict handling
Hi, On 2022-04-27 14:08:45 -0400, Tom Lane wrote: > I wrote: > > It's been kind of hidden by other buildfarm noise, but > > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > ... > > I think this is showing us a real bug, ie we sometimes fail to cancel > > the conflicting query. > > After digging around in the code, I think this is almost certainly > some manifestation of the previously-complained-of problem [1] that > RecoveryConflictInterrupt is not safe to call in a signal handler, > leading the conflicting backend to sometimes decide that it's not > the problem. I think at least some may actually be because of https://postgr.es/m/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de rather than RecoveryConflictInterrupt itself. I'll go and finish up the comment bits that I still need to clean up in my bugfix and commit that... Greetings, Andres Freund
Re: remove redundant check of item pointer
got it, thanks for the explanation. On Wed, Apr 27, 2022 at 11:34 PM Tom Lane wrote: > > Junwang Zhao writes: > > In function ItemPointerEquals, the ItemPointerGetBlockNumber > > already checked the ItemPointer if valid, there is no need > > to check it again in ItemPointerGetOffset, so use > > ItemPointerGetOffsetNumberNoCheck instead. > > I do not think this change is worth making. The point of > ItemPointerGetOffsetNumberNoCheck is not to save some cycles, > it's to be able to fetch the offset field in cases where it might > validly be zero. The assertion will be compiled out anyway in > production builds --- and even in assert-enabled builds, I'd kind > of expect the compiler to optimize away the duplicated tests. > > regards, tom lane -- Regards Junwang Zhao
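For reference, the two accessors differ only in the validity assertion; their definitions in src/include/storage/itemptr.h are approximately:

#define ItemPointerGetOffsetNumberNoCheck(pointer) \
( \
	(pointer)->ip_posid \
)

#define ItemPointerGetOffsetNumber(pointer) \
( \
	AssertMacro(ItemPointerIsValid(pointer)), \
	ItemPointerGetOffsetNumberNoCheck(pointer) \
)

So in a production build the two expand to identical code, and the checked form exists to catch callers that should never see an offset of zero.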
Re: Handle infinite recursion in logical replication setup
Here are my review comments for v10-0002 (TAP tests part only)

File: src/test/subscription/t/032_localonly.pl == 1. +# Detach node C and clean the table contents. +sub detach_node_clean_table_data +{ 1a. Maybe say "Detach node C from the node-group of (A, B, C) and clean the table contents from all nodes" 1b. Actually, I wondered whether you need to TRUNCATE on both A and B (maybe only truncate 1 is OK since those nodes are still using MMC). OTOH maybe your explicit way makes the test simpler.

~~~ 2. +# Subroutine for verify the data is replicated successfully. +sub verify_data +{ 2a. Typo: "for verify" -> "to verify" 2b. The messages in this function are maybe not very appropriate. They say 'Inserted successfully without leading to infinite recursion in circular replication setup', but really the function is only testing that all the data is the same as 'expected'. So it could be the result of any operation - not just Insert.

~~~ 3. SELECT ORDER BY? $result = $node_B->safe_psql('postgres', "SELECT * FROM tab_full;"); is($result, qq(11 12 13), 'Node_C data replicated to Node_B' ); I am not sure whether these are OK like this, or if *every* SELECT should use ORDER BY to make sure the data is in the same qq expected order. There are multiple cases like this. (BTW, I think this comment applies to the v10-0001 patch, maybe not v10-0002).

~~~ 4. +### +# Specifying local_only 'on' which indicates that the publisher should only +# replicate the changes that are generated locally from node_B, but in +# this case since the node_B is also subscribing data from node_A, node_B can +# have data originated from node_A, so throw an error in this case to prevent +# node_A data being replicated to the node_C. +### There is something wrong with the description because there is no "node_C" in this test. You are just creating a 2nd subscription on node A.

~~~ 5. +($result, $stdout, $stderr) = $node_A->psql( + 'postgres', " +CREATE SUBSCRIPTION tap_sub_A3 +CONNECTION '$node_B_connstr application_name=$subname_AB' +PUBLICATION tap_pub_B +WITH (local_only = on, copy_data = on)"); +like( It seemed strange to call this 2nd subscription "tap_sub_A3". Wouldn't it be better to call it "tap_sub_A2"?

~~~ 6. +### +# Join 3rd node (node_C) to the existing 2 nodes(node_A & node_B) bidirectional +# replication setup when the existing nodes (node_A & node_B) has no data and +# the new node (node_C) some pre-existing data. +### +$node_C->safe_psql('postgres', "INSERT INTO tab_full VALUES (31);"); + +$result = $node_A->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is( $result, qq(), 'Check existing data'); + +$result = $node_B->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is( $result, qq(), 'Check existing data'); + +$result = $node_C->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is($result, qq(31), 'Check existing data'); + +create_subscription($node_A, $node_C, $subname_AC, $node_C_connstr, + 'tap_pub_C', 'copy_data = force, local_only = on'); +create_subscription($node_B, $node_C, $subname_BC, $node_C_connstr, + 'tap_pub_C', 'copy_data = force, local_only = on'); + Because node_C does not yet have any subscriptions, aren't these cases where you didn't really need to use "force"? -- Kind Regards, Peter Smith. Fujitsu Australia
Re: trivial comment fix
On Thu, Apr 28, 2022 at 7:27 AM Euler Taveira wrote: > > Hi, > > While reading worker.c, I noticed that the referred SQL command was wrong. > ALTER SUBSCRIPTION ... REFRESH PUBLICATION instead of ALTER TABLE ... REFRESH > PUBLICATION. Trivial fix attached. Pushed, thanks! -- John Naylor EDB: http://www.enterprisedb.com
Re: Possible corruption by CreateRestartPoint at promotion
At Wed, 27 Apr 2022 14:16:01 +0900, Michael Paquier wrote in > On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote: > > On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote: > >> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart > >> wrote in > >>> I suspect we'll start seeing this problem more often once end-of-recovery > >>> checkpoints are removed [0]. Would you mind creating a commitfest entry > >>> for this thread? I didn't see one. > >> > >> I'm not sure the patch makes any change here, because restart points > >> don't run while crash recovery, since no checkpoint records seen > >> during a crash recovery. Anyway the patch doesn't apply anymore so > >> rebased, but only the one for master for the lack of time for now. > > > > Thanks for the new patch! Yeah, it wouldn't affect crash recovery, but > > IIUC Robert's patch also applies to archive recovery. > > > >> + /* Update pg_control */ > >> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); > >> + > >> /* > >> * Remember the prior checkpoint's redo ptr for > >> * UpdateCheckPointDistanceEstimate() > >> */ > >> PriorRedoPtr = ControlFile->checkPointCopy.redo; > > > > nitpick: Why move the LWLockAcquire() all the way up here? > > Yeah, that should not be necessary. InitWalRecovery() is the only > place outside the checkpointer that would touch this field, but that > happens far too early in the startup sequence to matter with the > checkpointer.

Yes, it is not necessary. I just wanted to make it visually apparent that ControlFile is never accessed outside ControlFileLock. So I won't be opposed to reverting it since, as you say, it is *actually* safe.

> >> + Assert (PriorRedoPtr < RedoRecPtr); > > > > I think this could use a short explanation. > > That's just to make sure that the current redo LSN is always older > than the one prior that. It does not seem really necessary to me to > add that.

Just after that assertion we call UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr). Do we really not need any safeguard against giving a wrapped-around (in other words, tremendously large) value to the function? It doesn't seem to happen now, but I'm not confident that it never ever happens. That being said, I'm in the minority here, so I removed it.

> >> + /* > >> + * Aarchive recovery has ended. Crash recovery ever after should > >> + * always recover to the end of WAL > >> + */ > > s/Aarchive/Archive/.

Oops! Fixed.

> >> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr; > >> + ControlFile->minRecoveryPointTLI = 0; > >> + > >> + /* also update local copy */ > >> + LocalMinRecoveryPoint = InvalidXLogRecPtr; > >> + LocalMinRecoveryPointTLI = 0; > > > > Should this be handled by the code that changes the control file state to > > DB_IN_PRODUCTION instead? It looks like this is ordinarily done in the > > next checkpoint. It's not clear to me why it is done this way. > > Anyway, that would be the work of the end-of-recovery checkpoint > requested at the end of StartupXLOG() once a promotion happens or of > the checkpoint requested by PerformRecoveryXLogAction() in the second > case, no? So, I don't quite see why we need to update > minRecoveryPoint and minRecoveryPointTLI in the control file here, as > much as this does not have to be part of the end-of-recovery code > that switches the control file to DB_IN_PRODUCTION.

Eventually the work is done by StartupXLOG(), so we don't need to do that at all even in CreateCheckPoint(). But if we expect the end-of-recovery checkpoint to clear these fields, that won't happen when the last restart point takes so long that the end-of-recovery checkpoint request is ignored. If DB_IN_ARCHIVE_RECOVERY ended while the restart point was running, it is quite possible that the end-of-recovery checkpoint trigger is ignored; in that case the values are cleared at the next checkpoint. In short, if we want to reset them at the so-called end-of-recovery checkpoint, we should do that in CreateRestartPoint as well. So, it is not removed in this version.

> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && > - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) > - { > 7ff23c6 has removed the last call to CreateCheckpoint() outside the > checkpointer, meaning that there is one less concurrent race to worry > about, but I have to admit that this change, to update the control > file's checkPoint and checkPointCopy even if we don't check after > ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the > code less robust in ~14.

Sure. In the very early stage, that was the reasoning for removing the code. The reason for proposing to back-patch the same to older versions came from the further investigation, and I'm not fully confident about it for the earlier versions.

> So I am questioning whether a backpatch > is actually worth the risk here.

Agreed.

regards. -- Kyotaro Horiguchi N
Re: Possible corruption by CreateRestartPoint at promotion
At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier wrote in > On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: > > On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: > >> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && > >> - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) > >> - { > >> 7ff23c6 has removed the last call to CreateCheckpoint() outside the > >> checkpointer, meaning that there is one less concurrent race to worry > >> about, but I have to admit that this change, to update the control > >> file's checkPoint and checkPointCopy even if we don't check after > >> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the > >> code less robust in ~14. So I am questioning whether a backpatch > >> is actually worth the risk here. > > > > IMO we should still check this before updating ControlFile to be safe. > > Sure. Fine by me to play it safe. Why do we consider concurrent check/restart points here while we don't consider the same for ControlFile->checkPointCopy? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Possible corruption by CreateRestartPoint at promotion
At Wed, 27 Apr 2022 01:31:55 -0400, Tom Lane wrote in > Michael Paquier writes: > > On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote: > >> Do you have interest in adding a test like one in my patch? > > > I have studied the test case you are proposing, and I am afraid that > > it is too expensive as designed. > > That was my feeling too. It's certainly a useful test for verifying > that we fixed the problem, but that doesn't mean that it's worth the > cycles to add it as a permanent fixture in check-world, even if we > could get rid of the timing assumptions. My first feeling is the same. And I can't find a way to cause this cheaply and reliably without inserting dedicated debugging-aid code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pgsql: Add contrib/pg_walinspect.
On Thu, 2022-04-28 at 12:11 +1200, Thomas Munro wrote: > > Another option might be to abandon this whole no-wait concept and > revert 2258e76f's changes to xlogutils.c. pg_walinspect already does > preliminary checks that LSNs are in range, so you can't enter a value > that will wait indefinitely, and it's interruptible (it's a 1ms > sleep/check loop, not my favourite programming pattern but that's > pre-existing code). If you're unlucky enough to hit the case where > the LSN is judged to be valid but is in the middle of a record that > hasn't been totally flushed yet, it'll just be a bit slower to return > as we wait for the inevitable later flush(es) to happen. The rest of > your record will *surely* be flushed pretty soon (or the flushing > backend panics the whole system and time ends). I don't imagine this > is performance critical work, so maybe that'd be acceptable? I'm inclined toward this option. I was a bit iffy on those xlogutils.c changes to begin with, and they are causing a problem now, so I'd like to avoid layering on more workarounds. The time when we need to wait is very narrow, only in this case where it's earlier than the flush pointer, and the flush pointer is in the middle of a record that's not fully flushed. And as you say, we won't be waiting very long in that case, because once we start to write a WAL record it better finish soon. Bharath, thoughts? When you originally introduced the nowait behavior, I believe that was to solve the case where someone specifies an LSN range well in the future, but we can still catch that and throw an error if we see that it's beyond the flush pointer. Do you see a problem with just doing that and getting rid of the nowait changes? Regards, Jeff Davis
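The up-front check being suggested could be as small as the following sketch (not from any posted patch; the variable names are invented, though GetFlushRecPtr() and LSN_FORMAT_ARGS() are existing APIs):

/* Reject an end LSN beyond what has been flushed, instead of waiting. */
XLogRecPtr	flush_lsn = GetFlushRecPtr(NULL);

if (end_lsn > flush_lsn)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("end LSN %X/%X is ahead of the WAL flush point %X/%X",
					LSN_FORMAT_ARGS(end_lsn),
					LSN_FORMAT_ARGS(flush_lsn))));

With such a check at the function entry points, the nowait changes in xlogutils.c could then be reverted, as suggested above.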
Re: bogus: logical replication rows/cols combinations
On Thu, Apr 28, 2022 at 3:26 AM Tomas Vondra wrote: > > so I've been looking at tweaking the code so that the behavior matches > Alvaro's expectations. It passes check-world, but I'm not claiming it's > anywhere near committable - the purpose is mostly to give a better idea of > how invasive the change is etc. > I was just skimming through the patch and didn't find anything related to initial sync handling. I feel the behavior should be the same for initial sync and replication. -- With Regards, Amit Kapila.
Re: avoid multiple hard links to same WAL file after a crash
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote: > Here is a new patch set with these assertions added. I think at least the > xlog.c change ought to be back-patched. The problem may be unlikely, but > AFAICT the possible consequences include WAL corruption.

Okay, so I have applied this stuff this morning to see what the buildfarm had to say, and we have finished with a set of failures in various buildfarm members: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26

None of them liked the part where we assume that a TLI history file written by a WAL receiver should not exist beforehand, but as 025_stuck_on_old_timeline.pl is showing, a standby may attempt to retrieve a TLI history file via streaming even after it has already been fetched from the archives. I was analyzing the whole thing, and it looks like a race condition. Per the buildfarm logs, we have less than 5ms between the moment the startup process retrieves the history file of TLI 2 from the archives and the moment the WAL receiver decides to check if this TLI file exists. If it does not exist, it would then retrieve it from the primary via streaming. So I guess that the sequence of events is: - In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the existence of the history file for TLI 2, does not find it. - The startup process retrieves the file from the archives. - The WAL receiver goes through the internal loop of WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the primary's stream.

Switching from durable_rename_excl() to durable_rename() would mean that we'd overwrite what's been retrieved from the archives with the TLI file received from the primary's stream. That does not strike me as an issue in itself and that should be safe, so the comment is misleading, and we can live without the assertion in writeTimeLineHistoryFile() called by the WAL receiver. Now, I think that we'd better keep some belts in writeTimeLineHistory() called by the startup process at end-of-recovery, as we should never ever have a TLI file generated when selecting a new timeline. Perhaps this should be an elog(ERROR) at least, with a check on the file existence before calling durable_rename()? Anyway, my time is constrained next week due to the upcoming Japanese Golden Week and the buildfarm has to be stable, so I have reverted the change for now. -- Michael signature.asc Description: PGP signature
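The extra belt suggested for writeTimeLineHistory() could be as simple as this sketch (an assumption of what the check might look like, not a committed change):

/* The new timeline was just chosen; its history file cannot exist yet. */
TLHistoryFilePath(path, newTLI);
if (access(path, F_OK) == 0)
	elog(ERROR, "timeline history file \"%s\" already exists", path);
durable_rename(tmppath, path, ERROR);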
Re: Add --{no-,}bypassrls flags to createuser
Thank you for the reviews! On 2022-04-26 05:19, Nathan Bossart wrote: - printf(_(" -g, --role=ROLE new role will be a member of this role\n")); + printf(_(" -g, --role=ROLEnew role will be a member of this role\n")); This looks lik an unexpected change. I fixed it. I'm ok with -m/--member as well (like with --role only one role can be specified per switch instance so member, not membership, the later meaning, at least for me, the collective). That -m doesn't match --role-to is no worse than -g not matching --role, a short option seems worthwhile, and the -m (membership) mnemonic should be simple to pick-up. I don't see the addition of "-name" to the option name being beneficial. Yes, the standard doesn't use the "TO" prefix for "ROLE" - but taking that liberty for consistency here is very appealing and there isn't another SQL clause that it would be confused with. +1 for "member". It might not be perfect, but IMO it's the clearest option. Thanks! I changed the option "--membership" to "--member". For now, I also think "-m / --member" is the best choice, although it is ambiguous:( I'd like to hear others' opinions. regards -- Shinya Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATIONdiff --git a/doc/src/sgml/ref/createuser.sgml b/doc/src/sgml/ref/createuser.sgml index 17579e50af..189ca5bb67 100644 --- a/doc/src/sgml/ref/createuser.sgml +++ b/doc/src/sgml/ref/createuser.sgml @@ -76,6 +76,20 @@ PostgreSQL documentation + + -a role + --admin=role + + + The new role will be added immediately as a member with admin option + of this role. + Multiple roles to which new role will be added as a member with admin + option can be specified by writing multiple + -a switches. + + + + -c number --connection-limit=number @@ -204,6 +218,19 @@ PostgreSQL documentation + + -m role + --member=role + + + The new role will be added immediately as a member of this role. + Multiple roles to which new role will be added as a member + can be specified by writing multiple + -m switches. + + + + -P --pwprompt @@ -258,6 +285,17 @@ PostgreSQL documentation + + -v timestamp + --valid-until=timestamp + + +Set a timestamp after which the role's password is no longer valid. +The default is to set no expiration. + + + + -V --version @@ -290,6 +328,28 @@ PostgreSQL documentation + + --bypassrls + + +The new user will have the BYPASSRLS privilege, +which is described more fully in the documentation for . + + + + + + --no-bypassrls + + +The new user will not have the BYPASSRLS +privilege, which is described more fully in the documentation for . + + + + -? --help diff --git a/src/bin/scripts/createuser.c b/src/bin/scripts/createuser.c index bfba0d09d1..73cd1b479e 100644 --- a/src/bin/scripts/createuser.c +++ b/src/bin/scripts/createuser.c @@ -51,6 +51,11 @@ main(int argc, char *argv[]) {"connection-limit", required_argument, NULL, 'c'}, {"pwprompt", no_argument, NULL, 'P'}, {"encrypted", no_argument, NULL, 'E'}, + {"bypassrls", no_argument, NULL, 4}, + {"no-bypassrls", no_argument, NULL, 5}, + {"valid-until", required_argument, NULL, 'v'}, + {"member", required_argument, NULL, 'm'}, + {"admin", required_argument, NULL, 'a'}, {NULL, 0, NULL, 0} }; @@ -69,6 +74,9 @@ main(int argc, char *argv[]) int conn_limit = -2; /* less than minimum valid value */ bool pwprompt = false; char *newpassword = NULL; + SimpleStringList members = {NULL, NULL}; + SimpleStringList admins = {NULL, NULL}; + char *timestamp = NULL; /* Tri-valued variables. 
*/ enum trivalue createdb = TRI_DEFAULT, @@ -76,7 +84,8 @@ main(int argc, char *argv[]) createrole = TRI_DEFAULT, inherit = TRI_DEFAULT, login = TRI_DEFAULT, -replication = TRI_DEFAULT; +replication = TRI_DEFAULT, +bypassrls = TRI_DEFAULT; PQExpBufferData sql; @@ -89,7 +98,7 @@ main(int argc, char *argv[]) handle_help_version_opts(argc, argv, "createuser", help); - while ((c = getopt_long(argc, argv, "h:p:U:g:wWedDsSrRiIlLc:PE", + while ((c = getopt_long(argc, argv, "h:p:U:g:wWedDsSrRiIlLc:PEv:m:a:", long_options, &optindex)) != -1) { switch (c) @@ -165,6 +174,21 @@ main(int argc, char *argv[]) case 3: interactive = true; break; + case 4: +bypassrls = TRI_YES; +break; + case 5: +bypassrls = TRI_NO; +break; + case 'v': +timestamp = pg_strdup(optarg); +break; + case 'm': +simple_string
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Mon, Apr 25, 2022 at 1:29 PM Etsuro Fujita wrote: > I modified mark_async_capable_plan() a bit further; 1) adjusted code > in the ProjectionPath case, just for consistency with other cases, and > 2) tweaked/improved comments a bit. Attached is a new version of the > patch (“prevent-async-2.patch”). > > As mentioned before, v14 has the same issue, so I created a fix for > v14, which I’m attaching as well (“prevent-async-2-v14.patch”). In > the fix I modified is_async_capable_path() the same way as > mark_async_capable_plan() in HEAD, renaming it to > is_async_capable_plan(), and updated some comments. > > Barring objections, I’ll push/back-patch these. Done. Best regards, Etsuro Fujita
Re: Implementing Incremental View Maintenance
Hello Greg, On Sat, 23 Apr 2022 08:18:01 +0200 Greg Stark wrote: > I'm trying to figure out how to get this feature more attention. Everyone > agrees it would be a huge help but it's a scary patch to review. > > I wonder if it would be helpful to have a kind of "readers guide" > explanation of the patches to help a reviewer understand what the point of > each patch is and how the whole system works? I think Andres and Robert > have both taken that approach before with big patches and it really helped > imho. Thank you very much for your suggestion! Following your advice, I am going to write a readers guide referring to the past posts of Andres and Robert. Regards, Yugo Nagata -- Yugo NAGATA
Re: Multi-Master Logical Replication
On Thu, 2022-04-28 at 09:49 +1000, Peter Smith wrote: > To prevent the circular recursive problem Vignesh is developing a > patch [1] that introduces new SUBSCRIPTION options "local_only" (for > publishing only data originating at the publisher node) and > "copy_data=force". Using this patch, we have created a script [2] > demonstrating how to set up all the above multi-node examples. An > overview of the necessary steps is given in the next section. I am missing a discussion of how replication conflicts are handled, to prevent replication from breaking or the databases from drifting apart. Yours, Laurenz Albe
Re: Possible corruption by CreateRestartPoint at promotion
On Thu, Apr 28, 2022 at 11:43:57AM +0900, Kyotaro Horiguchi wrote: > At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier > wrote in >> On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: >>> On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) - { 7ff23c6 has removed the last call to CreateCheckpoint() outside the checkpointer, meaning that there is one less concurrent race to worry about, but I have to admit that this change, to update the control file's checkPoint and checkPointCopy even if we don't check after ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the code less robust in ~14. So I am questioning whether a backpatch is actually worth the risk here. >>> >>> IMO we should still check this before updating ControlFile to be safe. >> >> Sure. Fine by me to play it safe. > > Why do we consider concurrent check/restart points here while we don't > consider the same for ControlFile->checkPointCopy?

I am not sure what you mean here. FWIW, I interpret Nathan's suggestion as splitting the existing check in CreateRestartPoint() that we are discussing here into two if blocks, instead of just one: - Move the update of checkPoint and checkPointCopy into its own if block, controlled only by the check on (ControlFile->checkPointCopy.redo < lastCheckPoint.redo) - Keep the code updating minRecoveryPoint and minRecoveryPointTLI mostly as-is, but just do the update when the control file state is DB_IN_ARCHIVE_RECOVERY. Of course we need to keep the check on (minRecoveryPoint < lastCheckPointEndPtr). v5 is mostly doing that. -- Michael signature.asc Description: PGP signature
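In code form, the two if blocks might look like the following sketch (variable names as in CreateRestartPoint(); this outlines the idea rather than quoting the actual v5 patch):

LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
{
	/* Advance the checkpoint position whenever the new one is later. */
	ControlFile->checkPoint = lastCheckPointRecPtr;
	ControlFile->checkPointCopy = lastCheckPoint;
}
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
	ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
{
	/* Only move minRecoveryPoint forward while still in archive recovery. */
	ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
	ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
}
UpdateControlFile();
LWLockRelease(ControlFileLock);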