Question about HEAP_XMIN_COMMITTED
Hi, I did the following steps on PG.

1. Built a synchronous streaming replication environment.
2. Executed the following SQL statements on the primary:
(1) postgres=# CREATE EXTENSION pageinspect;
(2) postgres=# begin;
(3) postgres=# select txid_current();
(4) postgres=# create table mytest6(i int);
(6) postgres=# insert into mytest6 values(1);
(7) postgres=# commit;
3. Executed the following SQL statements on the standby:
(8) postgres=# select * from mytest6;
 i
---
 1
(1 row)
(9) postgres=# SELECT t_infomask FROM heap_page_items(get_raw_page('pg_class', 0)) where t_xmin=502※;
 t_infomask
------------
       2049
(1 row)
※502 is the transaction ID returned by step (3) above.

In the result of step (9), the value of the t_infomask field is 2049 (0x801), which means that the HEAP_XMAX_INVALID and HEAP_HASNULL flags were set, but the HEAP_XMIN_COMMITTED flag was not set. According to the source, when step (8) was executed, the SetHintBits function was called to set HEAP_XMIN_COMMITTED; however, the minRecoveryPoint value had not been updated, so the HEAP_XMIN_COMMITTED flag was not set. After a checkpoint, selecting from mytest6 again in another session shows that the HEAP_XMIN_COMMITTED flag was set.

So my question is: before the checkpoint, the HEAP_XMIN_COMMITTED flag is expected not to be set, right? Or do we need to move minRecoveryPoint forward so that the HEAP_XMIN_COMMITTED flag is set correctly on the first select from mytest6?

Best Regards,
LiuHuailing
--
Regards,
Liu Huailing
--
Liu Huailing
Development Department III
Software Division II
Nanjing Fujitsu Nanda Software Tech. Co., Ltd.(FNST)
ADDR.: No.6 Wenzhu Road, Software Avenue, Nanjing, 210012, China
TEL : +86+25-86630566-8439 COINS: 7998-8439 FAX : +86+25-83317685
MAIL : liuhuail...@cn.fujitsu.com
--
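For reference, below is a paraphrased sketch of the hint-bit logic in SetHintBits() (src/backend/access/heap/heapam_visibility.c). It is not the exact source, but it shows why the hint is skipped while the commit record's LSN is not yet covered by the flushed/replayed WAL position, which on a standby is bounded by minRecoveryPoint:

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    if (TransactionIdIsValid(xid))
    {
        /* NB: xid must be known committed here! */
        XLogRecPtr  commitLSN = TransactionIdGetCommitLSN(xid);

        if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
            BufferGetLSNAtomic(buffer) < commitLSN)
        {
            /*
             * The commit record is not flushed yet (on a standby: not yet
             * covered by minRecoveryPoint) and the page LSN provides no
             * interlock, so the hint bit is silently not set -- this is the
             * behaviour observed in step (8) above.
             */
            return;
        }
    }

    tuple->t_infomask |= infomask;
    MarkBufferDirtyHint(buffer, true);
}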
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 1:07 PM Dilip Kumar wrote: > > On Tue, Dec 14, 2021 at 8:20 AM Amit Kapila wrote: > > > > On Mon, Dec 13, 2021 at 6:55 PM Masahiko Sawada > > wrote: > > > > In streaming cases, we don’t know when stream-commit or stream-abort > > > comes and another conflict could occur on the subscription in the > > > meanwhile. But given that (we expect) this feature is used after the > > > apply worker enters into an error loop, this is unlikely to happen in > > > practice unless the user sets the wrong XID. Similarly, in 2PC cases, > > > we don’t know when commit-prepared or rollback-prepared comes and > > > another conflict could occur in the meanwhile. But this could occur in > > > practice even if the user specified the correct XID. Therefore, if we > > > disallow to change skip_xid until the subscriber receives > > > commit-prepared or rollback-prepared, we cannot skip the second > > > transaction that conflicts with data on the subscriber. > > > > > > > I agree with this theory. Can we reflect this in comments so that in > > the future we know why we didn't pursue this direction? > > I might be missing something here, but for streaming, transaction > users can decide whether they want to skip or not only once we start > applying no? I mean only once we start applying the changes we can > get some errors and by that time we must be having all the changes for > the transaction. > That is right and as per my understanding, the patch is trying to accomplish the same. > So I do not understand the point we are trying to > discuss here? > The point is whether we can skip the changes while streaming itself, i.e., when we get the changes and write them to a stream file. Now, it is possible that streams from multiple transactions can be interleaved and users can change the skip_xid in between. It is not that we can't handle this but that would require a more complex design and it doesn't seem worth it because we can anyway skip the changes while applying as you mentioned in the previous paragraph. -- With Regards, Amit Kapila.
Re: Adding CI to our tree
> On 14 Dec 2021, at 05:11, Andres Freund wrote: > On 2021-12-14 16:51:58 +1300, Thomas Munro wrote: >> I'd better go and figure out how to fix cfbot when this lands... > > I assume it'd be: > - stop adding the CI stuff > - adjust links to CI tasks, appveyor wouldn't be used anymore > - perhaps reference individual tasks from the cfbot page? +1 on leveraging the CI tasks in the tree in the CFBot. For a patch like the libnss TLS backend one it would be a great help to both developer and reviewer to have that codepath actually built and tested as part of the CFBot; being able to tweak the CI tasks used in the CFBot per patch would be very helpful. -- Daniel Gustafsson https://vmware.com/
Re: Skipping logical replication transactions on subscriber side
On Fri, Dec 10, 2021 at 11:14 AM Masahiko Sawada wrote: > > On Thu, Dec 9, 2021 at 6:16 PM Amit Kapila wrote: > > > > On Thu, Dec 9, 2021 at 2:24 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Dec 9, 2021 at 11:47 AM Amit Kapila > > > wrote: > > > > > > > > I am thinking that we can start a transaction, update the catalog, > > > > commit that transaction. Then start a new one to update > > > > origin_lsn/timestamp, finishprepared, and commit it. Now, if it > > > > crashes after the first transaction, only commit prepared will be > > > > resent again and this time we don't need to update the catalog as that > > > > entry would be already cleared. > > > > > > Sounds good. In the crash case, it should be fine since we will just > > > commit an empty transaction. The same is true for the case where > > > skip_xid has been changed after skipping and preparing the transaction > > > and before handling commit_prepared. > > > > > > Regarding the case where the user specifies XID of the transaction > > > after it is prepared on the subscriber (i.g., the transaction is not > > > empty), we won’t skip committing the prepared transaction. But I think > > > that we don't need to support skipping already-prepared transaction > > > since such transaction doesn't conflict with anything regardless of > > > having changed or not. > > > > > > > Yeah, this makes sense to me. > > > > I've attached an updated patch. The new syntax is like "ALTER > SUBSCRIPTION testsub SKIP (xid = '123')". > > I’ve been thinking we can do something safeguard for the case where > the user specified the wrong xid. For example, can we somewhat use the > stats in pg_stat_subscription_workers? An idea is that logical > replication worker fetches the xid from the stats when reading the > subscription and skips the transaction if the xid matches to > subskipxid. That is, the worker checks the error reported by the > worker previously working on the same subscription. The error could > not be a conflict error (e.g., connection error etc.) or might have > been cleared by the reset function, But given the worker is in an > error loop, the worker can eventually get xid in question. We can > prevent an unrelated transaction from being skipped unexpectedly. It > seems not a stable solution though. Or it might be enough to warn > users when they specified an XID that doesn’t match to last_error_xid. > Anyway, I think it’s better to have more discussion on this. Any > ideas? Few comments: 1) Should we check if conflicting option is specified like others above: + else if (strcmp(defel->defname, "xid") == 0) + { + char *xid_str = defGetString(defel); + TransactionId xid; + + if (strcmp(xid_str, "-1") == 0) + { + /* Setting -1 to xid means to reset it */ + xid = InvalidTransactionId; + } + else + { 2) Currently only superusers can set skip xid, we can add this in the documentation: + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser to set %s", "skip_xid"))); 3) There is an extra tab before "The resolution can be done ...", it can be removed. + Skip applying changes of the particular transaction. If incoming data + violates any constraints the logical replication will stop until it is + resolved. The resolution can be done either by changing data on the + subscriber so that it doesn't conflict with incoming change or by skipping + the whole transaction. The logical replication worker skips all data 4) xid with -2 is currently allowed, may be it is ok. 
If it is fine we can remove it from the fail section. +-- fail +ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1.1); +ERROR: invalid transaction id: 1.1 +ALTER SUBSCRIPTION regress_testsub SKIP (xid = -2); +ALTER SUBSCRIPTION regress_testsub SKIP (xid = 0); +ERROR: invalid transaction id: 0 +ALTER SUBSCRIPTION regress_testsub SKIP (xid = 1); Regards, Vignesh
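For readers following the thread, here is a minimal usage sketch of the syntax under review (taken from the examples above; the exact behaviour is still being discussed, so treat this as illustrative only):

-- The apply worker keeps failing on remote transaction 123, so tell the
-- subscription to skip that transaction:
ALTER SUBSCRIPTION testsub SKIP (xid = '123');

-- Per the patch snippet above, -1 resets the setting once the transaction
-- has been skipped:
ALTER SUBSCRIPTION testsub SKIP (xid = '-1');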
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 2:36 PM Amit Kapila wrote: > > > > > > > I agree with this theory. Can we reflect this in comments so that in > > > the future we know why we didn't pursue this direction? > > > > I might be missing something here, but for streaming, transaction > > users can decide whether they wants to skip or not only once we start > > applying no? I mean only once we start applying the changes we can > > get some errors and by that time we must be having all the changes for > > the transaction. > > > > That is right and as per my understanding, the patch is trying to > accomplish the same. > > > So I do not understand the point we are trying to > > discuss here? > > > > The point is that whether we can skip the changes while streaming > itself like when we get the changes and write to a stream file. Now, > it is possible that streams from multiple transactions can be > interleaved and users can change the skip_xid in between. It is not > that we can't handle this but that would require a more complex design > and it doesn't seem worth it because we can anyway skip the changes > while applying as you mentioned in the previous paragraph. Actually, I was trying to understand the use case for skipping while streaming. Actually, during streaming we are not doing any database operation that means this will not generate any error. So IIUC, there is no use case for skipping while streaming itself? Is there any use case which I am not aware of? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Add client connection check during the execution of the query
On Tue, Dec 14, 2021 at 11:18 AM Thomas Munro wrote: > On Tue, Dec 14, 2021 at 7:53 AM Andres Freund wrote: > > On 2021-12-13 17:51:00 +1300, Thomas Munro wrote: > > > I tried that. It seems OK, and gets rid of the PG_FINALLY(), which is > > > nice. Latches still have higher priority, and still have the fast > > > return if already set and you asked for only one event, but now if you > > > ask for nevents > 1 we'll make the syscall too so we'll see the > > > WL_SOCKET_CLOSED. > > > > Isn't a certain postgres committer that cares a lot about unnecessary > > syscalls > > going to be upset about this one? Even with the nevents > 1 optimization? > > Yes, > > right now there's no other paths that do so, but I don't like the corner > > this > > paints us in. > > Well, I was trying to avoid bikeshedding an API change just for a > hypothetical problem we could solve when the time comes (say, after > fixing the more egregious problems with Append's WES usage), but here > goes: we could do something like AddWaitEventToSet(FeBeWaitSet, > WL_LATCH_SET_LOPRIO, ...) that is translated to WL_LATCH_SET > internally but also sets a flag to enable this > no-really-please-poll-all-the-things-if-there-is-space behaviour. Here's one like that. From 6b994807c29157ac7053ec347a51268d60f2b3b4 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 30 Apr 2021 10:38:40 +1200 Subject: [PATCH v5 1/3] Add WL_SOCKET_CLOSED for socket shutdown events. Provide a way for WaitEventSet to report that the remote peer has shut down its socket, independently of whether there is any buffered data remaining to be read. This works only on systems where the kernel exposes that information, namely: * WAIT_USE_POLL builds using POLLRDHUP, if available * WAIT_USE_EPOLL builds using EPOLLRDHUP * WAIT_USE_KQUEUE builds using EV_EOF Reviewed-by: Zhihong Yu Reviewed-by: Maksim Milyutin Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru --- src/backend/storage/ipc/latch.c | 79 + src/include/storage/latch.h | 6 ++- 2 files changed, 74 insertions(+), 11 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 1d893cf863..54e928c564 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -841,6 +841,7 @@ FreeWaitEventSet(WaitEventSet *set) * - WL_SOCKET_CONNECTED: Wait for socket connection to be established, * can be combined with other WL_SOCKET_* events (on non-Windows * platforms, this is the same as WL_SOCKET_WRITEABLE) + * - WL_SOCKET_CLOSED: Wait for socket to be closed by remote peer. 
* - WL_EXIT_ON_PM_DEATH: Exit immediately if the postmaster dies * * Returns the offset in WaitEventSet->events (starting from 0), which can be @@ -1043,12 +1044,16 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action) else { Assert(event->fd != PGINVALID_SOCKET); - Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)); + Assert(event->events & (WL_SOCKET_READABLE | +WL_SOCKET_WRITEABLE | +WL_SOCKET_CLOSED)); if (event->events & WL_SOCKET_READABLE) epoll_ev.events |= EPOLLIN; if (event->events & WL_SOCKET_WRITEABLE) epoll_ev.events |= EPOLLOUT; + if (event->events & WL_SOCKET_CLOSED) + epoll_ev.events |= EPOLLRDHUP; } /* @@ -1087,12 +1092,18 @@ WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event) } else { - Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)); + Assert(event->events & (WL_SOCKET_READABLE | +WL_SOCKET_WRITEABLE | +WL_SOCKET_CLOSED)); pollfd->events = 0; if (event->events & WL_SOCKET_READABLE) pollfd->events |= POLLIN; if (event->events & WL_SOCKET_WRITEABLE) pollfd->events |= POLLOUT; +#ifdef POLLRDHUP + if (event->events & WL_SOCKET_CLOSED) + pollfd->events |= POLLRDHUP; +#endif } Assert(event->fd != PGINVALID_SOCKET); @@ -1165,7 +1176,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events) Assert(event->events != WL_LATCH_SET || set->latch != NULL); Assert(event->events == WL_LATCH_SET || event->events == WL_POSTMASTER_DEATH || - (event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))); + (event->events & (WL_SOCKET_READABLE | + WL_SOCKET_WRITEABLE | + WL_SOCKET_CLOSED))); if (event->events == WL_POSTMASTER_DEATH) { @@ -1188,9 +1201,9 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events) * old event mask to the new event mask, since kevent treats readable * and writable as separate events. */ - if (old_events & WL_SOCKET_READABLE) + if (old_events & (WL_SOCKET_READABLE | WL_SOCKET_CLOSED)) old_filt_read = true; - if (event->events & WL_SOCKET_READABLE) + if (event->events & (WL_SOCKET_READABLE | WL_SOCKET_CLOSED)) new_filt_read = true; if (old_events & WL_SOCKET_WRITEABLE) old_filt_write = true; @@ -1210,7 +1223,10 @@
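As a usage illustration only (not part of the patch), a caller could probe for a closed peer socket with the existing WaitEventSet API plus the new event; the wrapper name below is made up:

/* Sketch: poll once (zero timeout) for WL_SOCKET_CLOSED on a socket. */
static bool
socket_peer_closed(pgsocket sock)
{
    WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext, 1);
    WaitEvent    event;
    bool         closed = false;

    AddWaitEventToSet(wes, WL_SOCKET_CLOSED, sock, NULL, NULL);
    if (WaitEventSetWait(wes, 0 /* don't block */, &event, 1, 0) == 1)
        closed = (event.events & WL_SOCKET_CLOSED) != 0;
    FreeWaitEventSet(wes);

    return closed;
}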
Re: row filtering for logical replication
On Tue, Dec 14, 2021 at 10:50 AM Amit Kapila wrote: > > On Tue, Dec 14, 2021 at 4:44 AM Peter Smith wrote: > > Few other comments: > === Few more comments: == v46-0001/0002 === 1. After rowfilter_walker() why do we need EXPR_KIND_PUBLICATION_WHERE? I thought this is primarily to identify the expressions that are not allowed in rowfilter which we are now able to detect upfront with the help of a walker. Can't we instead use EXPR_KIND_WHERE? 2. +Node * +GetTransformedWhereClause(ParseState *pstate, PublicationRelInfo *pri, + bool bfixupcollation) Can we add comments atop this function? 3. In GetTransformedWhereClause, can we change the name of variables (a) bfixupcollation to fixup_collation or assign_collation, (b) transformedwhereclause to whereclause. I think that will make the function more readable. v46-0002 4. + else if (IsA(node, List) || IsA(node, Const) || IsA(node, BoolExpr) || IsA(node, NullIfExpr) || + IsA(node, NullTest) || IsA(node, BooleanTest) || IsA(node, CoalesceExpr) || + IsA(node, CaseExpr) || IsA(node, CaseTestExpr) || IsA(node, MinMaxExpr) || + IsA(node, ArrayExpr) || IsA(node, ScalarArrayOpExpr) || IsA(node, XmlExpr)) Can we move this to a separate function say IsValidRowFilterExpr() or something on those lines and use Switch (nodetag(node)) to identify these nodes? -- With Regards, Amit Kapila.
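Regarding point 4, a rough sketch of the suggested refactoring (the function name comes from the comment above; this is hypothetical, not code from the posted patch):

/* Return true if the given node type is allowed in a row filter expression. */
static bool
IsValidRowFilterExpr(Node *node)
{
    switch (nodeTag(node))
    {
        case T_List:
        case T_Const:
        case T_BoolExpr:
        case T_NullIfExpr:
        case T_NullTest:
        case T_BooleanTest:
        case T_CoalesceExpr:
        case T_CaseExpr:
        case T_CaseTestExpr:
        case T_MinMaxExpr:
        case T_ArrayExpr:
        case T_ScalarArrayOpExpr:
        case T_XmlExpr:
            return true;
        default:
            return false;
    }
}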
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 3:41 PM Dilip Kumar wrote: > > On Tue, Dec 14, 2021 at 2:36 PM Amit Kapila wrote: > > > > > > > > > > I agree with this theory. Can we reflect this in comments so that in > > > > the future we know why we didn't pursue this direction? > > > > > > I might be missing something here, but for streaming, transaction > > > users can decide whether they wants to skip or not only once we start > > > applying no? I mean only once we start applying the changes we can > > > get some errors and by that time we must be having all the changes for > > > the transaction. > > > > > > > That is right and as per my understanding, the patch is trying to > > accomplish the same. > > > > > So I do not understand the point we are trying to > > > discuss here? > > > > > > > The point is that whether we can skip the changes while streaming > > itself like when we get the changes and write to a stream file. Now, > > it is possible that streams from multiple transactions can be > > interleaved and users can change the skip_xid in between. It is not > > that we can't handle this but that would require a more complex design > > and it doesn't seem worth it because we can anyway skip the changes > > while applying as you mentioned in the previous paragraph. > > Actually, I was trying to understand the use case for skipping while > streaming. Actually, during streaming we are not doing any database > operation that means this will not generate any error. > Say, there is an error the first time when we start to apply changes for such a transaction. So, such a transaction will be streamed again. Say, the user has set the skip_xid before we stream a second time, so this time, we can skip it either during the stream phase or apply phase. I think the patch is skipping it during apply phase. Sawada-San, please confirm if my understanding is correct? -- With Regards, Amit Kapila.
Confused comment about drop replica identity index
Hi hackers,

While doing development based on PG, I found that the following comment in src/include/catalog/pg_class.h has a small problem.

/*
 * an explicitly chosen candidate key's columns are used as replica identity.
 * Note this will still be set if the index has been dropped; in that case it
 * has the same meaning as 'd'.
 */
#define REPLICA_IDENTITY_INDEX 'i'

The last sentence makes me a little confused: [..in that case it has the same meaning as 'd'.]

Currently, the PG docs do not describe this clearly.

But if I drop the relation's replica identity index as described in the comment, the behavior is not the same as the default. For example, execute the following SQL:

create table tbl (col1 int primary key, col2 int not null);
create unique INDEX ON tbl(col2);
alter table tbl replica identity using INDEX tbl_col2_idx;
drop index tbl_col2_idx;
create publication pub for table tbl;
delete from tbl;

Actual result:
ERROR: cannot delete from table "tbl" because it does not have a replica identity and publishes deletes
HINT: To enable deleting from the table, set REPLICA IDENTITY using ALTER TABLE.

Expected result according to the comment:
DELETE 0

I found that in the function CheckCmdReplicaIdentity, the case described in the comment is not handled: when the relation's replica identity index is found to be InvalidOid, an error is reported.

Is the comment here not accurate enough? Or do we need to adjust the code according to the comment?

Regards,
Wang wei
Re: Add sub-transaction overflow status in pg_stat_activity
Hi, I have looked into the v2 patch and here are my comments: + PG_RETURN_INT32(local_beentry->subxact_overflowed); +} Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32?? -- +{ oid => '6107', descr => 'statistics: cached subtransaction count of backend', + proname => 'pg_stat_get_backend_subxact_count', provolatile => 's', proparallel => 'r', + prorettype => 'int4', proargtypes => 'int4', + prosrc => 'pg_stat_get_backend_subxact_count' }, +{ oid => '6108', descr => 'statistics: subtransaction cache of backend overflowed', + proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's', proparallel => 'r', + prorettype => 'bool', proargtypes => 'int4', + prosrc => 'pg_stat_get_backend_subxact_overflow' }, The description says that the two new functions show the statistics for "cached subtransaction count of backend" and "subtransaction cache of backend overflowed". But, when these functions are called it shows these stats for the non-backend processes like checkpointer, walwriter etc as well. Should that happen? -- typedef struct LocalPgBackendStatus * not. */ TransactionId backend_xmin; + + /* +* Number of cached subtransactions in the current session. +*/ + int subxact_count; + + /* +* The number of subtransactions in the current session exceeded the cached +* subtransaction limit. +*/ + bool subxact_overflowed; All the variables inside this LocalPgBackendStatus structure are prefixed with "backend" word. Can we do the same for the newly added variables as well? -- + * Get the xid and xmin, nsubxid and overflow status of the backend. The Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid and xmin, nsubxid and overflow"? -- With Regards, Ashutosh Sharma. On Mon, Dec 13, 2021 at 7:58 PM Dilip Kumar wrote: > On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby > wrote: > > > > > You added this to pg_stat_activity, which already has a lot of fields. > > We talked a few months ago about not adding more fields that weren't > commonly > > used. > > > https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e > > > > Since I think this field is usually not interesting to most users of > > pg_stat_activity, maybe this should instead be implemented as a function > like > > pg_backend_get_subxact_status(pid). > > > > People who want to could use it like: > > SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) > sub; > > I have provided two function, one for subtransaction counts and other > whether subtransaction cache is overflowed or not, we can use like > this, if we think this is better way to do it then we can also add > another function for the lastOverflowedXid > > postgres[43994]=# select id, pg_stat_get_backend_pid(id) as pid, > pg_stat_get_backend_subxact_count(id) as nsubxact, > pg_stat_get_backend_subxact_overflow(id) as overflowed from > pg_stat_get_backend_idset() as id; > id | pid | nsubxact | overflowed > +---+--+ > 1 | 43806 |0 | f > 2 | 43983 | 64 | t > 3 | 43994 |0 | f > 4 | 44323 | 22 | f > 5 | 43802 |0 | f > 6 | 43801 |0 | f > 7 | 43804 |0 | f > (7 rows) > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com >
RE: Failed transaction statistics to measure the logical replication progress
On Tues, Dec 14, 2021 10:28 AM Amit Kapila wrote: > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > wrote: > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > Few questions and comments: > > Thank you for your comments ! > > > > > > > > 1. > > > one row per subscription worker on which errors have occurred, for > > > workers > > > applying logical replication changes and workers handling the initial > > > data > > > - copy of the subscribed tables. The statistics entry is removed when > > > the > > > - corresponding subscription is dropped. > > > + copy of the subscribed tables. Also, the row corresponding to the > > > apply > > > + worker shows all transaction statistics of both types of workers on > > > the > > > + subscription. The statistics entry is removed when the corresponding > > > + subscription is dropped. > > > > > > Why did you choose to show stats for both types of workers in one row? > > This is because if we have hundreds or thousands of tables for table sync, > > we need to create many entries to cover them and store the entries for all > > tables. > > > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? Personally, I agreed that merging two types of stats into one row might not be a good idea. And the xact stats of table sync workers are usually less interesting than the apply worker's, So, it's seems acceptable to me if we show stats only for apply workers and document about this. Best regards, Hou zj
Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?
On Fri, Dec 3, 2021 at 7:39 PM Bharath Rupireddy wrote: > > On Wed, Dec 1, 2021 at 9:50 PM Alvaro Herrera wrote: > > > > On 2021-Dec-01, Bharath Rupireddy wrote: > > > > > The active_pid of ReplicationSlot structure, which tells whether a > > > replication slot is active or inactive, isn't persisted to the disk > > > i.e has no entry in ReplicationSlotPersistentData structure. Isn't it > > > better if we add that info to ReplicationSlotPersistentData structure > > > and persist to the disk? This will help to know what were the inactive > > > replication slots in case the server goes down or crashes for some > > > reason. Currently, we don't have a way to interpret the replication > > > slot info in the disk but there's a patch for pg_replslotdata tool at > > > [1]. This way, one can figure out the reasons for the server > > > down/crash and figure out which replication slots to remove to bring > > > the server up and running without touching the other replication > > > slots. > > > > I think the PIDs are log-worthy for sure, but it's not clear to me that > > it is desirable to write them to the persistent state file. In case of > > crashes, the log should serve just fine to aid root cause investigation > > -- in fact even better than the persistent file, where the data would be > > lost as soon as the next client acquires that slot. > > Thanks. +1 to log a message at LOG level whenever a replication slot > becomes active (gets assigned a valid pid to active_pid) and becomes > inactive(gets assigned 0 to active_pid). Having said that, isn't it > also helpful if we write a bool (1 byte character) whenever the slot > becomes active and inactive to the disk? Here's the patch that adds a LOG message whenever a replication slot becomes active and inactive. These logs will be extremely useful on production servers to debug and analyze inactive replication slot issues. Thoughts? Regards, Bharath Rupireddy. v1-0001-Add-LOG-messages-when-replication-slots-become-ac.patch Description: Binary data
Re: Confused comment about drop replica identity index
On Tue, Dec 14, 2021 at 6:08 PM wangw.f...@fujitsu.com wrote: > > Hi hackers, > > When I doing development based by PG, I found the following comment have a > little problem in file src/include/catalog/pg_class.h. > > /* > * an explicitly chosen candidate key's columns are used as replica identity. > * Note this will still be set if the index has been dropped; in that case it > * has the same meaning as 'd'. > */ > #define REPLICA_IDENTITY_INDEX'i' > > The last sentence makes me a little confused : > [..in that case it as the same meaning as 'd'.] > > Now, pg-doc didn't have a clear style to describe this. > > > But if I drop relation's replica identity index like the comment, the action > is not as same as default. > > For example: > Execute the following SQL: > create table tbl (col1 int primary key, col2 int not null); > create unique INDEX ON tbl(col2); > alter table tbl replica identity using INDEX tbl_col2_idx; > drop index tbl_col2_idx; > create publication pub for table tbl; > delete from tbl; > > Actual result: > ERROR: cannot delete from table "tbl" because it does not have a replica > identity and publishes deletes > HINT: To enable deleting from the table, set REPLICA IDENTITY using ALTER > TABLE. I think I see where's the confusion. The table has a primary key and so when the replica identity index is dropped, per the comment in code, you expect that primary key will be used as replica identity since that's what 'd' or default means. > > Expected result in comment: > DELETE 0 > > > I found that in the function CheckCmdReplicaIdentity, the operation described > in the comment is not considered, > When relation's replica identity index is found to be InvalidOid, an error is > reported. This code in RelationGetIndexList() is not according to that comment. if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex)) relation->rd_replidindex = pkeyIndex; else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex)) relation->rd_replidindex = candidateIndex; else relation->rd_replidindex = InvalidOid; > > Are the comment here not accurate enough? > Or we need to adjust the code according to the comments? > Comment in code is one thing, but I think PG documentation is not covering the use case you tried. What happens when a replica identity index is dropped has not been covered either in ALTER TABLE https://www.postgresql.org/docs/13/sql-altertable.html or DROP INDEX https://www.postgresql.org/docs/14/sql-dropindex.html documentation. -- Best Wishes, Ashutosh Bapat
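To make the mismatch concrete, the behaviour the pg_class.h comment seems to describe would look roughly like this in RelationGetIndexList() (a hypothetical change for illustration only, not an agreed proposal):

if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
    relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX)
{
    if (OidIsValid(candidateIndex))
        relation->rd_replidindex = candidateIndex;
    else if (OidIsValid(pkeyIndex))
        relation->rd_replidindex = pkeyIndex;    /* "same meaning as 'd'" */
    else
        relation->rd_replidindex = InvalidOid;
}
else
    relation->rd_replidindex = InvalidOid;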
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi wrote: > > Hello. > > As complained in pgsql-bugs [1], when a process is terminated due to > max_slot_wal_keep_size, the related messages don't mention the root > cause for *the termination*. Note that the third message does not > show for temporary replication slots. > > [pid=a] LOG: terminating process x to release replication slot "s" > [pid=x] LOG: FATAL: terminating connection due to administrator command > [pid=a] LOG: invalidting slot "s" because its restart_lsn X/X exceeds > max_slot_wal_keep_size > > The attached patch attaches a DETAIL line to the first message. > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > + [17605] DETAIL: The slot's restart_lsn 0/2CA0 exceeds > max_slot_wal_keep_size. > > [17614] FATAL: terminating connection due to administrator command > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2CA0 > > exceeds max_slot_wal_keep_size > > Somewhat the second and fourth lines look inconsistent each other but > that wouldn't be such a problem. I don't think we want to concatenate > the two lines together as the result is a bit too long. > > > LOG: terminating process 17614 to release replication slot "s1" because > > it's restart_lsn 0/2CA0 exceeds max_slot_wal_keep_size. > > What do you think about this? Agree. I think we should also specify the restart_lsn value which would be within max_slot_wal_keep_size for better understanding. -- Best Wishes, Ashutosh Bapat
Re: conchuela has some SSL issues
On 2021-12-13 20:53, Andrew Dunstan wrote: Any thoughts? It's also failing on REL_14_STABLE, so I think it must be an environmental change. I did a pkg update && pkg upgrade and it messed up the SSL libraries. It had both libressl and openssl, and when I upgraded it somehow removed the libressl libraries. One would think it would still work as it would pick up the openssl library instead. But apparently not. I will take a look at it. At the moment I have disabled new builds until I have fixed it. Sorry for the inconvenience. /Mikael
Re: Defining (and possibly skipping) useless VACUUM operations
On Sun, Dec 12, 2021 at 8:47 PM Peter Geoghegan wrote: > I am currently working on decoupling advancing relfrozenxid from tuple > freezing [1]. That is, I'm teaching VACUUM to keep track of > information that it uses to generate an "optimal value" for the > table's final relfrozenxid: the most recent XID value that might still > be in the table. This patch is based on the observation that we don't > actually have to use the FreezeLimit cutoff for our new > pg_class.relfrozenxid. We need only obey the basic relfrozenxid > invariant, which is that the final value must be <= any extant XID in > the table. Using FreezeLimit is needlessly conservative. Right. > It now occurs to me to push this patch in another direction, on top of > all that: the OldestXmin behavior hints at a precise, robust way of > defining "useless vacuuming". We can condition skipping a VACUUM (i.e. > whether a VACUUM is considered "definitely won't be useful if allowed > to execute") on whether or not our preexisting pg_class.relfrozenxid > precisely equals our newly-acquired OldestXmin for an about-to-begin > VACUUM operation. (We'd also want to add an "unchangeable > pg_class.relminmxid" test, I think.) I think this is a reasonable line of thinking, but I think it's a little imprecise. In general, we could be vacuuming a relation to advance relfrozenxid, but we could also be vacuuming a relation to advance relminmxid, or we could be vacuuming a relation to fight bloat, or set pages all-visible. It is possible that there's no hope of advancing relfrozenxid but that we can still accomplish one of the other goals. In that case, the vacuuming is not useless. I think the place to put logic around this would be in the triggering logic for autovacuum. If we're going to force a relation to be vacuumed because of (M)XID wraparound danger, we could first check whether there seems to be any hope of advancing relfrozenxid(minmxid). If not, we discount that as a trigger for vacuum, but may still decide to vacuum if some other trigger warrants it. In most cases, if there's no hope of advancing relfrozenxid, there won't be any bloat to remove either, but aborted transactions are a counterexample. And the XID and MXID horizons can advance at completely different rates. One reason I haven't pursued this kind of optimization is that it doesn't really feel like it's fixing the whole problem. It would be a little bit sad if we did a perfect job preventing useless vacuuming but still allowed almost-useless vacuuming. Suppose we have a 1TB relation and we trigger autovacuum. It cleans up a few things but relfrozenxid is still old. On the next pass, we see that the system-wide xmin has not advanced, so we don't trigger autovacuum again. Then on the pass after that we see that the system-wide xmin has advanced by 1. Shall we trigger an autovacuum of the whole relation now, to be able to do relfrozenxid++? Seems dubious. Part of the problem here, for both vacuuming-for-bloat and vacuuming-for-relfrozenxid-advancement, we would really like to know the distribution of old XIDs in the table. If we knew that a lot of the inserts, updates, and deletes that are causing us to vacuum for bloat containment were in a certain relatively narrow range, then we'd probably want to not autovacuum for either purpose until the system-wide xmin has crossed through at least a good chunk of that range. 
And once it has fully crossed over that range, an immediate vacuum looks extremely appealing: we'll both remove a bunch of dead tuples and reclaim the associated line pointers, and at the same time we'll be able to advance relfrozenxid. Nice! But we have no such information. So I'm not certain of the way forward here. Just because we can't prevent almost-useless vacuuming is not a sufficient reason to continue allowing entirely-useless vacuuming that we can prevent. And it seems like we need a bunch of new bookkeeping to do any better than that, which seems like a lot of work. So maybe it's the most practical path forward for the time being, but it feels like more of a special-purpose kludge than a truly high-quality solution. -- Robert Haas EDB: http://www.enterprisedb.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi wrote: > > Hello. > > As complained in pgsql-bugs [1], when a process is terminated due to > max_slot_wal_keep_size, the related messages don't mention the root > cause for *the termination*. Note that the third message does not > show for temporary replication slots. > > [pid=a] LOG: "terminating process %d to release replication slot \"%s\"" > [pid=x] LOG: FATAL: terminating connection due to administrator command > [pid=a] LOG: invalidting slot "s" because its restart_lsn X/X exceeds > max_slot_wal_keep_size > > The attached patch attaches a DETAIL line to the first message. > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > + [17605] DETAIL: The slot's restart_lsn 0/2CA0 exceeds > max_slot_wal_keep_size. > > [17614] FATAL: terminating connection due to administrator command > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2CA0 > > exceeds max_slot_wal_keep_size > > Somewhat the second and fourth lines look inconsistent each other but > that wouldn't be such a problem. I don't think we want to concatenate > the two lines together as the result is a bit too long. > > > LOG: terminating process 17614 to release replication slot "s1" because > > it's restart_lsn 0/2CA0 exceeds max_slot_wal_keep_size. > > What do you think about this? > > [1] > https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com +1 to give more context to the "terminating process %d to release replication slot \"%s\"" message. How about having below, instead of adding errdetail: "terminating process %d to release replication slot \"%s\" whose restart_lsn %X/%X exceeds max_slot_wal_keep_size"? I think we can keep the "invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size" message as-is. We may not see "terminating process ..." and "invalidation slot ..." messages together for the same slot, so having slightly different wording is fine IMO. Regards, Bharath Rupireddy.
Re: cutting down the TODO list thread
On Wed, Dec 8, 2021 at 1:40 PM John Naylor wrote: > > It's been a while, but here are a few more suggested > removals/edits/additions to the TODO list. Any objections or new > information, let me know: > > - Auto-fill the free space map by scanning the buffer cache or by > checking pages written by the background writer > http://archives.postgresql.org/pgsql-hackers/2006-02/msg01125.php > https://www.postgresql.org/message-id/200603011716.16984.pete...@gmx.net > > Both these threads are from 2006, so have nothing to do with the current FSM. Moved to the Not Worth Doing list. > - Allow concurrent inserts to use recently created pages rather than > creating new ones > http://archives.postgresql.org/pgsql-hackers/2010-05/msg00853.php > > Skimming the first few messages, I believe this has been covered by > commit 719c84c1b? (Extend relations multiple blocks at a time to > improve scalability.) Removed. > - Allow VACUUM FULL and CLUSTER to update the visibility map > > This topic has a current CF entry which seems to have stalled, so that > newer thread would be better to list here than the one from 2013. Added. > - Bias FSM towards returning free space near the beginning of the heap > file, in hopes that empty pages at the end can be truncated by VACUUM > http://archives.postgresql.org/pgsql-hackers/2009-09/msg01124.php > https://www.postgresql.org/message-id/20150424190403.gp4...@alvh.no-ip.org > > I'm not sure what to think of this, but independently of that, the > second thread is actually talking about bringing back something like > the pre-9.0 vacuum full, so maybe it should be its own entry? Done. > - Consider a more compact data representation for dead tuple locations > within VACUUM > http://archives.postgresql.org/pgsql-patches/2007-05/msg00143.php > > Great, but let's link to this more recent thread instead: > https://www.postgresql.org/message-id/flat/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com Done. > > The second thread is really about autovacuum launcher scheduling. > > Probably still relevant, but the thread is very long and doesn't seem > > terribly helpful to someone trying to get up to speed on the issues > > that are still relevant. I don't see any more recent discussion, > > either. Thoughts? Split into two entries. On Wed, Dec 8, 2021 at 8:12 PM Masahiko Sawada wrote: > There is another discussion on autovacuum scheduling in 2018 here: > > https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8A4DC6%40G01JPEXMBYT05 > > Some algorithms were proposed there and I implemented a PoC patch: > > https://www.postgresql.org/message-id/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com Added, thanks! -- John Naylor EDB: http://www.enterprisedb.com
Re: Commitfest 2021-11 Patch Triage - Part 3
> On Dec 13, 2021, at 10:18 PM, Jeff Davis wrote: > >>This is implemented but I still need to update the documentation >> before >>posting. > > We also discussed how much of the insert path we want to include in the > apply worker. The immediate need is for the patch to support RLS, but > it raised the question: "why not just use the entire ExecInsert() > path?". I went a different direction with this. The need is to prevent non-superuser subscription owners from *circumventing* RLS. For this patch set, I'm just checking whether RLS should be enforced against the subscription owner, and if so, raising an ERROR, same as for a privilege violation. This should be sufficient in practice, as the table owner, roles with bypassrls privilege, and superusers should still be able to replicate into an RLS enabled table. The problem with actually *supporting* RLS is that the current logical replication implementation is a mix of different practical (rather than theoretical) design choices. After working to get RLS working as part of that, I became concerned that there may be subtle differences between how I was making RLS work in logical replication vs. how it works for regular inserts/updates/deletes. Rather than create a mess that would need to be supported going forward, I bailed out and just forbade it. As far as completeness and a clean implementation (but not performance) are concerned, using ExecInsert is quite appealing. I think that would require reworking the logical replication protocol to send more than one row at a time, so that the overhead of such a choice is not paid *per row*. That seems quite out of scope for this release cycle, though. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Column Filtering in Logical Replication
On 2021-Dec-13, Alvaro Herrera wrote: > I think this means we need a new OBJECT_PUBLICATION_REL_COLUMN value in > the ObjectType (paralelling OBJECT_COLUMN), and no new ObjectClass > value. Looking now to confirm. After working on this a little bit more, I realized that this is a bad idea overall. It causes lots of complications and it's just not worth it. So I'm back at my original thought that we need to throw an ERROR at ALTER TABLE .. DROP COLUMN time if the column is part of a replication column filter, and suggest the user to remove the column from the filter first and reattempt the DROP COLUMN. This means that we need to support changing the column list of a table in a publication. I'm looking at implementing some form of ALTER PUBLICATION for that. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Find a bug in a program, and fix it, and the program will work today. Show the program how to find and fix a bug, and the program will work forever" (Oliver Silfridge)
Re: O(n) tasks cause lengthy startups and checkpoints
On Mon, Dec 13, 2021 at 11:05:46PM +, Bossart, Nathan wrote: > On 12/13/21, 12:37 PM, "Robert Haas" wrote: > > On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan wrote: > >> > But against all that, if these tasks are slowing down checkpoints and > >> > that's avoidable, that seems pretty important too. Interestingly, I > >> > can't say that I've ever seen any of these things be a problem for > >> > checkpoint or startup speed. I wonder why you've had a different > >> > experience. > >> > >> Yeah, it's difficult for me to justify why users should suffer long > >> periods of downtime because startup or checkpointing is taking a very > >> long time doing things that are arguably unrelated to startup and > >> checkpointing. > > > > Well sure. But I've never actually seen that happen. > > I'll admit that surprises me. As noted elsewhere [0], we were seeing > this enough with pgsql_tmp that we started moving the directory aside > before starting the server. Discussions about handling this usually > prompt questions about why there are so many temporary files in the > first place (which is fair). FWIW all four functions noted in my > original message [1] are things I've seen firsthand affecting users. Have we changed temporary file handling in any recent major releases, meaning is this a current problem or one already improved in PG 14. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
pg_upgrade operation failed if table created in --single user mode
Hi, Please refer to this scenario where the pg_upgrade operation is failing if the table is created in single-user mode. PG v13 --connect to PG v13 using single user mode ( ./postgres --single -D /tmp/data13 postgres ) --create table ( backend> create table r(n int); ) --exit ( ctrl + D) -- Perform pg_upgrade ( PG v13->PG v15) (./pg_upgrade -d data13 -D data15 -b /usr/psql-12/bin -B . ) it will fail with these messages: Restoring global objects in the new cluster ok Restoring database schemas in the new cluster postgres *failure* Consult the last few lines of "pg_upgrade_dump_14174.log" for the probable cause of the failure. Failure, exiting --cat pg_upgrade_dump_14174.log -- -- -- -- pg_restore: creating TABLE "public.r" pg_restore: while PROCESSING TOC: pg_restore: from TOC entry 200; 1259 14180 TABLE r edb pg_restore: error: could not execute query: ERROR: pg_type array OID value not set when in binary upgrade mode Command was: -- For binary upgrade, must preserve pg_type oid SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('14181'::pg_catalog.oid); -- For binary upgrade, must preserve pg_class oids and relfilenodes SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('14180'::pg_catalog.oid); SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('14180'::pg_catalog.oid); CREATE TABLE "public"."r" ( "n" integer ); -- For binary upgrade, set heap's relfrozenxid and relminmxid UPDATE pg_catalog.pg_class SET relfrozenxid = '492', relminmxid = '1' WHERE oid = '"public"."r"'::pg_catalog.regclass; Is this expected? -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: Defining (and possibly skipping) useless VACUUM operations
On Tue, Dec 14, 2021 at 6:05 AM Robert Haas wrote: > I think this is a reasonable line of thinking, but I think it's a > little imprecise. In general, we could be vacuuming a relation to > advance relfrozenxid, but we could also be vacuuming a relation to > advance relminmxid, or we could be vacuuming a relation to fight > bloat, or set pages all-visible. It is possible that there's no hope > of advancing relfrozenxid but that we can still accomplish one of the > other goals. In that case, the vacuuming is not useless. I think the > place to put logic around this would be in the triggering logic for > autovacuum. If we're going to force a relation to be vacuumed because > of (M)XID wraparound danger, we could first check whether there seems > to be any hope of advancing relfrozenxid(minmxid). If not, we discount > that as a trigger for vacuum, but may still decide to vacuum if some > other trigger warrants it. In most cases, if there's no hope of > advancing relfrozenxid, there won't be any bloat to remove either, but > aborted transactions are a counterexample. And the XID and MXID > horizons can advance at completely different rates. I think that you'd agree that the arguments in favor of skipping are strongest for an aggressive anti-wraparound autovacuum (as opposed to any other kind of aggressive VACUUM, including aggressive autovacuum). Aside from the big benefit I pointed out already (avoiding blocking useful anti-wraparound vacuums that starts a little later by not starting a conflicting useless anti-wraparound vacuum now), there is also more certainty about downsides. We can know the following things for sure: * We only launch an (aggressive) anti-wraparound autovacuum because we need to advance relfrozenxid. In other words, if we didn't need to advance relfrozenxid then (for better or worse) we definitely wouldn't be launching anything. * Our would-be OldestXmin exactly matches the preexisting pg_class.relfrozenxid (and pg_class.relminmxid). And so it follows that we're definitely not going to be able to do the thing that is ostensibly the whole point of anti-wraparound vacuum (advance relfrozenxid/relminmxid). > One reason I haven't pursued this kind of optimization is that it > doesn't really feel like it's fixing the whole problem. It would be a > little bit sad if we did a perfect job preventing useless vacuuming > but still allowed almost-useless vacuuming. Suppose we have a 1TB > relation and we trigger autovacuum. It cleans up a few things but > relfrozenxid is still old. On the next pass, we see that the > system-wide xmin has not advanced, so we don't trigger autovacuum > again. Then on the pass after that we see that the system-wide xmin > has advanced by 1. Shall we trigger an autovacuum of the whole > relation now, to be able to do relfrozenxid++? Seems dubious. I can see what you mean, but just fixing the most extreme case can be a useful goal. It's often enough to stop the system from going into a tailspin, which is the real underlying goal here. Things that approach the most extreme case (but don't quite hit it) don't have that quality. An anti-wraparound vacuum is supposed to be a mechanism that the system escalates to when nothing else triggers an autovacuum worker to run (which is aggressive but not anti-wraparound). That's not really true in practice, of course; anti-wraparound av often becomes a routine thing. But I think that it's a good ideal to strive for -- it should be rare. 
The draft patch series now adds opportunistic freezing -- I should be able to post a new version in a few days time, once I've tied up some loose ends. My testing shows an interesting effect, when opportunistic freezing is applied on top of the relfrozenxid thing: every autovacuum manages to advance relfrozenxid, and so we'll never have to run an aggressive autovacuum (much less an aggressive anti-wraparound autovacuum) in practice. And so (for example) when autovacuum runs against the pgbench_history table, it always sets its relfrozenxid to a value very close to the OldestXmin -- usually the exact OldestXmin. Opportunistic freezing makes us avoid setting the all-visible bit for a heap page without also setting the all-frozen bit -- when we're about to do that, we go freeze the heap tuples and then set the entire page all-frozen (so we freeze anything <= OldestXmin, not <= FreezeLimit). We also freeze based on this more aggressive <= OldestXmin cutoff when pruning had to delete some tuples. The patch still needs more polishing, but I think that we can make anti-wraparound vacuums truly exceptional with this design -- which would make autovacuum a lot easier to deal with operationally. This seems like a feasible goal for Postgres 15, even (though still quite ambitious). The opportunistic freezing stuff isn't free (the WAL records aren't tiny), but it's still not all that expensive. Plus I think that the cost can be further reduced, with a little more work. > Part of the problem here, for bo
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On Tue, Dec 14, 2021 at 2:35 AM Robert Haas wrote: > > On Mon, Dec 13, 2021 at 9:40 AM Shruthi Gowda wrote: > > > I am reviewing another patch > > > "v5-0001-Preserve-relfilenode-and-tablespace-OID-in-pg_upg" as well > > > and will provide the comments soon if any... > > I spent much of today reviewing 0001. Here's an updated version, so > far only lightly tested. Please check whether I've broken anything. > Here are the changes: Thanks, Robert for the updated version. I reviewed the changes and it looks fine. I also tested the patch. The patch works as expected. > - I adjusted the function header comment for heap_create. Your > proposed comment seemed like it was pretty detailed but not 100% > correct. It also made one of the lines kind of long because you didn't > wrap the text in the surrounding style. I decided to make it simpler > and shorter instead of longer still and 100% correct. The comment update looks fine. However, I still feel it would be good to mention on which (rare) circumstance a valid relfilenode can get passed. > - I removed a one-line comment that said /* Override the toast > relfilenode */ because it preceded an error check, not a line of code > that would have done what the comment claimed. > > - I removed a one-line comment that said /* Override the relfilenode > */ because the following code would only sometimes override the > relfilenode. The code didn't seem complex enough to justify a a longer > and more accurate comment, so I just took it out. Fine > - I changed a test for (relkind == RELKIND_RELATION || relkind == > RELKIND_SEQUENCE || relkind == RELKIND_MATVIEW) to use > RELKIND_HAS_STORAGE(). It's true that not all of the storage types > that RELKIND_HAS_STORAGE() tests are possible here, but that's not a > reason to avoiding using the macro. If somebody adds a new relkind > with storage in the future, they might miss the need to manually > update this place, but they will not likely miss the need to update > RELKIND_HAS_STORAGE() since, if they did, their code probably wouldn't > work at all. I agree. > - I changed the way that you were passing create_storage down to > heap_create. I think I said before that you should EITHER fix it so > that we set create_storage = true only when the relation actually has > storage OR ELSE have heap_create() itself override the value to false > when there is no storage. You did both. There are times when it's > reasonable to ensure the same thing in multiple places, but this > doesn't seem to be one of them. So I took that out. I chose to retain > the code in heap_create() that overrides the value to false, added a > comment explaining that it does that, and then adjusted the callers to > ignore the storage type. I then added comments, and in one place an > assertion, to make it clearer what is happening. The changes are fine. Thanks for the fine-tuning. > - In pg_dump.c, I adjusted the comment that says "Not every relation > has storage." and the test that immediately follows, to ignore the > relfilenode when relkind says it's a partitioned table. Really, > partitioned tables should never have had relfilenodes, but as it turns > out, they used to have them. > Fine. Understood. Thanks & Regards, Shruthi KC EnterpriseDB: http://www.enterprisedb.com
Re: preserving db/ts/relfilenode OIDs across pg_upgrade (was Re: storing an explicit nonce)
On 12/14/21 2:35 AM, Robert Haas wrote: I spent much of today reviewing 0001. Here's an updated version, so far only lightly tested. Please check whether I've broken anything. Thanks Robert, I tested from v9.6/12/13/14 -> v15 (with patch); things are working fine, i.e., the table/index relfilenode is preserved and does not change after pg_upgrade. -- regards, tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
Re: pg_upgrade operation failed if table created in --single user mode
tushar writes: > Please refer to this scenario where pg_upgrade operation is failing if > the table is create in single-user mode. Fixed in v14: https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=f7f70d5e2 regards, tom lane
Re: Defining (and possibly skipping) useless VACUUM operations
On Tue, Dec 14, 2021 at 1:16 PM Peter Geoghegan wrote: > I think that you'd agree that the arguments in favor of skipping are > strongest ... Well I just don't understand why you insist on using the word "skipping." I think what we're talking about - or at least what we should be talking about - is whether relation_needs_vacanalyze() sets *wraparound = true right after the comment that says /* Force vacuum if table is at risk of wraparound */. And adding some kind of exception to the logic that's there now. > What I see with the draft patch series is that the oldest XID just > isn't that old anymore, consistently -- we literally never fail to > advance relfrozenxid, in any autovacuum, for any table. And the value > that we end up with is consistently quite recent. This is something > that I see both with BenchmarkSQL, and pgbench. There is a kind of > virtuous circle, which prevents us from ever getting anywhere near > having any table age in the tens of millions of XIDs. Yeah, I hadn't thought about it from that perspective, but that does seem very good. I think it's inevitable that there will be cases where that doesn't work out - e.g. you can always force the bad case by holding a table lock until your newborn heads off to college, or just by overthrottling autovacuum so that it can't get through the database in any reasonable amount of time - but it will be nice when it does work out, for sure. > I guess that that makes avoiding useless vacuuming seem like less of a > priority. ISTM that it should be something that is squarely aimed at > keeping things stable in truly pathological cases. Yes. I think "pathological cases" is a good summary of what's wrong with autovacuum. When there's nothing too crazy happening, it actually does pretty well. But, when resources are tight or other corner cases occur, really dumb things start to happen. So it's reasonable to think about how we can install guard rails that prevent complete insanity. -- Robert Haas EDB: http://www.enterprisedb.com
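To make the shape of that exception concrete, here is a rough sketch of what it could look like inside relation_needs_vacanalyze(); this is not actual PostgreSQL source, and oldest_removable_xid is an assumed variable standing for the OldestXmin a vacuum started right now would compute:

/*
 * Hypothetical exception: if vacuuming now could not advance
 * relfrozenxid, don't treat wraparound as a reason to force a vacuum;
 * other triggers (bloat thresholds, analyze) still apply.
 */
if (force_vacuum &&
    !TransactionIdPrecedes(classForm->relfrozenxid, oldest_removable_xid))
{
    force_vacuum = false;
    *wraparound = false;
}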
Re: Commitfest 2021-11 Patch Triage - Part 3
On Tue, 2021-12-14 at 07:41 -0800, Mark Dilger wrote:
> I went a different direction with this. The need is to prevent non-superuser subscription owners from *circumventing* RLS. For this patch set, I'm just checking whether RLS should be enforced against the subscription owner, and if so, raising an ERROR, same as for a privilege violation. This should be sufficient in practice, as the table owner, roles with bypassrls privilege, and superusers should still be able to replicate into an RLS-enabled table.

I agree with this. Let's consider subscription targets involving RLS a separate feature that is just blocked for now (except for bypassrls roles and superusers).

> As far as completeness and a clean implementation (but not performance) are concerned, using ExecInsert is quite appealing. I think that would require reworking the logical replication protocol to send more than one row at a time, so that the overhead of such a choice is not paid *per row*. That seems quite out of scope for this release cycle, though.

Are you sure overhead is a problem? INSERT INTO ... SELECT goes through that path. And the existing logical replication insert path does some extra stuff, like opening the table for each row, so it's not clear to me that logical replication is much faster.

Also, the current patch does per-row ACL checks. Did you consider making the checks a part of logicalrep_rel_open()? That already has a path to consider relcache invalidation, so it would also be a good place to check if the table has RLS enabled.

Regards, Jeff Davis
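[To illustrate the kind of check being described, here is a minimal sketch. The placement inside logicalrep_rel_open(), the variable names, and the error wording are assumptions for illustration, not the patch's actual code.]

    /* Hypothetical: refuse to apply changes if RLS would apply to the owner */
    if (check_enable_rls(RelationGetRelid(rel), InvalidOid, false) == RLS_ENABLED)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot replicate into table \"%s\" with row-level security enabled",
                        RelationGetRelationName(rel))));

[check_enable_rls() already answers "does RLS apply to this relation for the current user?", so a once-per-relation-open check along these lines would avoid paying the cost per row.]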
Re: Add id's to various elements in protocol.sgml
On Sun, Dec 5, 2021 at 11:15 AM Daniel Gustafsson wrote:
> > On 5 Dec 2021, at 16:51, Brar Piening wrote:
> > The attached patch adds id's to various elements in protocol.sgml to make them more accessible via the public html documentation interface.
>
> Off the cuff without having checked the compiled results yet, it seems like a good idea.
>
> —
> Daniel Gustafsson

I wanted to do something similar for every function specification in the docs. This may inspire me to take another shot at that.
Re: Column Filtering in Logical Replication
On 12/14/21 17:43, Alvaro Herrera wrote:
> On 2021-Dec-13, Alvaro Herrera wrote:
> > I think this means we need a new OBJECT_PUBLICATION_REL_COLUMN value in the ObjectType (paralleling OBJECT_COLUMN), and no new ObjectClass value. Looking now to confirm.
>
> After working on this a little bit more, I realized that this is a bad idea overall. It causes lots of complications and it's just not worth it. So I'm back at my original thought that we need to throw an ERROR at ALTER TABLE .. DROP COLUMN time if the column is part of a replication column filter, and suggest the user to remove the column from the filter first and reattempt the DROP COLUMN.
>
> This means that we need to support changing the column list of a table in a publication. I'm looking at implementing some form of ALTER PUBLICATION for that.

Yeah. I think it's not clear if this should behave more like an index or a view. When an indexed column gets dropped we simply drop the index. But if you drop a column referenced by a view, we fail with an error. I think we should handle this more like a view, because publications are externally visible objects too (while indexes are pretty much just an implementation detail).

But why would it be easier not to add a new object type? We still need to check there is no publication referencing the column - either you do that automatically through a dependency, or you do that by custom code. Using a dependency seems better to me, but I don't know what are the complications you mentioned.

regards

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Column Filtering in Logical Replication
On 2021-Dec-14, Tomas Vondra wrote:
> Yeah. I think it's not clear if this should behave more like an index or a view. When an indexed column gets dropped we simply drop the index. But if you drop a column referenced by a view, we fail with an error. I think we should handle this more like a view, because publications are externally visible objects too (while indexes are pretty much just an implementation detail).

I agree -- I think it's more like a view than like an index. (The original proposal was that if you dropped a column that was part of the column list of a relation in a publication, the entire relation is dropped from the publication, but that doesn't seem very friendly behavior -- you break the replication stream immediately if you do that, and the only way to fix it is to send a fresh copy of the remaining subset of columns.)

> But why would it be easier not to add new object type? We still need to check there is no publication referencing the column - either you do that automatically through a dependency, or you do that by custom code. Using a dependency seems better to me, but I don't know what are the complications you mentioned.

The problem is that we need a way to represent the object "column of a table in a publication". I found myself adding a lot of additional code to support OBJECT_PUBLICATION_REL_COLUMN and that seemed like too much.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add id's to various elements in protocol.sgml
On 2021-Dec-05, Brar Piening wrote:
> When working with the Frontend/Backend Protocol implementation in Npgsql and discussing things with the team, I often struggle with the fact that you can't set deep links to individual message formats in the somewhat lengthy html docs pages.
>
> The attached patch adds id's to various elements in protocol.sgml to make them more accessible via the public html documentation interface.
>
> I've tried to follow the style that was already there in a few of the elements.

Hmm, I think we tend to avoid xreflabels; see https://www.postgresql.org/message-id/8315c0ca-7758-8823-fcb6-f37f9413e...@2ndquadrant.com

-- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/14/21, 9:00 AM, "Bruce Momjian" wrote: > Have we changed temporary file handling in any recent major releases, > meaning is this a current problem or one already improved in PG 14. I haven't noticed any recent improvements while working in this area. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/14/21, 12:09 PM, "Bossart, Nathan" wrote: > On 12/14/21, 9:00 AM, "Bruce Momjian" wrote: >> Have we changed temporary file handling in any recent major releases, >> meaning is this a current problem or one already improved in PG 14. > > I haven't noticed any recent improvements while working in this area. On second thought, the addition of the remove_temp_files_after_crash GUC is arguably an improvement since it could prevent files from accumulating after repeated crashes. Nathan
Re: WIN32 pg_import_system_collations
On Mon, Dec 13, 2021 at 9:54 PM Thomas Munro wrote:
> I haven't tested yet but +1 for the feature. I guess the API didn't exist at the time collation support was added.

Good to hear.

> This conversion makes sense, to keep the user experience the same across platforms. Nitpick on the comment: why ANSI? I think we can call "en_NZ" a POSIX locale identifier[1], and I think we can call "en-NZ" a BCP 47 language tag.

POSIX also works for me.

> When would the full set of locales not be installed on a Windows system, and why does this need Visual Studio? Wondering if this test will work with some of the frankenstein/cross tool chains (not objecting if it doesn't and could be skipped, just trying to understand the comment).

What I meant to say is that to run the test, you need a database that has successfully run pg_import_system_collations. This would also be possible in MinGW for _WIN32_WINNT >= 0x0600, but the current value in src\include\port\win32.h is _WIN32_WINNT = 0x0501 when compiling with MinGW.

Regards, Juan José Santamaría Flecha
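[For context on the version dependency: the Windows locale-enumeration API involved is only declared for _WIN32_WINNT >= 0x0600 (Vista), so any such code has to be guarded roughly like this. This is only a sketch; the callback name and error handling are made up.]

    #if defined(WIN32) && _WIN32_WINNT >= 0x0600
        /* EnumSystemLocalesEx() reports BCP 47 names such as "en-NZ" */
        if (!EnumSystemLocalesEx(win32_locale_callback, LOCALE_ALL,
                                 (LPARAM) &param, NULL))
            elog(WARNING, "could not enumerate system locales");
    #endif

[With _WIN32_WINNT still at 0x0501 for MinGW builds, the declaration simply isn't visible there, which is why a MinGW-built server cannot be assumed to have imported the system collations.]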
Re: Defining (and possibly skipping) useless VACUUM operations
On Tue, Dec 14, 2021 at 10:47 AM Robert Haas wrote: > Well I just don't understand why you insist on using the word > "skipping." I think what we're talking about - or at least what we > should be talking about - is whether relation_needs_vacanalyze() sets > *wraparound = true right after the comment that says /* Force vacuum > if table is at risk of wraparound */. And adding some kind of > exception to the logic that's there now. Actually, I agree. Skipping is the wrong term, especially because the phrase "VACUUM skips..." is already too overloaded. Not necessarily in vacuumlazy.c itself, but certainly on the mailing list. > Yeah, I hadn't thought about it from that perspective, but that does > seem very good. I think it's inevitable that there will be cases where > that doesn't work out - e.g. you can always force the bad case by > holding a table lock until your newborn heads off to college, or just > by overthrottling autovacuum so that it can't get through the database > in any reasonable amount of time - but it will be nice when it does > work out, for sure. Right. But when the patch doesn't manage to totally prevent anti-wraparound VACUUMs, things still work out a lot better than they would now. I would expect that in practice this will usually only happen when non-aggressive autovacuums keep getting canceled. And sure, it's still not ideal that things have come to that. But because we now do freezing earlier (when it's relatively inexpensive), and because we set all-frozen bits incrementally, the anti-wraparound autovacuum will at least be able to reuse any freezing that we manage to do in all those canceled autovacuums. I think that this tends to make anti-wraparound VACUUMs mostly about not being cancelable -- not so much about reliably advancing relfrozenxid. I mean it doesn't change the basic rules (there is no change to the definition of aggressive VACUUM), but in practice I think that it'll just work that way. Which makes a great deal of sense. I hope to be able to totally get rid of vacuum_freeze_table_age. The freeze map work in PostgreSQL 9.6 was really great, and very effective. But I think that it had an undesirable interaction with vacuum_freeze_min_age: if we set a heap page as all-visible (but not all frozen) before some of its tuples reached that age (which is very likely), then tuples < vacuum_freeze_min_age aren't going to get frozen until whenever we do an aggressive autovacuum. Very often, this will only happen when we next do an anti-wraparound VACUUM (at least before Postgres 13). I suspect we risk running into a "debt cliff" in the eventual anti-wraparound autovacuum. And so while vacuum_freeze_min_age kinda made sense prior to 9.6, it now seems to make a lot less sense. > > I guess that that makes avoiding useless vacuuming seem like less of a > > priority. ISTM that it should be something that is squarely aimed at > > keeping things stable in truly pathological cases. > > Yes. I think "pathological cases" is a good summary of what's wrong > with autovacuum. This is 100% my focus, in general. The main goal of the patch I'm working on isn't so much improving performance as making it more predictable over time. Focussing on freezing while costs are low has a natural tendency to spread the costs out over time. The system should never "get in over its head" with debt that vacuum is expected to eventually deal with. > When there's nothing too crazy happening, it actually > does pretty well. 
But, when resources are tight or other corner cases > occur, really dumb things start to happen. So it's reasonable to think > about how we can install guard rails that prevent complete insanity. Another thing that I really want to stamp out is anything involving a tiny, seemingly-insignificant adverse event that has the potential to cause disproportionate impact over time. For example, right now a non-aggressive VACUUM will never be able to advance relfrozenxid when it cannot get a cleanup lock on one heap page. It's actually extremely unlikely that that should have much of any impact, at least when you determine the new relfrozenxid for the table intelligently. Not acquiring one cleanup lock on one heap page on a huge table should not have such an extreme impact. It's even worse when the systemic impact over time is considered. Let's say you only have a 20% chance of failing to acquire one or more cleanup locks during a non-aggressive autovacuum for a given large table, meaning that you'll fail to advance relfrozenxid in at least 20% of all non-aggressive autovacuums. I think that that might be a lot worse than it sounds, because the impact compounds over time -- I'm not sure that 20% is much worse than 60%, or much better than 5% (very hard to model it). But if we make the high-level, abstract idea of "aggressiveness" more of a continuous thing, and not something that's defined by sharp (and largely meaningless) XID-based cutoffs, we have every chance of nipping these problems i
Life cycles of tuple descriptors
Hi,

There are some things about the life cycle of the common TupleDesc that I'm not 100% sure about.

1. In general, if you get one from relcache or typcache, it is reference-counted, right?

2. The only exception I know of is if you ask the typcache for a blessed one (RECORD+typmod), and it is found in the shared-memory cache. It is not refcounted then. If it is found only locally, it is refcounted.

3. Is that shared case the only way you could see a non-refcounted TupleDesc handed to you by the typcache?

4. How long can such a non-refcounted TupleDesc from the typcache be counted on to live? There is a comment in expandedrecord.c: "If it's not refcounted, just assume it will outlive the expanded object." That sounds like confidence, but is it confidence in the longevity of shared TupleDescs, or in the fleeting lives of expanded records? The same comment says "(This can happen for shared record types, for instance.)" Does the "for instance" mean there are now, or just in future might be, other cases where the typcache hands you a non-refcounted TupleDesc?

5. When a constructed TupleDesc is blessed, the copy placed in the cache by assign_record_type_typmod is born with a refcount of 1. Assuming every later user of that TupleDesc plays nicely with balanced pin and release, what event(s), if any, could ever occur to decrease that initial 1 to 0?

Thanks for any help getting these points straight!

Regards, -Chap
Re: Remove pg_strtouint64(), use strtoull() directly
On 13.12.21 15:44, Tom Lane wrote: Our current hard-coded uses of long long are all written on the assumption that it's*at least* 64 bits, so we'd survive OK on such a platform so long as we don't start confusing it with *exactly* 64 bits. OK, makes sense. Here is an alternative patch. It introduces two light-weight macros strtoi64() and strtou64() (compare e.g., strtoimax() in POSIX) in c.h and removes pg_strtouint64(). This moves the portability layer from numutils.c to c.h, so it's closer to the rest of the int64 portability code. And that way it is available to not just server code. And it resolves the namespace collision with the pg_strtointNN() functions in numutils.c.From e723f9480be4c466f9c3f59a75af951d94f2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 14 Dec 2021 08:38:35 +0100 Subject: [PATCH v2] Simplify the general-purpose 64-bit integer parsing APIs pg_strtouint64() is a wrapper around strtoull/strtoul/_strtoui64, but it seems no longer necessary to have this indirection. msvc/Solution.pm claims HAVE_STRTOULL, so the "MSVC only" part seems unnecessary. Also, we have code in c.h to substitute alternatives for strtoull() if not found, and that would appear to cover all currently supported platforms, so having a further fallback in pg_strtouint64() seems unnecessary. Therefore, we could remove pg_strtouint64(), and use strtoull() directly in all call sites. However, it seems useful to keep a separate notation for parsing exactly 64-bit integers, matching the type definition int64/uint64. For that, add new macros strtoi64() and strtou64() in c.h as thin wrappers around strtol()/strtoul() or strtoll()/stroull(). This makes these functions available everywhere instead of just in the server code, and it makes the function naming notably different from the pg_strtointNN() functions in numutils.c, which have a different API. --- src/backend/nodes/readfuncs.c | 2 +- src/backend/utils/adt/numutils.c | 22 -- src/backend/utils/adt/xid.c | 2 +- src/backend/utils/adt/xid8funcs.c | 6 +++--- src/backend/utils/misc/guc.c | 2 +- src/include/c.h | 13 + src/include/utils/builtins.h | 1 - 7 files changed, 19 insertions(+), 29 deletions(-) diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c index dcec3b728f..76cff8a2b1 100644 --- a/src/backend/nodes/readfuncs.c +++ b/src/backend/nodes/readfuncs.c @@ -80,7 +80,7 @@ #define READ_UINT64_FIELD(fldname) \ token = pg_strtok(&length); /* skip :fldname */ \ token = pg_strtok(&length); /* get field value */ \ - local_node->fldname = pg_strtouint64(token, NULL, 10) + local_node->fldname = strtou64(token, NULL, 10) /* Read a long integer field (anything written as ":fldname %ld") */ #define READ_LONG_FIELD(fldname) \ diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c index b93096f288..6a9c00fdd3 100644 --- a/src/backend/utils/adt/numutils.c +++ b/src/backend/utils/adt/numutils.c @@ -606,25 +606,3 @@ pg_ultostr(char *str, uint32 value) return str + len; } - -/* - * pg_strtouint64 - * Converts 'str' into an unsigned 64-bit integer. - * - * This has the identical API to strtoul(3), except that it will handle - * 64-bit ints even where "long" is narrower than that. - * - * For the moment it seems sufficient to assume that the platform has - * such a function somewhere; let's not roll our own. 
- */ -uint64 -pg_strtouint64(const char *str, char **endptr, int base) -{ -#ifdef _MSC_VER/* MSVC only */ - return _strtoui64(str, endptr, base); -#elif defined(HAVE_STRTOULL) && SIZEOF_LONG < 8 - return strtoull(str, endptr, base); -#else - return strtoul(str, endptr, base); -#endif -} diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c index 24c1c93732..a09096d018 100644 --- a/src/backend/utils/adt/xid.c +++ b/src/backend/utils/adt/xid.c @@ -158,7 +158,7 @@ xid8in(PG_FUNCTION_ARGS) { char *str = PG_GETARG_CSTRING(0); - PG_RETURN_FULLTRANSACTIONID(FullTransactionIdFromU64(pg_strtouint64(str, NULL, 0))); + PG_RETURN_FULLTRANSACTIONID(FullTransactionIdFromU64(strtou64(str, NULL, 0))); } Datum diff --git a/src/backend/utils/adt/xid8funcs.c b/src/backend/utils/adt/xid8funcs.c index f1870a7366..68985dca5a 100644 --- a/src/backend/utils/adt/xid8funcs.c +++ b/src/backend/utils/adt/xid8funcs.c @@ -295,12 +295,12 @@ parse_snapshot(const char *str) char *endp; StringInfo buf; - xmin = FullTransactionIdFromU64(pg_strtouint64(str, &endp, 10)); + xmin = FullTransactionIdFromU64(strtou64(str, &endp, 10)); if (*endp != ':') goto bad_format; str = endp + 1; - xmax = FullTransactionIdFromU64(pg_strtouint64(str, &endp, 10)); + xmax = FullTransactionIdFromU64
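[The hunk that actually adds the new macros to c.h is cut off in the quoted patch above; going by the commit message, the definitions are presumably thin wrappers along these lines. This is a sketch based on the description, not the verbatim patch text.]

    /* Parse exactly-64-bit integers, regardless of how wide "long" is */
    #ifdef HAVE_LONG_INT_64
    #define strtoi64(str, endptr, base) ((int64) strtol(str, endptr, base))
    #define strtou64(str, endptr, base) ((uint64) strtoul(str, endptr, base))
    #else
    #define strtoi64(str, endptr, base) ((int64) strtoll(str, endptr, base))
    #define strtou64(str, endptr, base) ((uint64) strtoull(str, endptr, base))
    #endif

[That keeps the call sites spelled in terms of the int64/uint64 typedefs while deferring to whichever libc function has the right width on the platform.]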
Re: Column Filtering in Logical Replication
On 12/14/21 20:35, Alvaro Herrera wrote: On 2021-Dec-14, Tomas Vondra wrote: Yeah. I think it's not clear if this should behave more like an index or a view. When an indexed column gets dropped we simply drop the index. But if you drop a column referenced by a view, we fail with an error. I think we should handle this more like a view, because publications are externally visible objects too (while indexes are pretty much just an implementation detail). I agree -- I think it's more like a view than like an index. (The original proposal was that if you dropped a column that was part of the column list of a relation in a publication, the entire relation is dropped from the view, but that doesn't seem very friendly behavior -- you break the replication stream immediately if you do that, and the only way to fix it is to send a fresh copy of the remaining subset of columns.) Right, that's my reasoning too. But why would it be easier not to add new object type? We still need to check there is no publication referencing the column - either you do that automatically through a dependency, or you do that by custom code. Using a dependency seems better to me, but I don't know what are the complications you mentioned. The problem is that we need a way to represent the object "column of a table in a publication". I found myself adding a lot of additional code to support OBJECT_PUBLICATION_REL_COLUMN and that seemed like too much. My experience with dependencies is pretty limited, but can't we simply make a dependency between the whole publication and the column? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
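[For what it's worth, a per-column dependency of that kind would presumably be just another pg_depend entry recorded when the column list is stored, something like this sketch. Variable names are illustrative only.]

    /* Make the publication membership depend on each listed column */
    ObjectAddress myself, referenced;

    ObjectAddressSet(myself, PublicationRelRelationId, pubreloid);
    ObjectAddressSubSet(referenced, RelationRelationId, relid, attnum);
    recordDependencyOn(&myself, &referenced, DEPENDENCY_NORMAL);

[With a NORMAL dependency, ALTER TABLE ... DROP COLUMN would then refuse to drop the column unless CASCADE is given, which is essentially the view-like behavior discussed above.]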
Re: Column Filtering in Logical Replication
Hi,

I went through the v9 patch, and I have a couple of comments / questions. Apologies if some of this was already discussed earlier; it's hard to cross-check in such a long thread. Most of the comments are in 0002 to make them easier to locate, and it also makes the proposed code changes clearer, I think.

1) check_publication_add_relation - the "else" branch is not really needed, because the "if (replidentfull)" always errors out

2) publication_add_relation has a FIXME about handling cases with different column lists

So what's the right behavior for ADD TABLE with a different column list? I'd say we should allow that, and that it should be mostly the same thing as adding/removing columns to the list incrementally, i.e. we should replace the column lists. We could also prohibit such changes, but that seems like a really annoying limitation, forcing people to remove/add the relation.

I added some comments to the attmap translation block, and replaced the <0 check with AttrNumberIsForUserDefinedAttr. But I wonder if we could get rid of the offset, considering we're dealing with just user-defined attributes. That'd make the code clearer, but it would break if we're comparing it to other bitmaps with offsets. But I don't think we do.

3) I doubt "att_map" is the right name, though. AFAICS it's just a list of columns for the relation, not a map, right? So maybe attr_list?

4) AlterPublication talks about "publication status" for a column, but do we actually track that? Or what does that mean?

5) PublicationDropTables does a check

    if (pubrel->columns)
        ereport(ERROR,
                errcode(ERRCODE_SYNTAX_ERROR),

Shouldn't this be prevented by the grammar, really? Also, it should be in regression tests.

6) Another thing that should be in the test is a partitioned table with attribute mapping and a column list, to see how map and attr_map interact.

7) There's a couple of places doing this

    if (att_map != NULL &&
        !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber, att_map) &&
        !bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber, idattrs) &&
        !replidentfull)

which is really hard to understand (even if we get rid of the offset), so maybe let's move that to a function with a sensible name (see the sketch after the patch below). Also, some places don't check idattrs - seems a bit suspicious.
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom fb5ce02d36b46f92ab01c9a823cc4e315cfcb73c Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 14 Dec 2021 20:53:28 +0100 Subject: [PATCH 1/2] v9 --- doc/src/sgml/ref/alter_publication.sgml | 4 +- doc/src/sgml/ref/create_publication.sgml | 11 +- src/backend/catalog/dependency.c | 8 +- src/backend/catalog/objectaddress.c | 8 + src/backend/catalog/pg_depend.c | 50 ++ src/backend/catalog/pg_publication.c | 106 +++- src/backend/commands/publicationcmds.c | 94 ++- src/backend/commands/tablecmds.c | 10 +- src/backend/nodes/copyfuncs.c| 1 + src/backend/nodes/equalfuncs.c | 1 + src/backend/parser/gram.y| 36 - src/backend/replication/logical/proto.c | 97 --- src/backend/replication/logical/tablesync.c | 117 +- src/backend/replication/pgoutput/pgoutput.c | 80 +++-- src/bin/pg_dump/pg_dump.c| 41 - src/bin/pg_dump/pg_dump.h| 1 + src/bin/psql/describe.c | 26 ++- src/bin/psql/tab-complete.c | 2 + src/include/catalog/dependency.h | 3 + src/include/catalog/pg_publication.h | 1 + src/include/catalog/pg_publication_rel.h | 3 + src/include/commands/publicationcmds.h | 2 +- src/include/nodes/parsenodes.h | 1 + src/include/replication/logicalproto.h | 6 +- src/test/regress/expected/publication.out| 39 - src/test/regress/sql/publication.sql | 19 ++- src/test/subscription/t/021_column_filter.pl | 162 +++ 27 files changed, 857 insertions(+), 72 deletions(-) create mode 100644 src/test/subscription/t/021_column_filter.pl diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml index bb4ef5e5e2..c86055b93c 100644 --- a/doc/src/sgml/ref/alter_publication.sgml +++ b/doc/src/sgml/ref/alter_publication.sgml @@ -30,7 +30,7 @@ ALTER PUBLICATION name RENAME TO where publication_object is one of: -TABLE [ ONLY ] table_name [ * ] [, ... ] +TABLE [ ONLY ] table_name [ * ] [ ( column_name, [, ... ] ) ] [, ... ] ALL TABLES IN SCHEMA { schema_name | CURRENT_SCHEMA } [, ... ] @@ -110,6 +110,8 @@ ALTER PUBLICATION name RENAME TO * can be specified after the table name to explicitly indicate that descendant tables
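[Regarding comment 7 above, a possible shape for such a helper, derived from the quoted condition. The function name and signature are invented for illustration; FirstLowInvalidHeapAttributeNumber comes from access/sysattr.h.]

    /*
     * Hypothetical helper: should this attribute be published, given the
     * publication's column list (NULL means "all columns")?
     */
    static bool
    column_in_column_list(Form_pg_attribute att, Bitmapset *att_map,
                          Bitmapset *idattrs, bool replidentfull)
    {
        if (replidentfull || att_map == NULL)
            return true;

        /* listed explicitly, or part of the replica identity */
        return bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber, att_map) ||
               bms_is_member(att->attnum - FirstLowInvalidHeapAttributeNumber, idattrs);
    }

[Call sites in proto.c and pgoutput.c could then share the same test, including the same treatment of idattrs, which would also address the "some places don't check idattrs" concern.]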
Re: WIN32 pg_import_system_collations
On Wed, Dec 15, 2021 at 9:14 AM Juan José Santamaría Flecha wrote: > What I meant to say is that to run the test, you need a database that has > successfully run pg_import_system_collations. This would be also possible in > Mingw for _WIN32_WINNT> = 0x0600, but the current value in > src\include\port\win32.h is _WIN32_WINNT = 0x0501 when compiling with Mingw. Ah, right. I hope we can make the leap to 0x0A00 (Win10) soon and just stop thinking about these old ghosts, as mentioned by various people in various threads. Do you happen to know if there are complications for that, with the non-MSVC tool chains?
Re: Synchronizing slots from primary to standby
On 31.10.21 11:08, Peter Eisentraut wrote: I want to reactivate $subject. I took Petr Jelinek's patch from [0], rebased it, added a bit of testing. It basically works, but as mentioned in [0], there are various issues to work out. The idea is that the standby runs a background worker to periodically fetch replication slot information from the primary. On failover, a logical subscriber would then ideally find up-to-date replication slots on the new publisher and can just continue normally. So, again, this isn't anywhere near ready, but there is already a lot here to gather feedback about how it works, how it should work, how to configure it, and how it fits into an overall replication and HA architecture. Here is an updated patch. The main changes are that I added two configuration parameters. The first, synchronize_slot_names, is set on the physical standby to specify which slots to sync from the primary. By default, it is empty. (This also fixes the recovery test failures that I had to disable in the previous patch version.) The second, standby_slot_names, is set on the primary. It holds back logical replication until the listed physical standbys have caught up. That way, when failover is necessary, the promoted standby is not behind the logical replication consumers. In principle, this works now, I think. I haven't made much progress in creating more test cases for this; that's something that needs more attention. It's worth pondering what the configuration language for standby_slot_names should be. Right now, it's just a list of slots that all need to be caught up. More complicated setups are conceivable. Maybe you have standbys S1 and S2 that are potential failover targets for logical replication consumers L1 and L2, and also standbys S3 and S4 that are potential failover targets for logical replication consumers L3 and L4. Viewed like that, this setting could be a replication slot setting. The setting might also have some relationship with synchronous_standby_names. Like, if you have synchronous_standby_names set, then that's a pretty good indication that you also want some or all of those standbys in standby_slot_names. (But note that one is slots and one is application names.) 
So there are a variety of possibilities.From f1d306bfe3eb657b5d14b0ef3024586083beb4ed Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 14 Dec 2021 22:20:59 +0100 Subject: [PATCH v2] Synchronize logical replication slots from primary to standby Discussion: https://www.postgresql.org/message-id/flat/514f6f2f-6833-4539-39f1-96cd1e011f23%40enterprisedb.com --- doc/src/sgml/config.sgml | 34 ++ src/backend/commands/subscriptioncmds.c | 4 +- src/backend/postmaster/bgworker.c | 3 + .../libpqwalreceiver/libpqwalreceiver.c | 96 src/backend/replication/logical/Makefile | 1 + src/backend/replication/logical/launcher.c| 202 ++--- .../replication/logical/reorderbuffer.c | 85 src/backend/replication/logical/slotsync.c| 412 ++ src/backend/replication/logical/tablesync.c | 13 +- src/backend/replication/repl_gram.y | 32 +- src/backend/replication/repl_scanner.l| 1 + src/backend/replication/slotfuncs.c | 2 +- src/backend/replication/walsender.c | 196 - src/backend/utils/activity/wait_event.c | 3 + src/backend/utils/misc/guc.c | 23 + src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/commands/subscriptioncmds.h | 3 + src/include/nodes/nodes.h | 1 + src/include/nodes/replnodes.h | 9 + src/include/replication/logicalworker.h | 9 + src/include/replication/slot.h| 4 +- src/include/replication/walreceiver.h | 16 + src/include/replication/worker_internal.h | 8 +- src/include/utils/wait_event.h| 1 + src/test/recovery/t/030_slot_sync.pl | 58 +++ 25 files changed, 1148 insertions(+), 70 deletions(-) create mode 100644 src/backend/replication/logical/slotsync.c create mode 100644 src/test/recovery/t/030_slot_sync.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index afbb6c35e3..2b2a21a251 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4406,6 +4406,23 @@ Primary Server + + standby_slot_names (string) + + standby_slot_names configuration parameter + + + + +List of physical replication slots that logical replication waits for. +If a logical replication connection is meant to switch to a physical +standby after the standby is promoted, the physical replication slot +for the standby should be listed here. This ensures that logical +replication is not ahead of the physical standby. + + +
Re: Synchronizing slots from primary to standby
On 24.11.21 07:11, Masahiko Sawada wrote: I haven’t looked at the patch deeply but regarding 007_sync_rep.pl, the tests seem to fail since the tests rely on the order of the wal sender array on the shared memory. Since a background worker for synchronizing replication slots periodically connects to the walsender on the primary and disconnects, it breaks the assumption of the order. Regarding 010_logical_decoding_timelines.pl, I guess that the patch breaks the test because the background worker for synchronizing slots on the replica periodically advances the replica's slot. I think we need to have a way to disable the slot synchronization or to specify the slot name to sync with the primary. I'm not sure we already discussed this topic but I think we need it at least for testing purposes. This has been addressed by patch v2 that adds such a setting.
Re: pg_dump versus ancient server versions
I wrote: > Anyway, it seems like there's some consensus that 9.2 is a good > stopping place for today. I'll push forward with > (1) back-patching as necessary to make 9.2 and up build cleanly > on the platforms I have handy; > (2) ripping out pg_dump's support for pre-9.2 servers; > (3) ripping out psql's support for pre-9.2 servers. I've completed the pg_dump/pg_dumpall part of that, but while updating the docs I started to wonder whether we shouldn't nuke pg_dump's --no-synchronized-snapshots option. As far as I can make out, the remaining use case for that is to let you perform an unsafe parallel dump from a standby server of an out-of-support major version. I'm not very clear why we allowed that at all, ever, rather than saying you can't parallelize in such cases. But for sure that remaining use case is paper thin, and leaving the option available seems way more likely to let people shoot themselves in the foot than to let them do anything helpful. Barring objections, I'll remove that option in a day or two. regards, tom lane
Re: Synchronizing slots from primary to standby
On 24.11.21 17:25, Dimitri Fontaine wrote: Is there a case to be made about doing the same thing for physical replication slots too? It has been considered. At the moment, I'm not doing it, because it would add more code and complexity and it's not that important. But it could be added in the future. Given the admitted state of the patch, I didn't focus on tests. I could successfully apply the patch on-top of current master's branch, and cleanly compile and `make check`. Then I also updated pg_auto_failover to support Postgres 15devel [2] so that I could then `make NODES=3 cluster` there and play with the new replication command: $ psql -d "port=5501 replication=1" -c "LIST_SLOTS;" psql:/Users/dim/.psqlrc:24: ERROR: XX000: cannot execute SQL commands in WAL sender for physical replication LOCATION: exec_replication_command, walsender.c:1830 ... I'm not too sure about this idea of running SQL in a replication protocol connection that you're mentioning, but I suppose that's just me needing to brush up on the topic. FWIW, the way the replication command parser works, if there is a parse error, it tries to interpret the command as a plain SQL command. But that only works for logical replication connections. So in physical replication, if you try to run anything that does not parse, you will get this error. But that has nothing to do with this feature. The above command works for me, so maybe something else went wrong in your situation. Maybe the first question about configuration would be about selecting which slots a standby should maintain from the primary. Is it all of the slots that exists on both the nodes, or a sublist of that? Is it possible to have a slot with the same name on a primary and a standby node, in a way that the standby's slot would be a completely separate entity from the primary's slot? If yes (I just don't know at the moment), well then, should we continue to allow that? This has been added in v2. Also, do we want to even consider having the slot management on a primary node depend on the ability to sync the advancing on one or more standby nodes? I'm not sure to see that one as a good idea, but maybe we want to kill it publically very early then ;-) I don't know what you mean by this.
Re: Synchronizing slots from primary to standby
On 28.11.21 07:52, Bharath Rupireddy wrote:
> 1) Instead of a new LIST_SLOT command, can't we use READ_REPLICATION_SLOT (slight modifications need to be done to make it support logical replication slots and to get more information from the subscriber)?

I looked at that but didn't see an obvious way to consolidate them. This is something we could look at again later.

> 2) How frequently is the new bg worker going to sync the slot info? How can it ensure that the latest information exists, say, when the subscriber is down/crashed before it picks up the latest slot information?

The interval is currently hardcoded, but could be a configuration setting. In the v2 patch, there is a new setting that orders physical replication before logical so that the logical subscribers cannot get ahead of the physical standby.

> 3) Instead of the subscriber pulling the slot info, why can't the publisher (via the walsender or a new bg worker maybe?) push the latest slot info? I'm not sure we want to add more functionality to the walsender; if yes, isn't it going to be much simpler?

This sounds like the failover slot feature, which was rejected.
Re: Granting SET and ALTER SYSTE privileges for GUCs
On Mon, Dec 13, 2021 at 5:34 PM Mark Dilger wrote:
> > On Dec 13, 2021, at 1:33 PM, Mark Dilger wrote:
> > but will repost in a few hours
>
> ... and here it is:

Currently there is a failure in check-world (not sure if it's known):

diff -U3 /src/postgres/src/test/regress/expected/guc_privs.out /src/postgres/src/test/regress/results/guc_privs.out
--- /src/postgres/src/test/regress/expected/guc_privs.out 2021-12-14 14:11:45.0 +
+++ /src/postgres/src/test/regress/results/guc_privs.out 2021-12-14 15:50:52.219583421 +
@@ -39,8 +39,10 @@
 ALTER SYSTEM RESET autovacuum; -- ok
 -- PGC_SUSET
 SET lc_messages = 'en_US.UTF-8'; -- ok
+ERROR: invalid value for parameter "lc_messages": "en_US.UTF-8"
 RESET lc_messages; -- ok
 ALTER SYSTEM SET lc_messages = 'en_US.UTF-8'; -- ok
+ERROR: invalid value for parameter "lc_messages": "en_US.UTF-8"
 ALTER SYSTEM RESET lc_messages; -- ok
 -- PGC_SU_BACKEND
 SET jit_debugging_support = OFF; -- fail, cannot be set after connection start

Aside from that, I've tested this and it seems to function as advertised, and in my view is worth adding to PG.

One thing that seems like an omission to me is the absence of an InvokeObjectPostAlterHook in pg_setting_acl_aclcheck or pg_setting_acl_aclmask, so that MAC extensions can also block this. InvokeObjectPostCreateHook is already in the create path, so a PostAlter hook seems appropriate.

Thank you.
Re: Remove pg_strtouint64(), use strtoull() directly
Peter Eisentraut writes: > OK, makes sense. Here is an alternative patch. It introduces two > light-weight macros strtoi64() and strtou64() (compare e.g., strtoimax() > in POSIX) in c.h and removes pg_strtouint64(). This moves the > portability layer from numutils.c to c.h, so it's closer to the rest of > the int64 portability code. And that way it is available to not just > server code. And it resolves the namespace collision with the > pg_strtointNN() functions in numutils.c. Works for me. I'm not in a position to verify that this'll work on Windows, but the buildfarm will tell us that quickly enough. regards, tom lane
Re: daitch_mokotoff module
Tomas Vondra writes: [...] > > Thanks, looks interesting. A couple generic comments, based on a quick > code review. Thank you for the constructive review! > > 1) Can the extension be marked as trusted, just like fuzzystrmatch? I have now moved the daitch_mokotoff function into the fuzzystrmatch module, as suggested by Andrew Dunstan. > > 2) The docs really need an explanation of what the extension is for, > not just a link to fuzzystrmatch. Also, a couple examples would be > helpful, I guess - similarly to fuzzystrmatch. The last line in the > docs is annoyingly long. Please see the updated documentation for the fuzzystrmatch module. > > 3) What's daitch_mokotov_header.pl for? I mean, it generates the > header, but when do we need to run it? It only has to be run if the soundex rules are changed. I have now made the dependencies explicit in the fuzzystrmatch Makefile. > > 4) It seems to require perl-open, which is a module we did not need > until now. Not sure how well supported it is, but maybe we can use a > more standard module? I believe Perl I/O layers have been part of Perl core for two decades now :-) > > 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of > testing, but our regression tests certainly don't need it (or the > palloc mockup). I suggest to get rid of it. Done. BTW this was modeled after dmetaphone.c > > 6) I really don't understand some of the comments in > daitch_mokotov.sql, like for example: > > -- testEncodeBasic > -- Tests covered above are omitted. > > Also, comments with names of Java methods seem pretty confusing. It'd > be better to actually explain what rules are the tests checking. The tests were copied from various web sites and implementations. I have cut down on the number of tests and made the comments more to the point. > > 7) There are almost no comments in the .c file (ignoring the comment > on top). Short functions like initialize_node are probably fine > without one, but e.g. update_node would deserve one. More comments are added to both the .h and the .c file. > > 8) Some of the lines are pretty long (e.g. the update_node signature > is almost 170 chars). That should be wrapped. Maybe try running > pgindent on the code, that'll show which parts need better formatting > (so as not to get broken later). Fixed. I did run pgindent earlier, however it didn't catch those long lines. > > 9) I'm sure there's better way to get the number of valid chars than this: > > for (i = 0, ilen = 0; (c = read_char(&str[i], &ilen)) && (c < 'A' || > c > ']'); i += ilen) > { > } > > Say, a while loop or something? The code gets to the next encodable character, skipping any other characters. I have now added a comment which should hopefully make this clearer, and broken up the for loop for readability. Please find attached the revised patch. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..826e529e3e 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,11 +3,12 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" REGRESS = fuzzystrmatch @@ -22,3 +23,8 @@ top_builddir = ../.. 
include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..302e9a6d86 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,516 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - "J" is considered a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat
Re: Life cycles of tuple descriptors
Chapman Flack writes: > There are some things about the life cycle of the common TupleDesc > that I'm not 100% sure about. > 1. In general, if you get one from relcache or typcache, it is >reference-counted, right? Tupdescs for named composite types should be, since those are potentially modifiable by DDL. The refcount allows a stale tupdesc to go away when it's no longer referenced anywhere. Tupdescs for RECORD types are a different story: there's no way to change them once created, so they'll live as long as the process does (at least for tupdescs kept in the typcache). Refcounting isn't terribly necessary in that case; and at least for the shared tupdesc case, we don't do it, to avoid questions of modifying a piece of shared state. > 3. Is that shared case the only way you could see a non-refcounted >TupleDesc handed to you by the typcache? I'm not sure what happens for a non-shared RECORD tupdesc, but it probably wouldn't be wise to assume anything either way. > 5. When a constructed TupleDesc is blessed, the copy placed in the cache >by assign_record_type_typmod is born with a refcount of 1. Assuming every >later user of that TupleDesc plays nicely with balanced pin and release, >what event(s), if any, could ever occur to decrease that initial 1 to 0? That refcount is describing the cache's own reference, so as long as that reference remains it'd be incorrect to decrease the refcount to 0. regards, tom lane
Re: Life cycles of tuple descriptors
On 12/14/21 18:03, Tom Lane wrote: > Tupdescs for RECORD types are a different story: ... Refcounting > isn't terribly necessary in that case; and at least for the shared > tupdesc case, we don't do it, to avoid questions of modifying a > piece of shared state. Ok, that's kind of what I was getting at here: >> 5. When a constructed TupleDesc is blessed, the copy placed in the cache >>by assign_record_type_typmod is born with a refcount of 1. ... >>what event(s), if any, could ever occur to decrease that initial 1 ... > > That refcount is describing the cache's own reference, so as long as > that reference remains it'd be incorrect to decrease the refcount to 0. In the case of a blessed RECORD+typmod tupdesc, *is* there any event that could ever cause the cache to drop that reference? Or is the tupdesc just going to live there for the life of the backend, its refcount sometimes going above and back down to but never below 1? That would fit with your "refcounting isn't terribly necessary in that case". If that's how it works, it's interesting having the two different patterns: if it's a shared one, it has refcount -1 and you never fuss with it and it never goes away; if it's a local one it has a non-negative refcount and you go through all the motions and it never goes away anyway. >> 3. Is that shared case the only way you could see a non-refcounted >>TupleDesc handed to you by the typcache? > > I'm not sure what happens for a non-shared RECORD tupdesc, but it > probably wouldn't be wise to assume anything either way. If I'm reading this right, for the non-shared case, the copy that goes into the cache is made refcounted. (The copy you presented for blessing gets the assigned typmod written in it, and no change to its refcount field.) There's really just one thing I'm interested in assuming: *In general*, if I encounter a tupdesc with -1 refcount, I had better not assume much about its longevity. It might be in a context that's about to go away. If I'll be wanting it later, I had better defensively copy it into some context I've chosen. (Ok, I guess if it's a tupdesc provided to me in a function call, I can assume it is good for the duration of the call, or if it's part of an SPI result, I can assume it's good until SPI_finish.) But if I have gone straight to the typcache to ask for a RECORD tupdesc, and the one it gives me has -1 refcount, is it reasonable to assume I can retain a reference to that without the defensive copy? Regards, -Chap
Re: Question about 001_stream_rep.pl recovery test
Thanks a lot, Michael, for the explanation.

On 2021-12-13 4:05 a.m., Michael Paquier wrote:
> On Fri, Dec 10, 2021 at 01:44:40PM -0800, David Zhang wrote:
> > Inside the test script `src/test/recovery/t/001_stream_rep.pl`, a comment at line 30 says `my_backup` is "not mandatory",
> >
> > 30 # Take backup of standby 1 (not mandatory, but useful to check if
> > 31 # pg_basebackup works on a standby).
> > 32 $node_standby_1->backup($backup_name);
> >
> > however if I remove the backup folder "my_backup" after line 32, something like below,
>
> Well, the comment is not completely incorrect IMO. In this context, I read that as: we don't need a backup from a standby and we could just take a backup from the primary instead. If I were to fix something, I would suggest to reword the comment as follows:
>
> # Take backup of standby 1 (could be taken from the primary, but this
> # is useful to check if pg_basebackup works on a standby).
>
> > And then the test script takes another backup `my_backup_2`, but it seems this backup is not used anywhere.
>
> This had better stay around, per commit f267c1c. And as far as I can see, we don't have an equivalent test in a different place, where pg_basebackup runs on a standby while its *primary* is stopped.
> --
> Michael

-- David Software Engineer Highgo Software Inc. (Canada) www.highgo.ca
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 8:24 PM Amit Kapila wrote: > > On Tue, Dec 14, 2021 at 3:41 PM Dilip Kumar wrote: > > > > On Tue, Dec 14, 2021 at 2:36 PM Amit Kapila wrote: > > > > > > > > > > > > > I agree with this theory. Can we reflect this in comments so that in > > > > > the future we know why we didn't pursue this direction? > > > > > > > > I might be missing something here, but for streaming, transaction > > > > users can decide whether they wants to skip or not only once we start > > > > applying no? I mean only once we start applying the changes we can > > > > get some errors and by that time we must be having all the changes for > > > > the transaction. > > > > > > > > > > That is right and as per my understanding, the patch is trying to > > > accomplish the same. > > > > > > > So I do not understand the point we are trying to > > > > discuss here? > > > > > > > > > > The point is that whether we can skip the changes while streaming > > > itself like when we get the changes and write to a stream file. Now, > > > it is possible that streams from multiple transactions can be > > > interleaved and users can change the skip_xid in between. It is not > > > that we can't handle this but that would require a more complex design > > > and it doesn't seem worth it because we can anyway skip the changes > > > while applying as you mentioned in the previous paragraph. > > > > Actually, I was trying to understand the use case for skipping while > > streaming. Actually, during streaming we are not doing any database > > operation that means this will not generate any error. > > > > Say, there is an error the first time when we start to apply changes > for such a transaction. So, such a transaction will be streamed again. > Say, the user has set the skip_xid before we stream a second time, so > this time, we can skip it either during the stream phase or apply > phase. I think the patch is skipping it during apply phase. > Sawada-San, please confirm if my understanding is correct? My understanding is the same. The patch doesn't skip the streaming phase but starts skipping when starting to apply changes. That is, we receive streamed changes and write them to the stream file anyway regardless of skip_xid. When receiving the stream-commit message, we check whether or not we skip this transaction, and if so we apply all messages in the stream file other than all data modification messages. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: generalized conveyor belt storage
On Tue, Dec 14, 2021 at 3:00 PM Robert Haas wrote: > I got interested in this problem again because of the > idea discussed in > https://www.postgresql.org/message-id/CA%2BTgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A%40mail.gmail.com > of having a "dead TID" relation fork in which to accumulate TIDs that > have been marked as dead in the table but not yet removed from the > indexes, so as to permit a looser coupling between table vacuum and > index vacuum. That's yet another case where you accumulate new data > and then at a certain point the oldest data can be thrown away because > its intended purpose has been served. Thanks for working on this! It seems very strategically important to me. > So here's a patch. Basically, it lets you initialize a relation fork > as a "conveyor belt," and then you can add pages of basically > arbitrary data to the conveyor belt and then throw away old ones and, > modulo bugs, it will take care of recycling space for you. There's a > fairly detailed README in the patch if you want a more detailed > description of how the whole thing works. How did you test this? I ask because it would be nice if there was a convenient way to try this out, as somebody with a general interest. Even just a minimal test module, that you used for development work. > When I was chatting with Andres about this, he jumped to the question > of whether this could be used to replace SLRUs. To be honest, it's not > really designed for applications that are quite that intense. I personally think that SLRUs (and the related problem of hint bits) are best addressed by tracking which transactions have modified what heap blocks (perhaps only approximately), and then eagerly cleaning up aborted transaction IDs, using a specialized version of VACUUM that does something like heap pruning for aborted xacts. It just seems weird that we keep around clog in order to not have to run VACUUM too frequently, which (among other things) already cleans up after aborted transactions. Having a more efficient data structure for commit status information doesn't seem all that promising, because the problem is actually our insistence on remembering which XIDs aborted almost indefinitely. There is no fundamental reason why it has to work that way. I don't mean that it could in principle be changed (that's almost always true); I mean that it seems like an accident of history, that could have easily gone another way: the ancestral design of clog existing in a world without MVCC. It seems like a totally vestigial thing to me, which I wouldn't say about a lot of other things (just clog and freezing). Something like the conveyor belt seems like it would help with this other kind of VACUUM, mind you. We probably don't want to do anything like index vacuuming for these aborted transactions (actually maybe we want to do retail deletion, which is much more likely to work out there). But putting the TIDs into a store used for dead TIDs could make sense. Especially in extreme cases. -- Peter Geoghegan
Re: Life cycles of tuple descriptors
Chapman Flack writes: > But if I have gone straight to the typcache to ask for a RECORD tupdesc, > and the one it gives me has -1 refcount, is it reasonable to assume > I can retain a reference to that without the defensive copy? The API contract for lookup_rowtype_tupdesc specifies that you must "call ReleaseTupleDesc or DecrTupleDescRefCount when done using the tupdesc". It's safe to assume that the tupdesc will stick around as long as you haven't done that. APIs that don't mention a refcount are handing you a tupdesc of uncertain lifespan (no more than the current query, likely), so if you want the tupdesc to last a long time you'd better copy it into storage you control. regards, tom lane
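[Spelling the contract out as code, the usual pattern is roughly the following. This is a sketch; typid/typmod stand for whatever row type is being looked up.]

    TupleDesc   tupdesc;
    TupleDesc   mycopy;

    tupdesc = lookup_rowtype_tupdesc(typid, typmod);

    /* ... use tupdesc for the current operation ... */

    /* To keep the information past the current query, take a copy you own. */
    mycopy = CreateTupleDescCopy(tupdesc);

    /*
     * Then drop the cache reference; this is safe whether or not the
     * descriptor the typcache handed back was actually refcounted.
     */
    ReleaseTupleDesc(tupdesc);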
Re: row filtering for logical replication
On Tue, Dec 14, 2021 at 10:12 PM Amit Kapila wrote: > > On Tue, Dec 14, 2021 at 10:50 AM Amit Kapila wrote: > > > > On Tue, Dec 14, 2021 at 4:44 AM Peter Smith wrote: > > > > Few other comments: > > === > > Few more comments: > == > v46-0001/0002 > === > 1. After rowfilter_walker() why do we need > EXPR_KIND_PUBLICATION_WHERE? I thought this is primarily to identify > the expressions that are not allowed in rowfilter which we are now > able to detect upfront with the help of a walker. Can't we instead use > EXPR_KIND_WHERE? FYI - I have tried this locally and all tests pass. ~~ If the EXPR_KIND_PUBLICATION_WHERE is removed then there will be some differences: - we would get errors for aggregate/grouping functions from the EXPR_KIND_WHERE - we would get errors for windows functions from the EXPR_KIND_WHERE - we would get errors for set-returning functions from the EXPR_KIND_WHERE Actually, IMO this would be a *good* change because AFAIK those are not all being checked by the row-filter walker. I think the only reason all tests pass is that there are no specific regression tests for these cases. OTOH, there would also be a difference where an error message would not be as nice. Please see the review comment from Vignesh. [1] The improved error message is only possible by checking the EXPR_KIND_PUBLICATION_WHERE. ~~ I think the best thing to do here is to leave the EXPR_KIND_PUBLICATION_WHERE but simplify code so that the improved error message remains as the *only* difference in behaviour from the EXPR_KIND_WHERE. i.e. we should let the other aggregate/grouping/windows/set function checks give errors exactly the same as for the EXPR_KIND_WHERE case. -- [1] https://www.postgresql.org/message-id/CALDaNm08Ynr_FzNg%2BdoHj%3D_nBet%2BKZAvNbqmkEEw7M2SPpPEAw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set
On Mon, Dec 13, 2021 at 10:14:49PM -0800, Andres Freund wrote: > Tom's point was that the buildfarm scripts do > if ($self->{bfconf}->{using_msvc}) > @checklog = run_log("perl vcregress.pl upgradecheck"); > else > "cd $self->{pgsql}/src/bin/pg_upgrade && $make > $instflags check"; > > if we don't want to break every buildfarm member that has TestUpgrade enabled > the moment this is committed, we need to have a backward compat path. Missed that, thanks! I'll think about all that a bit more before sending a long-overdue rebased version. -- Michael signature.asc Description: PGP signature
Re: Life cycles of tuple descriptors
On 12/14/21 20:02, Tom Lane wrote: > The API contract for lookup_rowtype_tupdesc specifies that you must "call > ReleaseTupleDesc or DecrTupleDescRefCount when done using the tupdesc". > It's safe to assume that the tupdesc will stick around as long as you > haven't done that. I think what threw me was having a function whose API contract mentions reference counts, but that sometimes gives me things that don't have them. But I guess, making the between-the-lines of the contract explicit, if lookup_rowtype_tupdesc is contracted to give me a tupdesc that sticks around for as long as I haven't called ReleaseTupleDesc, and it sometimes elects to give me one for which ReleaseTupleDesc is a no-op, the contract is still that the thing sticks around for (at least) as long as I haven't done that. Cool. :) Oh, hmm, maybe one thing in that API comment ought to be changed. It says I must call ReleaseTupleDesc *or* DecrTupleDescRefCount. Maybe that dates from before the shared registry? ReleaseTupleDesc is safe, but anybody who uses DecrTupleDescRefCount on a lookup_rowtype_tupdesc result could be in for an assertion failure if a non-refcounted tupdesc is returned. Regards, -Chap
Re: Life cycles of tuple descriptors
Chapman Flack writes: > On 12/14/21 20:02, Tom Lane wrote: >> The API contract for lookup_rowtype_tupdesc specifies that you must "call >> ReleaseTupleDesc or DecrTupleDescRefCount when done using the tupdesc". >> It's safe to assume that the tupdesc will stick around as long as you >> haven't done that. > I think what threw me was having a function whose API contract mentions > reference counts, but that sometimes gives me things that don't have them. That's supposed to be hidden under ReleaseTupleDesc; you shouldn't have to think about it. > Oh, hmm, maybe one thing in that API comment ought to be changed. It says > I must call ReleaseTupleDesc *or* DecrTupleDescRefCount. Maybe that dates > from before the shared registry? ReleaseTupleDesc is safe, but anybody who > uses DecrTupleDescRefCount on a lookup_rowtype_tupdesc result could be > in for an assertion failure if a non-refcounted tupdesc is returned. Yeah, I was just wondering the same. I think DecrTupleDescRefCount is safe if you know you are looking up a named composite type, but maybe that's still too much familiarity with typcache innards. regards, tom lane
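The distinction is visible in the macro itself; roughly (paraphrased from src/include/access/tupdesc.h, not copied verbatim):

/* Roughly what tupdesc.h provides (paraphrase): */
#define ReleaseTupleDesc(tupdesc) \
    do { \
        if ((tupdesc)->tdrefcount >= 0) \
            DecrTupleDescRefCount(tupdesc); \
    } while (0)

/*
 * DecrTupleDescRefCount, by contrast, asserts that the tupdesc really is
 * refcounted (tdrefcount > 0), so calling it on a registry-owned RECORD
 * tupdesc with tdrefcount == -1 trips an assertion in a cassert build.
 */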
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 2:35 PM Greg Nancarrow wrote: > > On Tue, Dec 14, 2021 at 3:23 PM vignesh C wrote: > > > > While the worker is skipping one of the skip transactions specified by > > the user and immediately if the user specifies another skip > > transaction while the skipping of the transaction is in progress this > > new value will be reset by the worker while clearing the skip xid. I > > felt once the worker has identified the skip xid and is about to skip > > the xid, the worker can acquire a lock to prevent concurrency issues: > > That's a good point. > If only the last_error_xid could be skipped, then this wouldn't be an > issue, right? > If a different xid to skip is specified while the worker is currently > skipping a transaction, should that even be allowed? > We don't expect such usage but yes, it could happen and seems not good. I thought we can acquire Share lock on pg_subscription during the skip but not sure it's a good idea. It would be better if we can find a way to allow users to specify only XID that has failed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: WIN32 pg_import_system_collations
On Wed, Dec 15, 2021 at 10:45:28AM +1300, Thomas Munro wrote: > Ah, right. I hope we can make the leap to 0x0A00 (Win10) soon and > just stop thinking about these old ghosts, as mentioned by various > people in various threads. Seeing your message here... My apologies for the short digression. Would that mean that we could use CreateSymbolicLinkA() as a mapper for pgreadlink() rather than junction points? I am wondering how much code in src/port/ such a move would allow us to remove. -- Michael
Re: parallel vacuum comments
On Mon, Dec 13, 2021 at 7:03 PM Amit Kapila wrote: > Thanks, I can take care of this before committing. The v9-0001* looks > good to me as well, so, I am planning to commit that tomorrow unless I > see more comments or any objection to that. I would like to thank both Masahiko and yourself for working on this. It's important. > There is still pending > work related to moving parallel vacuum code to a separate file and a > few other pending comments that are still under discussion. We can > take care of those in subsequent patches. Do, let me know if you or > others think differently? I'm +1 on moving it into a new file. I think that that division makes perfect sense. It will make the design of parallel VACUUM easier to understand. I believe that index vacuuming (whether or not it involves parallel workers) ought to be a more or less distinct operation to heap vacuuming. With a distinct autovacuum schedule (well, the schedule would be related, but still distinct). -- Peter Geoghegan
Re: parallel vacuum comments
On Wed, Dec 15, 2021 at 8:23 AM Peter Geoghegan wrote: > > On Mon, Dec 13, 2021 at 7:03 PM Amit Kapila wrote: > > Thanks, I can take care of this before committing. The v9-0001* looks > > good to me as well, so, I am planning to commit that tomorrow unless I > > see more comments or any objection to that. > > I would like to thank both Masahiko and yourself for working on this. > It's important. > > > There is still pending > > work related to moving parallel vacuum code to a separate file and a > > few other pending comments that are still under discussion. We can > > take care of those in subsequent patches. Do, let me know if you or > > others think differently? > > I'm +1 on moving it into a new file. I think that that division makes > perfect sense. It will make the design of parallel VACUUM easier to > understand. > Agreed and thanks for your support. -- With Regards, Amit Kapila.
Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?
At Tue, 14 Dec 2021 19:04:09 +0530, Bharath Rupireddy wrote in > On Fri, Dec 3, 2021 at 7:39 PM Bharath Rupireddy > wrote: > > > > On Wed, Dec 1, 2021 at 9:50 PM Alvaro Herrera > > wrote: > > > I think the PIDs are log-worthy for sure, but it's not clear to me that > > > it is desirable to write them to the persistent state file. In case of > > > crashes, the log should serve just fine to aid root cause investigation > > > -- in fact even better than the persistent file, where the data would be > > > lost as soon as the next client acquires that slot. > > > > Thanks. +1 to log a message at LOG level whenever a replication slot > > becomes active (gets assigned a valid pid to active_pid) and becomes > > inactive(gets assigned 0 to active_pid). Having said that, isn't it > > also helpful if we write a bool (1 byte character) whenever the slot > > becomes active and inactive to the disk? > > Here's the patch that adds a LOG message whenever a replication slot > becomes active and inactive. These logs will be extremely useful on > production servers to debug and analyze inactive replication slot > issues. > > Thoughts? When I created a replication slot, I saw the following lines in the server log. [22054:client backend] LOG: replication slot "s1" becomes active [22054:client backend] DETAIL: The process with PID 22054 acquired it. [22054:client backend] STATEMENT: select pg_drop_replication_slot('s1'); [22054:client backend] LOG: replication slot "s1" becomes inactive [22054:client backend] DETAIL: The process with PID 22054 released it. [22054:client backend] STATEMENT: select pg_drop_replication_slot('s1'); They seem too verbose, as if they were DEBUG3 lines. The PID shown is that of the process performing the slot operation itself, so I think it conveys no information. The STATEMENT lines are also noisy for non-ERROR messages. Couldn't we hide that line? That is, how about making the log lines as simple as the following? [17156:walsender] LOG: acquired replication slot "s1" [17156:walsender] LOG: released replication slot "s1" In the first place, I don't think we need these log lines at slot creation, since that is not really an acquisition or release of a slot. It behaves in a strange way when executing pg_basebackup. [22864:walsender] LOG: replication slot "pg_basebackup_22864" becomes active [22864:walsender] DETAIL: The process with PID 22864 acquired it. [22864:walsender] STATEMENT: CREATE_REPLICATION_SLOT "pg_basebackup_22864" TEMPORARY PHYSICAL ( RESERVE_WAL) [22864:walsender] LOG: replication slot "pg_basebackup_22864" becomes active [22864:walsender] DETAIL: The process with PID 22864 acquired it. [22864:walsender] STATEMENT: START_REPLICATION SLOT "pg_basebackup_22864" 0/600 TIMELINE 1 [22864:walsender] LOG: replication slot "pg_basebackup_22864" becomes inactive [22864:walsender] DETAIL: The process with PID 22864 released it. The slot is acquired twice, then released once. That is because the patch doesn't emit the "becomes inactive" line when releasing a temporary slot. However, I'd rather say we don't need the first "becomes active" line, as in the previous example. @@ -658,6 +690,13 @@ ReplicationSlotDropPtr(ReplicationSlot *slot) slot->active_pid = 0; slot->in_use = false; LWLockRelease(ReplicationSlotControlLock); + + if (pid > 0) + ereport(LOG, + (errmsg("replication slot \"%s\" becomes inactive", + NameStr(slot->data.name)), +errdetail("The process with PID %d released it.", pid))); + This is wrong: I see a "becomes inactive" message when I drop an "inactive" replication slot.
The reason the inactive slot looks as if it were acquired is that it is temporarily acquired as a preparatory step of dropping. Even assuming that the log lines are simplified to this extent, I still find it a bit strange that the "becomes active (or acquired)" message appears alone, without a message like "replication connection accepted". But that would be a separate issue even if it is true. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
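A sketch of what the simplified logging suggested above could look like; the helper functions are illustrative only, assuming a ReplicationSlot pointer in scope, and errhidestmt() is the existing way to keep the STATEMENT line out of such messages.

#include "postgres.h"
#include "replication/slot.h"

/* Illustrative helpers, not the patch's actual code. */
static void
log_slot_acquired(ReplicationSlot *slot)
{
    ereport(LOG,
            (errmsg("acquired replication slot \"%s\"",
                    NameStr(slot->data.name)),
             errhidestmt(true)));   /* suppress the STATEMENT line */
}

static void
log_slot_released(ReplicationSlot *slot)
{
    ereport(LOG,
            (errmsg("released replication slot \"%s\"",
                    NameStr(slot->data.name)),
             errhidestmt(true)));
}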
Re: Confused comment about drop replica identity index
On Tue, Dec 14, 2021 at 07:10:49PM +0530, Ashutosh Bapat wrote: > This code in RelationGetIndexList() is not according to that comment. > >if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex)) > relation->rd_replidindex = pkeyIndex; > else if (replident == REPLICA_IDENTITY_INDEX && > OidIsValid(candidateIndex)) > relation->rd_replidindex = candidateIndex; > else > relation->rd_replidindex = InvalidOid; Yeah, the comment is wrong. If the index of a REPLICA_IDENTITY_INDEX is dropped, I recall that the behavior is the same as REPLICA_IDENTITY_NOTHING. > Comment in code is one thing, but I think PG documentation is not > covering the use case you tried. What happens when a replica identity > index is dropped has not been covered either in ALTER TABLE > https://www.postgresql.org/docs/13/sql-altertable.html or DROP INDEX > https://www.postgresql.org/docs/14/sql-dropindex.html documentation. Not sure about the DROP INDEX page, but I'd be fine with mentioning that in the ALTER TABLE page in the paragraph related to REPLICA IDENTITY. While on it, I would be tempted to switch this stuff to use a proper list for all the option values. That would be much easier to read. [ ... thinks a bit ... ] FWIW, this brings back some memories, as in this thread: https://www.postgresql.org/message-id/20200522035028.go2...@paquier.xyz See also commit fe7fd4e from August 2020, where some tests have been added. I recall seeing this incorrect comment in last year's thread, and it may have been mentioned in one of the surrounding threads. Maybe I just let it go back then. I don't know. -- Michael
Re: row filtering for logical replication
On Wed, Dec 15, 2021 at 6:47 AM Peter Smith wrote: > > On Tue, Dec 14, 2021 at 10:12 PM Amit Kapila wrote: > > > > On Tue, Dec 14, 2021 at 10:50 AM Amit Kapila > > wrote: > > > > > > On Tue, Dec 14, 2021 at 4:44 AM Peter Smith wrote: > > > > > > Few other comments: > > > === > > > > Few more comments: > > == > > v46-0001/0002 > > === > > 1. After rowfilter_walker() why do we need > > EXPR_KIND_PUBLICATION_WHERE? I thought this is primarily to identify > > the expressions that are not allowed in rowfilter which we are now > > able to detect upfront with the help of a walker. Can't we instead use > > EXPR_KIND_WHERE? > > FYI - I have tried this locally and all tests pass. > > ~~ > > If the EXPR_KIND_PUBLICATION_WHERE is removed then there will be some > differences: > - we would get errors for aggregate/grouping functions from the > EXPR_KIND_WHERE > - we would get errors for windows functions from the EXPR_KIND_WHERE > - we would get errors for set-returning functions from the EXPR_KIND_WHERE > > Actually, IMO this would be a *good* change because AFAIK those are > not all being checked by the row-filter walker. I think the only > reason all tests pass is that there are no specific regression tests > for these cases. > > OTOH, there would also be a difference where an error message would > not be as nice. Please see the review comment from Vignesh. [1] The > improved error message is only possible by checking the > EXPR_KIND_PUBLICATION_WHERE. > > ~~ > > I think the best thing to do here is to leave the > EXPR_KIND_PUBLICATION_WHERE but simplify code so that the improved > error message remains as the *only* difference in behaviour from the > EXPR_KIND_WHERE. i.e. we should let the other > aggregate/grouping/windows/set function checks give errors exactly the > same as for the EXPR_KIND_WHERE case. > I am not sure if "the better error message" is a good enough reason to introduce this new kind. I thought it is better to deal with that in rowfilter_walker. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 4:54 PM Amit Kapila wrote: > > > Actually, I was trying to understand the use case for skipping while > > streaming. Actually, during streaming we are not doing any database > > operation that means this will not generate any error. > > > > Say, there is an error the first time when we start to apply changes > for such a transaction. So, such a transaction will be streamed again. > Say, the user has set the skip_xid before we stream a second time, so > this time, we can skip it either during the stream phase or apply > phase. I think the patch is skipping it during apply phase. > Sawada-San, please confirm if my understanding is correct? > Got it, thanks for clarifying. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: WIN32 pg_import_system_collations
On Wed, Dec 15, 2021 at 3:52 PM Michael Paquier wrote: > On Wed, Dec 15, 2021 at 10:45:28AM +1300, Thomas Munro wrote: > > Ah, right. I hope we can make the leap to 0x0A00 (Win10) soon and > > just stop thinking about these old ghosts, as mentioned by various > > people in various threads. > > Seeing your message here.. My apologies for the short digression. > Would that mean that we could use CreateSymbolicLinkA() as a mapper > for pgreadlink() rather than junction points? I am wondering how much > code in src/port/ such a move could allow us to do. Sadly, (1) it wouldn't work unless running with a special privilege or as admin, and (2) it wouldn't work on non-NTFS filesystems. I think it's mostly intended to allow things like unpacking tarballs, checking out git repos etc etc etc that came from Unix systems, which is why it works with 'developer mode' enabled[1], though obviously it wouldn't be totally impossible for us to require that privilege. Didn't seem great to me, though, that's why I gave up on it over in https://commitfest.postgresql.org/36/3090/ where this was recently discussed. [1] https://blogs.windows.com/windowsdeveloper/2016/12/02/symlinks-windows-10/
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 15, 2021 at 8:19 AM Masahiko Sawada wrote: > > On Tue, Dec 14, 2021 at 2:35 PM Greg Nancarrow wrote: > > > > On Tue, Dec 14, 2021 at 3:23 PM vignesh C wrote: > > > > > > While the worker is skipping one of the skip transactions specified by > > > the user and immediately if the user specifies another skip > > > transaction while the skipping of the transaction is in progress this > > > new value will be reset by the worker while clearing the skip xid. I > > > felt once the worker has identified the skip xid and is about to skip > > > the xid, the worker can acquire a lock to prevent concurrency issues: > > > > That's a good point. > > If only the last_error_xid could be skipped, then this wouldn't be an > > issue, right? > > If a different xid to skip is specified while the worker is currently > > skipping a transaction, should that even be allowed? > > > > We don't expect such usage but yes, it could happen and seems not > good. I thought we can acquire Share lock on pg_subscription during > the skip but not sure it's a good idea. It would be better if we can > find a way to allow users to specify only XID that has failed. > Yeah, but as we don't have a definite way to allow specifying only failed XID, I think it is better to use share lock on that particular subscription. We are already using it for add/update rel state (see, AddSubscriptionRelState, UpdateSubscriptionRelState), so this will be another place to use a similar technique. -- With Regards, Amit Kapila.
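For reference, the locking pattern being referred to looks roughly like the sketch below; the function is illustrative only, and it assumes that subscription DDL takes a conflicting (exclusive) lock on the same shared catalog object, as it does around pg_subscription updates.

#include "postgres.h"
#include "catalog/pg_subscription.h"
#include "storage/lmgr.h"

/*
 * Sketch only (not the patch's actual code): hold a shared lock on the
 * subscription object while skipping a transaction, in the same way
 * AddSubscriptionRelState/UpdateSubscriptionRelState lock it, so that a
 * concurrent ALTER SUBSCRIPTION cannot change skip_xid underneath the
 * apply worker.
 */
static void
lock_subscription_for_skip(Oid subid)
{
    LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);

    /* ... skip the remote transaction's changes and clear skip_xid ... */

    /* The lock is released automatically at transaction end. */
}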
Re: more descriptive message for process termination due to max_slot_wal_keep_size
At Tue, 14 Dec 2021 19:31:21 +0530, Ashutosh Bapat wrote in > On Tue, Dec 14, 2021 at 9:35 AM Kyotaro Horiguchi > wrote: > > > [17605] LOG: terminating process 17614 to release replication slot "s1" > > + [17605] DETAIL: The slot's restart_lsn 0/2CA0 exceeds > > max_slot_wal_keep_size. > > > [17614] FATAL: terminating connection due to administrator command > > > [17605] LOG: invalidating slot "s1" because its restart_lsn 0/2CA0 > > > exceeds max_slot_wal_keep_size > > > > Somewhat the second and fourth lines look inconsistent each other but > > that wouldn't be such a problem. I don't think we want to concatenate > > the two lines together as the result is a bit too long. > > > > > LOG: terminating process 17614 to release replication slot "s1" because > > > it's restart_lsn 0/2CA0 exceeds max_slot_wal_keep_size. > > > > What do you think about this? > > Agree. I think we should also specify the restart_lsn value which > would be within max_slot_wal_keep_size for better understanding. Thanks! It seems to me the main message of the "invalidating" log has no room for further detail. So I split the reason out to DETAILS line the same way with the "terminating" message in the attached second patch. (It is separated from the first patch just for review) I believe someone can make the DETAIL message simpler or more natural. The attached patch set emits the following message. > LOG: invalidating slot "s1" > DETAIL: The slot's restart_lsn 0/1D68 is behind the limit 0/1100 > defined by max_slot_wal_keep_size. The second line could be changed like the following or anything other. > DETAIL: The slot's restart_lsn 0/1D68 got behind the limit 0/1100 > determined by max_slot_wal_keep_size. . regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 6909d854a0d48f2504cbb6dfbc7eb571fd32bcd8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 14 Dec 2021 10:50:00 +0900 Subject: [PATCH v2 1/2] Make an error message about process termination more descriptive If checkpointer kills a process due to a temporary replication slot exceeding max_slot_wal_keep_size, the messages fails to describe the cause of the termination. It is finally shown for persistent slots in a subsequent message, but that message doesn't show for temporary slots. It's better to attach an explainaion to the termination message directly. 
Reported-by: Alex Enachioaie Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org --- src/backend/replication/slot.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 90ba9b417d..cba9a29113 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, { ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\"", -active_pid, NameStr(slotname; +active_pid, NameStr(slotname)), + errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn; (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; -- 2.27.0 >From 265a84fa6a20ee7c5178ace41096df1a4b00465c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 15 Dec 2021 13:08:20 +0900 Subject: [PATCH v2 2/2] Make the message further detailed --- src/backend/replication/slot.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index cba9a29113..695151e484 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1255,7 +1255,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ereport(LOG, (errmsg("terminating process %d to release replication slot \"%s\"", active_pid, NameStr(slotname)), - errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn; + errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN; (void) kill(active_pid, SIGTERM); last_signaled_pid = active_pid; @@ -1292,9 +1294,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN, ReplicationSlotRelease(); ereport(LOG, - (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size", - NameStr(slotname), - LSN_FORMAT_ARGS(restart_lsn; + (errmsg("invalidating slot \"%s\"", + NameStr(slotname)), + errdetail("The slot's restart_lsn %X/%X is behind the limit %X/%X defined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(restart_lsn), + LSN_FORMAT_ARGS(oldestLSN; /* done with this
Re: Skipping logical replication transactions on subscriber side
On Tue, Dec 14, 2021 at 11:40 AM Dilip Kumar wrote: > > On Fri, Dec 3, 2021 at 12:12 PM Masahiko Sawada wrote: > > > > Skipping a whole transaction by specifying xid would be a good start. > > Ideally, we'd like to automatically skip only operations within the > > transaction that fail but it seems not easy to achieve. If we allow > > specifying operations and/or relations, probably multiple operations > > or relations need to be specified in some cases. Otherwise, the > > subscriber cannot continue logical replication if the transaction has > > multiple operations on different relations that fail. But similar to > > the idea of specifying multiple xids, we need to note the fact that > > user wouldn't know of the second operation failure unless the apply > > worker applies the change. So I'm not sure there are many use cases in > > practice where users can specify multiple operations and relations in > > order to skip applies that fail. > > I think there would be use cases for specifying the relations or > operation, e.g. if the user finds an issue in inserting in a > particular relation then maybe based on some manual investigation he > founds that the table has some constraint due to that it is failing on > the subscriber side but on the publisher side that constraint is not > there so maybe the user is okay to skip the changes for this table and > not for other tables, or there might be a few more tables which are > designed based on the same principle and can have similar error so > isn't it good to provide an option to give the list of all such > tables. > That's right and I agree there could be some use case for it and even specifying the operation but I think we can always extend the existing feature for it if the need arises. Note that the user can anyway only specify a single relation or an operation because there is a way to know only one error and till that is resolved, the apply process won't proceed. We have discussed providing these additional options in this thread but thought of doing it later once we have the base feature and based on the feedback from users. -- With Regards, Amit Kapila.
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 15, 2021 at 9:46 AM Amit Kapila wrote: > > On Tue, Dec 14, 2021 at 11:40 AM Dilip Kumar wrote: > > That's right and I agree there could be some use case for it and even > specifying the operation but I think we can always extend the existing > feature for it if the need arises. Note that the user can anyway only > specify a single relation or an operation because there is a way to > know only one error and till that is resolved, the apply process won't > proceed. We have discussed providing these additional options in this > thread but thought of doing it later once we have the base feature and > based on the feedback from users. Yeah, I only wanted to make the point that this could be useful, it seems we are on the same page. I agree we can extend it in the future as well. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: row filtering for logical replication
On Mon, Dec 13, 2021 at 8:49 PM Peter Smith wrote: > > PSA the v46* patch set. > 0001 (1) "If a subscriber is a pre-15 version, the initial table synchronization won't use row filters even if they are defined in the publisher." Won't this lead to data inconsistencies or errors that otherwise wouldn't happen? Should such subscriptions be allowed? (2) In the 0001 patch comment, the term "publication filter" is used in one place, and in others "row filter" or "row-filter". src/backend/catalog/pg_publication.c (3) GetTransformedWhereClause() is missing a function comment. (4) The following comment seems incomplete: + /* Fix up collation information */ + whereclause = GetTransformedWhereClause(pstate, pri, true); src/backend/parser/parse_relation.c (5) wording? consistent? Shouldn't it be "publication WHERE expression" for consistency? + errmsg("publication row-filter WHERE invalid reference to table \"%s\"", + relation->relname), src/backend/replication/logical/tablesync.c (6) (i) Improve wording: BEFORE: /* * Get information about remote relation in similar fashion the RELATION - * message provides during replication. + * message provides during replication. This function also returns the relation + * qualifications to be used in COPY command. */ AFTER: /* - * Get information about remote relation in similar fashion the RELATION - * message provides during replication. + * Get information about a remote relation, in a similar fashion to how the RELATION + * message provides information during replication. This function also returns the relation + * qualifications to be used in the COPY command. */ (ii) fetch_remote_table_info() doesn't currently account for ALL TABLES and ALL TABLES IN SCHEMA. src/backend/replication/pgoutput/pgoutput.c (7) pgoutput_tow_filter() I think that the "ExecDropSingleTupleTableSlot(entry->scantuple);" is not needed in pgoutput_tow_filter() - I don't think it can be non-NULL when entry->exprstate_valid is false (8) I am a little unsure about this "combine filters on copy (irrespective of pubaction)" functionality. What if a filter is specified and the only pubaction is DELETE? 0002 src/backend/catalog/pg_publication.c (1) rowfilter_walker() One of the errdetail messages doesn't begin with an uppercase letter: + errdetail_msg = _("user-defined types are not allowed"); src/backend/executor/execReplication.c (2) CheckCmdReplicaIdentity() Strictly speaking, the following: + if (invalid_rfcolnum) should be: + if (invalid_rfcolnum != InvalidAttrNumber) 0003 src/backend/replication/logical/tablesync.c (1) Column name in comment should be "puballtables" not "puballtable": + * If any publication has puballtable true then all row-filtering is (2) pgoutput_row_filter_init() There should be a space before the final "*/" (so the asterisks align). Also, should say "... treated the same". /* + * If the publication is FOR ALL TABLES then it is treated same as if this + * table has no filters (even if for some other publication it does). + */ Regards, Greg Nancarrow Fujitsu Australia
Re: Skipping logical replication transactions on subscriber side
On Wed, Dec 15, 2021 at 1:49 PM Masahiko Sawada wrote: > > We don't expect such usage but yes, it could happen and seems not > good. I thought we can acquire Share lock on pg_subscription during > the skip but not sure it's a good idea. It would be better if we can > find a way to allow users to specify only XID that has failed. > Yes, I agree that would be better. If you didn't do that, I think you'd need to queue the XIDs to be skipped (rather than locking). Regards, Greg Nancarrow Fujitsu Australia
Re: Failed transaction statistics to measure the logical replication progress
On Tue, Dec 14, 2021 at 11:28 AM Amit Kapila wrote: > > On Mon, Dec 13, 2021 at 5:48 PM osumi.takami...@fujitsu.com > wrote: > > > > On Monday, December 13, 2021 6:19 PM Amit Kapila > > wrote: > > > On Tue, Dec 7, 2021 at 3:12 PM osumi.takami...@fujitsu.com > > > wrote: > > > > > > Few questions and comments: > > Thank you for your comments ! > > > > > > > > 1. > > > The pg_stat_subscription_workers view will > > > contain > > > one row per subscription worker on which errors have occurred, for > > > workers > > > applying logical replication changes and workers handling the initial > > > data > > > - copy of the subscribed tables. The statistics entry is removed when > > > the > > > - corresponding subscription is dropped. > > > + copy of the subscribed tables. Also, the row corresponding to the > > > apply > > > + worker shows all transaction statistics of both types of workers on > > > the > > > + subscription. The statistics entry is removed when the corresponding > > > + subscription is dropped. > > > > > > Why did you choose to show stats for both types of workers in one row? > > This is because if we have hundreds or thousands of tables for table sync, > > we need to create many entries to cover them and store the entries for all > > tables. > > > > If we fear a large number of entries for such workers then won't it be > better to show the value of these stats only for apply workers. I > think normally the table sync workers perform only copy operation or > maybe a fixed number of xacts, so, one might not be interested in the > transaction stats of these workers. I find merging only specific stats > of two different types of workers confusing. > > What do others think about this? I understand the concern to have a large number of entries but I agree that merging only specific stats would confuse users. As Amit suggested, it'd be better to show only apply workers' transaction stats. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: generalized conveyor belt storage
On Wed, Dec 15, 2021 at 6:33 AM Peter Geoghegan wrote: > > How did you test this? I ask because it would be nice if there was a > convenient way to try this out, as somebody with a general interest. > Even just a minimal test module, that you used for development work. > I have tested this using a simple extension over the conveyor belt APIs; the extension provides wrapper APIs over them. These are basic APIs which can be extended further for more detailed testing; e.g., as of now I have provided an API to read a complete page from the conveyor belt, but that could easily be extended to read from a particular offset and to control the amount of data to read. In parallel, I am also testing this by integrating it with vacuum, which is still a completely WIP patch and needs a lot of design-level improvement, so I am not sharing it yet; once it is in better shape I will post it in a separate thread. Basically, we have decoupled the different vacuum phases (only for manual vacuum), something like below: VACUUM (first_pass) t; VACUUM idx; VACUUM (second_pass) t; So in the first pass we just do the first pass of vacuum, and wherever we would call lazy_vacuum() we store those dead tids in the conveyor belt. In the index pass, the user can vacuum an individual index; it fetches the last conveyor belt point up to which that index has already been vacuumed, loads from there as many dead tids as fit in maintenance_work_mem, and calls the index bulk delete (this is done in a loop until the index vacuum is complete). In the second pass, we check all the indexes and find the minimum conveyor belt point up to which all indexes have been vacuumed. We also fetch the last point where we left off the second pass of the heap. Then we fetch the dead tids from the conveyor belt (as many as fit in maintenance_work_mem) from the heap's last vacuum point up to the minimum index vacuum point, and perform the second heap pass. I have given the highlights of the decoupling work just to show what sort of testing we are doing for the conveyor belt. But we can discuss this on a separate thread when I am ready to post that patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From b6affd2a778b7b4cff5738ad99f34ea21a816562 Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Wed, 15 Dec 2021 10:28:49 +0530 Subject: [PATCH v1] Conveyor belt testing extention This extention provide wrapper over the conveyor belt infrastructure for testing the conveyor belt.
--- contrib/pg_conveyor/Makefile | 23 +++ contrib/pg_conveyor/expected/pg_conveyor.out | 185 contrib/pg_conveyor/pg_conveyor--1.0.sql | 32 + contrib/pg_conveyor/pg_conveyor.c| 207 +++ contrib/pg_conveyor/pg_conveyor.control | 5 + contrib/pg_conveyor/sql/pg_conveyor.sql | 125 src/common/relpath.c | 3 +- src/include/common/relpath.h | 5 +- 8 files changed, 582 insertions(+), 3 deletions(-) create mode 100644 contrib/pg_conveyor/Makefile create mode 100644 contrib/pg_conveyor/expected/pg_conveyor.out create mode 100644 contrib/pg_conveyor/pg_conveyor--1.0.sql create mode 100644 contrib/pg_conveyor/pg_conveyor.c create mode 100644 contrib/pg_conveyor/pg_conveyor.control create mode 100644 contrib/pg_conveyor/sql/pg_conveyor.sql diff --git a/contrib/pg_conveyor/Makefile b/contrib/pg_conveyor/Makefile new file mode 100644 index 000..8c29ffd --- /dev/null +++ b/contrib/pg_conveyor/Makefile @@ -0,0 +1,23 @@ +# contrib/pg_conveyor/Makefile + +MODULE_big = pg_conveyor +OBJS = \ + $(WIN32RES) \ + pg_conveyor.o + +EXTENSION = pg_conveyor +DATA = pg_conveyor--1.0.sql +PGFILEDESC = "pg_conveyor - conveyor belt test" + +REGRESS = pg_conveyor + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/pg_conveyor +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/pg_conveyor/expected/pg_conveyor.out b/contrib/pg_conveyor/expected/pg_conveyor.out new file mode 100644 index 000..6ae5cd3 --- /dev/null +++ b/contrib/pg_conveyor/expected/pg_conveyor.out @@ -0,0 +1,185 @@ +CREATE EXTENSION pg_conveyor; +CREATE TABLE test(a int); +SELECT pg_conveyor_init('test'::regclass::oid, 4); + pg_conveyor_init +-- + +(1 row) + +SELECT pg_conveyor_insert('test'::regclass::oid, 'test_data'); + pg_conveyor_insert + + +(1 row) + +SELECT pg_conveyor_read('test'::regclass::oid, 0); + pg_conveyor_read +-- + test_data +(1 row) + +--CASE1 +do $$ +<> +declare + i int := 0; + data varchar; +begin + for i in 1..1000 loop + data := 'test_dataa' || i; + PERFORM pg_conveyor_insert('test'::regclass::oid, data); + end loop; +end first_block $$; +-- read from some random blocks +
Re: row filtering for logical replication
On Wed, Dec 15, 2021 at 10:20 AM Greg Nancarrow wrote: > > On Mon, Dec 13, 2021 at 8:49 PM Peter Smith wrote: > > > > PSA the v46* patch set. > > > > 0001 > > (1) > > "If a subscriber is a pre-15 version, the initial table > synchronization won't use row filters even if they are defined in the > publisher." > > Won't this lead to data inconsistencies or errors that otherwise > wouldn't happen? > How? The subscribers will get all the initial data. > Should such subscriptions be allowed? > I am not sure what you have in mind here? How can we change the already released code pre-15 for this new feature? -- With Regards, Amit Kapila.
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Kato-san, Thank you for the comments! And sorry for the late reply. I rebased my patches. > Even for local-only transaction, I thought it useless to execute > CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I > make it so that it determines outside whether it contains SQL to the > remote or not? Yeah, the remote-checking timeout will be enabled even if a local transaction is opened. In my understanding, postgres cannot distinguish whether open transactions are using only local objects or not. My first idea was to turn on the timeout when the GetFdwRoutineXXX functions were called, but in some cases FDWs may not use a remote connection even if they call such a handler function, e.g. postgresExplainForeignScan(). Hence I tried another approach that unregisters all checking callbacks at the end of each transaction. Only FDWs can know whether a transaction is remote or local, so they must register the callback when they really connect to another database. This implementation avoids the remote check in the case of a local transaction, but the repeated registering and unregistering may introduce another overhead. I attached patches which implement that. > The following points are minor. > 1. In foreign.c, some of the lines are too long and should be broken. > 2. In pqcomm.c, the newline have been removed, what is the intention of > this? > 3. In postgres.c, > 3-1. how about inserting a comment between lines 2713 and 2714, similar > to line 2707? > 3-2. the line breaks in line 3242 and line 3243 should be aligned. > 3-3. you should change > /* > * Skip checking foreign servers while reading > messages. > */ > to > /* >* Skip checking foreign servers while reading > messages. >*/ > 4. In connection.c, There is a typo in line 1684, so "fucntion" should > be changed to "function". Maybe all of them have been fixed. Thanks! What do you think? Best Regards, Hayato Kuroda FUJITSU LIMITED Attachments: v05_0001_add_checking_infrastracture.patch, v05_0002_add_doc.patch
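A rough sketch of the register-on-connect / unregister-at-transaction-end flow described above; every function and type name here is a placeholder, not the actual API in the attached patches.

#include "postgres.h"
#include "access/xact.h"

/* Placeholder types/functions -- not the names used in the attached patches. */
typedef void (*RemoteServerCheckCallback) (void *arg);
extern void RegisterRemoteServerCheck(RemoteServerCheckCallback cb, void *arg);
extern void UnregisterRemoteServerCheck(void *arg);

static void my_fdw_check_server(void *conn_handle);    /* pings the remote side */

/* Register the check only once a remote connection really exists. */
static void
my_fdw_get_connection(void *conn_handle)
{
    /* ... establish or reuse the remote connection ... */
    RegisterRemoteServerCheck(my_fdw_check_server, conn_handle);
}

/*
 * Drop the check again at end of transaction, so purely local transactions
 * that never connected to a foreign server are never checked.
 */
static void
my_fdw_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
        UnregisterRemoteServerCheck(arg);
}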