pg can create a duplicated index without any error or even a warning
postgres=# create table t (a int, b int);
CREATE TABLE
postgres=# create index m on t(a);
CREATE INDEX
postgres=# create index m2 on t(a);
CREATE INDEX
postgres=# \d t
                 Table "demo.t"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Indexes:
    "m" btree (a)
    "m2" btree (a)

Is this by design?
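For what it's worth, one way to spot duplicates like this after the fact is to group the entries in pg_indexes by their definition. This is only a rough sketch: it normalizes indexdef by stripping the index name, so it can be fooled by unusual definitions, but it catches the simple case above.

SELECT schemaname, tablename,
       array_agg(indexname) AS duplicate_indexes
FROM pg_indexes
GROUP BY schemaname, tablename,
         regexp_replace(indexdef, 'INDEX \S+ ON', 'INDEX ON')
HAVING count(*) > 1;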
understand the pg locks in a simple case
I have trouble understanding the pg locks in the following simple situation.

Session 1:

begin; update t set a = 1 where a = 10;

Session 2:

begin; update t set a = 2 where a = 10;

They update the same row, and session 2 is blocked by session 1, without surprise.

The straightforward implementation would be: session 1 locks the tuple in *ExclusiveLock* mode; when session 2 tries to lock it in exclusive mode, it is blocked.

But when I check pg_locks for session 1, I can see *no tuple lock* there; when I check session 2, I can see a *tuple (ExclusiveLock)* that is granted, but it is waiting for a transactionid. Since every tuple carries transaction information, it is not hard to implement it this way, but is there any benefit over the straightforward way? With the current implementation, what is the point of the tuple (ExclusiveLock) for session 2?
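For reference, this is roughly how I inspect the locks of the two sessions while session 2 is blocked (just a sketch; the two pids are placeholders, substitute the real backend pids):

SELECT pid, locktype, relation::regclass, page, tuple,
       transactionid, mode, granted
FROM pg_locks
WHERE pid IN (11111, 22222)   -- hypothetical pids of session 1 and session 2
ORDER BY pid, locktype;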
Serialization questions
Before trying to understand how postgres implements the serializable isolation level (I have seen many papers related to it), I have a question about how it should behave. I mainly read the ideas from https://www.postgresql.org/docs/11/transaction-iso.html.

"In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions."

In repeatable read, every statement uses the transaction start timestamp; is that also the case in the serializable isolation level?

"When relying on Serializable transactions to prevent anomalies, it is important that any data read from a permanent user table not be considered valid until the transaction which read it has successfully committed. This is true even for read-only transactions ..."

What does "not be considered valid" mean? And if it is a read-only transaction (assume T1), I think it is OK to let other transactions do anything with the read set of T1, since it is invisible to T1 (it uses the transaction start time as the statement timestamp).

Thanks
Re: understand the pg locks in a simple case
On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas wrote:
> On 20/08/2019 10:23, Alex wrote:
> > I have trouble understanding the pg locks in the following simple situation.
> >
> > Session 1:
> > begin; update t set a = 1 where a = 10;
> >
> > Session 2:
> > begin; update t set a = 2 where a = 10;
> >
> > They update the same row, and session 2 is blocked by session 1, without surprise.
> >
> > The straightforward implementation would be: session 1 locks the tuple in *ExclusiveLock* mode; when session 2 tries to lock it in exclusive mode, it is blocked.
> >
> > But when I check pg_locks for session 1, I can see *no tuple lock* there; when I check session 2, I can see a *tuple (ExclusiveLock)* that is granted, but it is waiting for a transactionid.
> >
> > Since every tuple carries transaction information, it is not hard to implement it this way, but is there any benefit over the straightforward way? With the current implementation, what is the point of the tuple (ExclusiveLock) for session 2?
>
> The reason that tuple locking works with XIDs, rather than directly acquiring a lock on the tuple, is that the memory allocated for the lock manager is limited. One transaction can lock millions of tuples, and if it had to hold a normal lock on every tuple, you would run out of memory very quickly.

Thank you!

So can I understand it as: we don't need a lock on every tuple we update because 1) the number of locks may be huge, and holding them all would consume a lot of memory; and 2) the tuple header, which includes xid info, is unavoidable anyway due to MVCC requirements and can be used here, so we save the individual locks.

And in my example above, when session 2 is waiting for a xid lock, it is *granted* a tuple lock in ExclusiveLock mode. What is the purpose of this lock?

> So it may seem that we don't need heavy-weight locks on individual tuples at all. But we still need them to establish the order in which backends are waiting. The locking protocol is:
>
> 1. Check if a tuple's xmax is set.
> 2. If it's set, obtain a lock on the tuple's TID.
> 3. Wait for the transaction to finish, by trying to acquire a lock on the XID.
> 4. Update the tuple, release the lock on the XID, and on the TID.
>
> It gets more complicated if there are multixids, or you update a row you have earlier locked in a weaker mode, but that's the gist of it.
>
> We could skip the lock on the tuple's TID, but then if you have multiple backends trying to update or lock a row, it would not be deterministic who gets the lock first. For example:
>
> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
> Session B: UPDATE foo SET col='b' WHERE id = 123;
> Session C: UPDATE foo SET col='c' WHERE id = 123;
> Session A: ROLLBACK;
>
> Without the lock on the TID, it would be indeterministic whether session B or C gets to update the tuple when A rolls back. With the above locking protocol, B will go first. B will acquire the lock on the TID, and block on the XID lock, while C will block on the TID lock held by B. If there were more backends trying to do the same, they would queue for the TID lock, too.
>
> - Heikki
Re: Serialization questions
On Tue, Aug 20, 2019 at 4:47 PM Alex wrote:
> Before trying to understand how postgres implements the serializable isolation level (I have seen many papers related to it), I have a question about how it should behave.
>
> I mainly read the ideas from https://www.postgresql.org/docs/11/transaction-iso.html.
>
> "In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions."
>
> In repeatable read, every statement uses the transaction start timestamp; is that also the case in the serializable isolation level?
>
> "When relying on Serializable transactions to prevent anomalies, it is important that any data read from a permanent user table not be considered valid until the transaction which read it has successfully committed. This is true even for read-only transactions ..."
>
> What does "not be considered valid" mean? And if it is a read-only transaction (assume T1), I think it is OK to let other transactions do anything with the read set of T1, since it is invisible to T1 (it uses the transaction start time as the statement timestamp).

First issue "set default_transaction_isolation to 'serializable';" in both sessions, then run:

Session 1: begin; select * from t; (2 rows selected)
Session 2: delete from t; (committed automatically)
Session 1: commit; (commits successfully)

It looks like the reads in session 1 have no impact on session 2 at all, which seems to conflict with the documentation.

> Thanks
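As far as I understand SSI, a read-write dependency in only one direction (session 1 reads, session 2 deletes) is not enough to force an abort; it needs a dangerous structure, i.e. a cycle of such dependencies. A sketch of a case that does fail, assuming the same two-row table t (the inserted values are arbitrary, and the exact statement that raises the error can vary):

Session 1: begin isolation level serializable;
Session 1: select sum(a) from t;
Session 2: begin isolation level serializable;
Session 2: select sum(a) from t;
Session 1: insert into t values (100);
Session 2: insert into t values (200);
Session 1: commit;   -- succeeds
Session 2: commit;   -- ERROR: could not serialize access due to read/write dependencies among transactions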
when does the IndexScan reset to the next ScanKey for the IN operator
given the following example:

postgres=# create table t2 as select generate_series(1, 10) as a, generate_series(1, 10) as b;
SELECT 10
postgres=# create index t2_idx on t2(a);
CREATE INDEX
postgres=# set enable_seqscan = 0;
SET
postgres=# select * from t2 where a in (1, 10);
 a  | b
----+----
  1 |  1
 10 | 10
(2 rows)

I can see the plan stores the "1 and 10" information in IndexScan->indexqual, which is a ScalarArrayOpExpr expression. I suppose the executor should scan for 1 first, and once all the tuples for 1 have been scanned, **it should be reset to 10** and scan again. However, I can't find the code for that, and index_rescan doesn't look like it is for this. Am I missing something? thanks
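For what it's worth, EXPLAIN shows the whole IN list being passed down as a single index condition (plan output paraphrased from memory; the plan shape in your environment may differ):

postgres=# explain (costs off) select * from t2 where a in (1, 10);
            QUERY PLAN
------------------------------------------------
 Index Scan using t2_idx on t2
   Index Cond: (a = ANY ('{1,10}'::integer[]))

If I read the code correctly, that is a hint that the looping over array elements happens below the executor node: for btree, which advertises amsearcharray, the array keys are advanced inside the nbtree code (nbtutils.c) and the scan is restarted there for each element, rather than through index_rescan; the executor-side path that advances array keys and calls index_rescan is only used for AMs without amsearcharray.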
Re: understand the pg locks in a simple case
On Tue, Aug 20, 2019 at 10:52 PM Alex wrote:
> On Tue, Aug 20, 2019 at 4:59 PM Heikki Linnakangas wrote:
>> On 20/08/2019 10:23, Alex wrote:
>> > I have trouble understanding the pg locks in the following simple situation.
>> >
>> > Session 1:
>> > begin; update t set a = 1 where a = 10;
>> >
>> > Session 2:
>> > begin; update t set a = 2 where a = 10;
>> >
>> > They update the same row, and session 2 is blocked by session 1, without surprise.
>> >
>> > The straightforward implementation would be: session 1 locks the tuple in *ExclusiveLock* mode; when session 2 tries to lock it in exclusive mode, it is blocked.
>> >
>> > But when I check pg_locks for session 1, I can see *no tuple lock* there; when I check session 2, I can see a *tuple (ExclusiveLock)* that is granted, but it is waiting for a transactionid.
>> >
>> > Since every tuple carries transaction information, it is not hard to implement it this way, but is there any benefit over the straightforward way? With the current implementation, what is the point of the tuple (ExclusiveLock) for session 2?
>>
>> The reason that tuple locking works with XIDs, rather than directly acquiring a lock on the tuple, is that the memory allocated for the lock manager is limited. One transaction can lock millions of tuples, and if it had to hold a normal lock on every tuple, you would run out of memory very quickly.
>
> Thank you!
>
> So can I understand it as: we don't need a lock on every tuple we update because 1) the number of locks may be huge, and holding them all would consume a lot of memory; and 2) the tuple header, which includes xid info, is unavoidable anyway due to MVCC requirements and can be used here, so we save the individual locks.
>
> And in my example above, when session 2 is waiting for a xid lock, it is *granted* a tuple lock in ExclusiveLock mode. What is the purpose of this lock?

I will try to answer this question myself. The purpose of the tuple lock (in ExclusiveLock mode) is to ensure that no more than one client adds the transaction lock on the same tuple at the same time. Once the txn lock is added, the tuple lock can be released.

>> So it may seem that we don't need heavy-weight locks on individual tuples at all. But we still need them to establish the order in which backends are waiting. The locking protocol is:
>>
>> 1. Check if a tuple's xmax is set.
>> 2. If it's set, obtain a lock on the tuple's TID.
>> 3. Wait for the transaction to finish, by trying to acquire a lock on the XID.
>> 4. Update the tuple, release the lock on the XID, and on the TID.
>>
>> It gets more complicated if there are multixids, or you update a row you have earlier locked in a weaker mode, but that's the gist of it.
>>
>> We could skip the lock on the tuple's TID, but then if you have multiple backends trying to update or lock a row, it would not be deterministic who gets the lock first. For example:
>>
>> Session A: BEGIN; UPDATE foo SET col='a' WHERE id = 123;
>> Session B: UPDATE foo SET col='b' WHERE id = 123;
>> Session C: UPDATE foo SET col='c' WHERE id = 123;
>> Session A: ROLLBACK;
>>
>> Without the lock on the TID, it would be indeterministic whether session B or C gets to update the tuple when A rolls back. With the above locking protocol, B will go first. B will acquire the lock on the TID, and block on the XID lock, while C will block on the TID lock held by B. If there were more backends trying to do the same, they would queue for the TID lock, too.
>>
>> - Heikki
any suggestions to detect memory corruption
I can get the following log randomly, and I am not sure which commit caused it. I spent one day on it but failed to find the cause.

2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
2019-05-08 21:37:46.692 CST [60110] WARNING: idx: 2 problem in alloc set index info: bad single-chunk 0x2a33a78 in block 0x2a33a18, chsize: 1408, chunkLimit: 1024, chunkHeaderSize: 24, block_used: 768 request size: 2481
2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: found inconsistent memory block 0x2a33a18

It looks like the memory managed by the "index info" memory context is being written to by some other, incorrect code. I didn't change any AllocSetXXX-related code, so I think I am just using it wrongly in some way. Thanks
Re: any suggestions to detect memory corruption
Thank you Tom and Robert! I tried valgrind, and it looks like it helped me fix the issue.

Someone added some code during backend init which used palloc, but at that time the CurrentMemoryContext is PostmasterContext. At the end of backend initialization the PostmasterContext is deleted, and then the error happens. The reason it happens randomly is that before the palloc there are some other if clauses which may skip the palloc.

I still can't explain why the PostmasterContext can sometimes have an impact on the "index info" MemoryContext, but now I just can't reproduce it (before the fix, it happened in about 30% of cases).

On Thu, May 9, 2019 at 1:21 AM Robert Haas wrote:
> On Wed, May 8, 2019 at 10:34 AM Tom Lane wrote:
> > Alex writes:
> > > I can get the following log randomly, and I am not sure which commit caused it.
> > >
> > > 2019-05-08 21:37:46.692 CST [60110] WARNING: problem in alloc set index info: req size > alloc size for chunk 0x2a33a78 in block 0x2a33a18
> >
> > I've had success in finding memory stomp causes fairly quickly by setting a hardware watchpoint in gdb on the affected location. Then you just let it run to see when the value changes, and check whether that's a "legit" or "not legit" modification point.
> >
> > The hard part of that, of course, is to know in advance where the affected location is. You may be able to make things sufficiently repeatable by doing the problem query in a fresh session each time.
>
> valgrind might also be a possibility, although that has a lot of overhead.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: any suggestions to detect memory corruption
On Thu, May 9, 2019 at 9:30 PM Tom Lane wrote:
> Alex writes:
> > Someone added some code during backend init which used palloc, but at that time the CurrentMemoryContext is PostmasterContext. At the end of backend initialization the PostmasterContext is deleted, and then the error happens. The reason it happens randomly is that before the palloc there are some other if clauses which may skip the palloc.
>
> > I still can't explain why the PostmasterContext can sometimes have an impact on the "index info" MemoryContext, but now I just can't reproduce it (before the fix, it happened in about 30% of cases).
>
> Well, once the context is deleted, that memory is available for reuse. Everything will seem fine until it *is* reused, and then boom!
>
> The error would have been a lot more obvious if you'd enabled MEMORY_CONTEXT_CHECKING, which would overwrite freed data with garbage.

Thanks! I didn't know before that "once the context is deleted, that memory is available for reuse. Everything will seem fine until it *is* reused". I have enabled --enable-cassert now.

> That is normally turned on in --enable-cassert builds. Anybody who's been hacking Postgres for more than a week does backend code development in --enable-cassert mode as a matter of course; it turns on a *lot* of helpful cross-checks.
>
> regards, tom lane
some questions about fast-path-lock
I got some ideas from the README under storage/lmgr and read some code of LockAcquireExtended, but I still have some questions.

LWLockAcquire(&MyProc->backendLock, LW_EXCLUSIVE);
if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
    acquired = false;
else
    acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode);

1. The README says: "A key point of this algorithm is that it must be possible to verify the absence of possibly conflicting locks without fighting over a shared LWLock or spinlock. Otherwise, this effort would simply move the contention bottleneck from one place to another." But in the code above there is an LWLockAcquire. Actually, I can't see how we could proceed without a lock at all.

2. Why does MyProc->backendLock work? It is per-backend (MyProc), not a global lock.

3. For the line "acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode);", I think it could be replaced with "acquired = true" (but obviously I'm wrong). I read FastPathGrantRelationLock but can't understand it.

Any hint will be helpful. Thanks!
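As a side note, you can see which of a session's own locks actually went through the fast path, since pg_locks exposes a fastpath column; for example:

SELECT locktype, relation::regclass, mode, fastpath
FROM pg_locks
WHERE pid = pg_backend_pid();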
Why index a "Recently DEAD" tuple when creating an index
HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */

This is a tuple which has been deleted AND committed, but before the delete there was a transaction that started and has not committed. Let's call this transaction Transaction A.

If we create an index at this time, let's call it Index A, it still indexes this record. My question is why this is needed.

The only reason I can think of (maybe also not reasonable enough) is: if we index like this and the isolation level of transaction A is serializable, it is possible that a query in transaction A can use Index A, since the index contains the snapshot data from when transaction A began. The reason this may not be reasonable enough is that transaction A probably should not see Index A at all.
Re: Why index a "Recently DEAD" tuple when creating an index
On Mon, Jun 10, 2019 at 3:28 PM Kuntal Ghosh wrote:
> On Mon, Jun 10, 2019 at 12:15 PM Alex wrote:
>> HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
>>
>> This is a tuple which has been deleted AND committed, but before the delete there was a transaction that started and has not committed. Let's call this transaction Transaction A.
>>
>> If we create an index at this time, let's call it Index A, it still indexes this record. My question is why this is needed.
>>
> In this case, the changes to the tuple are not visible yet. Now suppose your transaction A is serializable and you have another serializable transaction B which can see the index A. It generates a plan that requires fetching the deleted tuple through an index scan. If the tuple is not present in the index, how are you going to create a conflict edge between transaction A and transaction B?
>
> Basically, you need to identify the following clause to detect serializable conflicts: transaction A precedes transaction B. (Because transaction A has deleted a tuple and it's not visible to transaction B.)

Thanks, Ghosh. Your answer looks similar to my previous point (the transaction is serializable). Actually, if transaction B can't see the "delete" which has been committed, should it see Index A, which was created after the "delete" transaction?

> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
Re: Why index a "Recently DEAD" tuple when creating an index
On Mon, Jun 10, 2019 at 4:10 PM Kuntal Ghosh wrote:
> On Mon, Jun 10, 2019 at 1:30 PM Alex wrote:
> > On Mon, Jun 10, 2019 at 3:28 PM Kuntal Ghosh wrote:
> >> On Mon, Jun 10, 2019 at 12:15 PM Alex wrote:
> >>> HEAPTUPLE_RECENTLY_DEAD, /* tuple is dead, but not deletable yet */
> >>>
> >>> This is a tuple which has been deleted AND committed, but before the delete there was a transaction that started and has not committed. Let's call this transaction Transaction A.
> >>>
> >>> If we create an index at this time, let's call it Index A, it still indexes this record. My question is why this is needed.
> >>>
> >> In this case, the changes to the tuple are not visible yet. Now suppose your transaction A is serializable and you have another serializable transaction B which can see the index A. It generates a plan that requires fetching the deleted tuple through an index scan. If the tuple is not present in the index, how are you going to create a conflict edge between transaction A and transaction B?
> >>
> >> Basically, you need to identify the following clause to detect serializable conflicts: transaction A precedes transaction B. (Because transaction A has deleted a tuple and it's not visible to transaction B.)
> >>
> > Thanks, Ghosh. Your answer looks similar to my previous point (the transaction is serializable). Actually, if transaction B can't see the "delete" which has been committed, should it see Index A, which was created after the "delete" transaction?
>
> I think what I'm trying to say is different.
>
> For my case, the sequence is as follows:
> 1. Transaction A has deleted a tuple, say t1, and got committed.
> 2. Index A has been created successfully.
> 3. Now, transaction B starts and uses the index A to fetch the tuple t1. While doing the visibility check, transaction B gets to know that t1 has been deleted by a committed transaction A, so it can't see the tuple. But it creates a dependency edge that transaction A precedes transaction B. This edge is required to detect a serializable conflict failure.
>
> If you don't create the index entry, it'll not be able to create that edge.

Thanks, I see the difference now, but I still don't get the necessity of it.

1. Assume we don't index it; in which situation would we get a wrong result?
2. If we only support the "Read Committed" isolation level, is there a safe way to not index such data?

> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
Re: Why index a "Recently DEAD" tuple when creating an index
Thanks! I appreciate your time!

On Mon, Jun 10, 2019 at 5:34 PM Kuntal Ghosh wrote:
> On Mon, Jun 10, 2019 at 2:12 PM Alex wrote:
> > On Mon, Jun 10, 2019 at 4:10 PM Kuntal Ghosh wrote:
> >> I think what I'm trying to say is different.
> >>
> >> For my case, the sequence is as follows:
> >> 1. Transaction A has deleted a tuple, say t1, and got committed.
> >> 2. Index A has been created successfully.
> >> 3. Now, transaction B starts and uses the index A to fetch the tuple t1. While doing the visibility check, transaction B gets to know that t1 has been deleted by a committed transaction A, so it can't see the tuple. But it creates a dependency edge that transaction A precedes transaction B. This edge is required to detect a serializable conflict failure.
> >>
> >> If you don't create the index entry, it'll not be able to create that edge.
> >
> > Thanks, I see the difference now, but I still don't get the necessity of it.
> > 1. Assume we don't index it; in which situation would we get a wrong result?
>
> Consider the following sequence of three different transactions X, A and B:
>
> 1. Transaction X reads a tuple t2.
> 2. Transaction A updates the tuple t2, deletes a tuple t1 and gets committed. So transaction X precedes transaction A, i.e., X <- A.
> 3. Index A is created successfully.
> 4. Transaction B starts and uses the index A to fetch tuple t1. But it's already deleted by the committed transaction A. So transaction A precedes transaction B, i.e., A <- B.
> 5. At this point you have a dangerous structure X <- A <- B (see the definition of a dangerous structure in src/backend/storage/lmgr/README-SSI) in the graph, which can produce an anomaly. For example, if X now tries to update another tuple previously read by B, you'll have a dependency B <- X. But you already have X <- B, which leads to a serializable conflict. Postgres tries to resolve this anomaly by rolling back one of the transactions.
>
> In your case (without the index entry), it'll be difficult to detect.
>
> > 2. If we only support the "Read Committed" isolation level, is there a safe way to not index such data?
>
> I can't think of a case where the RECENTLY_DELETED tuple needs to be indexed in the "Read Committed" case. So your suggestion is likely to work logically at the "Read Committed" isolation level. But I'm not sure whether you'll encounter any assertion failures in the vacuum path or concurrent index paths.
>
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
run os command in pg_regress?
In my case, I want to sleep for 3 seconds in xxx.sql for the pg_regress program, but I don't want to run 'select pg_sleep(3)'. Is that possible with pg_regress? In psql I can run "\! sleep(3); exit;", but it looks like pg_regress doesn't support it.
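For what it's worth, pg_regress simply feeds the test script to psql, so a backslash escape in the .sql file may already do what you want. This is an untested sketch; the expected-output file would also have to match whatever psql echoes for the line:

-- xxx.sql (regression test script)
\! sleep 3
select 1;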
libpq async command processing methods are difficult to use with edge-triggered epoll
Hi, I've been using libpq to access postgres from within a system that uses an edge-triggered epoll in its event-loop. The instructions on https://www.postgresql.org/docs/current/libpq-async.html are pretty good, but I've run into a bit of an edge case that would be much easier to handle if the interfaces exposed by libpq were a little more ergonomic. If it makes a difference, I'm running with single-row mode enabled on the client.

Specifically, PQconsumeInput returns no information that allows the caller to distinguish whether there might be more data available to be read on the network socket if PQconsumeInput were to be called again (it just returns 0 on error and 1 otherwise), and PQisBusy doesn't return false until a full row's worth of data has been read by PQconsumeInput. This is a bad combination if a result set contains rows larger than PGconn's default buffer size, since calling PQconsumeInput followed by PQisBusy can suggest that we need to wait on the socket's readability even if there's already more data available to be read on the socket.

If more detail helps, here's a slightly more detailed summary based on my trawling through the code:

* The recommended pattern for processing responses to async commands is to wait until the socket is readable, then call PQconsumeInput, then check PQisBusy. If PQisBusy returns true, the docs suggest waiting on the socket again.
* When PQconsumeInput is called, it doubles the PGconn's buffer size if less than 8k of space is unused in it.
* PQconsumeInput will then read either until its remaining buffer space fills up or until the socket has no more data in it ready to be read.
* If the buffer fills up, PQconsumeInput will return to the caller even if there's still more data to be read.
* PQconsumeInput's return value only indicates whether or not there was an error, not whether any data was read.
* PQisBusy will return true unless the buffer contains an entire row; it does not actually check the status of the socket.
* If the PGconn's buffer wasn't large enough to fit an entire row in it when you called PQconsumeInput, PQisBusy will return true, suggesting that you ought to wait on socket readability, when really the right thing to do would be to call PQconsumeInput again (potentially multiple times) until the buffer finally grows to be large enough to contain the whole row before PQisBusy can return false.

This can be worked around by making a poll() syscall on the socket with a timeout of 0 before handing the socket off to epoll, but libpq could make this case easier to deal with with a slightly different interface. The function that PQconsumeInput uses internally, pqReadData, has a much more helpful interface -- it returns 1 if it successfully read at least 1 byte, 0 if no data was available, or -1 if there was an error. If PQconsumeInput had a similar range of return values, this ability to distinguish between whether we read data or not would be enough information to know whether we ought to call PQconsumeInput again when PQisBusy returns true.

As for what we can realistically do about it, I imagine it may be disruptive to add an additional possible return value to PQconsumeInput (e.g. return 1 if at least one byte was read or 2 if no data was available). And I'm not sure edge-triggered polling is a use case the community cares enough about to justify adding a separate method that does basically the same thing as PQconsumeInput but with more information conveyed by its return value.
So maybe there isn't any change worth making here, but I wanted to at least mention the problem, since it was pretty disappointing to me that calling PQconsumeInput followed by PQisBusy and then waiting on readability occasionally caused a hang on large rows. I'd expect other developers using edge-triggered epoll to run into problems like this too. If there's a backwards-compatible way of making it less error prone, it'd be nice to do so. Alex
TOAST corruption in standby database
00 00 00 00 00 ||
*
0490 a6 07 7e 02 00 00 00 00 00 00 00 00 1b 00 61 5c |..~...a\|
04a0 02 00 03 00 02 09 18 00 ae 9d d4 03 01 00 00 00 ||
04b0 d0 0a 00 00 23 25 10 07 88 02 13 0f 2c 04 78 01 |#%..,.x.|

Based on the above observations it seems to me that occasionally some of the changes aren't being replicated to or persisted by the standby database. In the past I've seen some TCP packets get mangled or dropped between our EC2 instances, leading to sudden disconnects. The standby connects to the primary using SSL (sslmode=require sslcompression=1), so I would think if there's any network-level corruption SSL would catch it, causing the connection to fail and reconnect. Outside of any SSL disconnects (which don't happen often), this database is stopped and restarted twice a week so we can clone it (using cp -a --reflink=always).

Any ideas on what might be causing this?

Thanks,
Alex
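In case it helps anyone hitting the same thing, the way I usually locate the broken rows is to force detoasting row by row and trap the error. This is just a sketch; "my_table" is a placeholder for the affected table:

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM my_table LOOP   -- my_table is a placeholder
        BEGIN
            -- casting the whole row to text forces every TOASTed column to be detoasted
            PERFORM length(t::text) FROM my_table t WHERE t.ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'bad row at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END $$;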
Corruption with duplicate primary key
We have a Postgres 10 database that we recently upgraded to Postgres 12 using pg_upgrade. We recently discovered that there are rows in one of the tables that have duplicate primary keys:

record_loader=# \d loader.sync
                            Table "loader.sync"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
 source            | text                     |           | not null |
 natural_key       | text                     |           | not null |
 payload           | jsonb                    |           |          |
 dispatched        | timestamp with time zone |           | not null | now()
 initial_load_id   | text                     |           |          |
 deleted_load_id   | text                     |           |          |
 created_timestamp | timestamp with time zone |           |          | now()
 updated_timestamp | timestamp with time zone |           |          | now()
 deleted_timestamp | timestamp with time zone |           |          |
Indexes:
    "sync_pkey" PRIMARY KEY, btree (source, natural_key)
Publications:
    "debezium"

This table is modified via triggers that fire off when a COPY command inserts many rows into another table. Here are two example duplicate rows:

# SELECT xmin, xmax, cmin, cmax, source, md5(natural_key) AS natural_key_hash, dispatched, created_timestamp, updated_timestamp, deleted_timestamp
  FROM loader.sync
  WHERE (source, natural_key) = ('ok_lease', '...')
  ORDER BY xmin::text::int, cmin::text::int;
-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 118583
cmin              | 5312208
cmax              | 5312208
source            | ok_lease
natural_key_hash  | de3e9a567b90025c3399c4c63c823fe9
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-17 21:24:19.558219+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00

It appears that the second row was in place originally, then got updated by a trigger (and even deleted later on, although it doesn't appear that the delete transaction got committed), and then the first row was inserted within the same transaction that updated the second row. Another example:

-[ RECORD 1 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304403
cmax              | 5304403
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-10 05:09:24.214964+00
created_timestamp | 2019-05-31 06:00:33.765547+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp | 2019-11-24 05:09:36.099686+00
-[ RECORD 2 ]-----+---------------------------------
xmin              | 116649
xmax              | 0
cmin              | 5304404
cmax              | 5304404
source            | ok_lease
natural_key_hash  | 1c8031348701a32cb5fee26839d6b0b4
dispatched        | 2019-11-24 05:09:36.099686+00
created_timestamp | 2019-11-24 05:09:36.099686+00
updated_timestamp | 2019-11-24 05:09:36.099686+00
deleted_timestamp |

Both examples have in common that the two duplicate rows were touched within the same transaction.

This database runs inside Docker, with the data directory bind-mounted to a reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 kernel inside an AWS EC2 instance. We have Debezium stream data from this database via pgoutput.

Recreating the primary key confirms that the constraint doesn't (or at least shouldn't) permit these duplicate rows:

record_loader=# BEGIN;
BEGIN
record_loader=# ALTER TABLE loader.sync DROP CONSTRAINT sync_pkey;
ALTER TABLE
record_loader=# ALTER TABLE loader.sync ADD CONSTRAINT sync_pkey PRIMARY KEY (source, natural_key);
ERROR:  could not create unique index "sync_pkey"
DETAIL:  Key (source, natural_key)=(ok_lease, ...) is duplicated.
CONTEXT:  parallel worker

Any ideas on what might cause this behavior?

Thanks,
Alex
Re: Corruption with duplicate primary key
On Thu, December 5, 2019 at 5:34 PM Peter Geoghegan wrote: > > We have a Postgres 10 database that we recently upgraded to Postgres 12 > > using pg_upgrade. We recently discovered that there are rows in one of the > > tables that have duplicate primary keys: > > What's the timeline here? In other words, does it look like these rows > were updated and/or deleted before, around the same time as, or after > the upgrade? The Postgres 12 upgrade was performed on 2019-11-22, so the affected rows were modified after this upgrade (although some of the rows were originally inserted before then, before they were modified/duplicated). > > This database runs inside Docker, with the data directory bind-mounted to a > > reflink-enabled XFS filesystem. The VM is running Debian's 4.19.16-1~bpo9+1 > > kernel inside an AWS EC2 instance. We have Debezium stream data from this > > database via pgoutput. > > That seems suspicious, since reflink support for XFS is rather immature. Good point. Looking at kernel commits since 4.19.16 it appears that there have been a few bug fixes in later kernel versions that address a few XFS corruption issues. Regardless of whether FS bugs are responsible of this corruption I'll plan on upgrading to a newer kernel. > How did you invoke pg_upgrade? Did you use the --link (hard link) option? Yes, we first created a backup using "cp -a --reflink=always", ran initdb on the new directory, and then upgraded using "pg_upgrade -b ... -B ... -d ... -D -k". Alex
Re: Corruption with duplicate primary key
On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote: > At first I thought maybe this might be due to collations > changing and breaking the index silently. What collation are you using? We're using en_US.utf8. We did not make any collation changes to my knowledge. > 1) When you do the queries, do they use index scan or sequential scan? > Perhaps it does sequential scan, and if you force index scan (e.g. by > rewriting the query) it'll only find one of those rows. By default it used an index scan. When I re-ran the query today (and confirmed that the query used an index only scan) I did not see any duplicates. If I force a sequential scan using "SET enable_index[only]scan = false" the duplicates reappear. However, using a backup from a week ago I see duplicates in both the query that uses an index only scan as well as the query that uses the sequential scan. So somehow over the past week the index got changed to eliminate duplicates. > 2) Can you check in backups if this data corruption was present in the > PG10 cluster, before running pg_upgrade? Sure. I just checked and did not see any corruption in the PG10 pre-upgrade backup. I also re-upgraded that PG10 backup to PG12, and right after the upgrade I did not see any corruption either. I checked using both index scans and sequential scans. Alex
Re: Corruption with duplicate primary key
On Mon, December 9, 2019 at 11:05 AM Finnerty, Jim wrote: > If you have BEFORE triggers, and a BEFORE trigger signaled failure with > RETURN NULL, then this is one known (and documented) issue that I think could > cause the behavior you're reporting: > > https://www.postgresql-archive.org/BEFORE-triggers-that-return-NULL-can-circumvent-referential-integrity-tt6056390.html#none > > It's hard to say if this is the cause or not, but if you have any BEFORE > triggers that RETURN NULL, you might want to review the documentation very > carefully. We do have a BEFORE INSERT trigger, but it should never return NULL. This trigger INSERTs into a different table using an ON CONFLICT DO NOTHING clause and then does a RETURN NEW. Alex
Re: Corruption with duplicate primary key
On Thu., December 12, 2019 at 5:25 PM, Tomas Vondra wrote: >On Wed, Dec 11, 2019 at 11:46:40PM +0000, Alex Adriaanse wrote: >>On Thu., December 5, 2019 at 5:45 PM, Tomas Vondra wrote: >>> At first I thought maybe this might be due to collations changing and >>> breaking the index silently. What collation are you using? >> >>We're using en_US.utf8. We did not make any collation changes to my >>knowledge. > >Well, the idea was more that glibc got updated and the collations >changed because of that (without PostgreSQL having a chance to even >notice that). Closing the loop on this, I've investigated this some more and it turns out this is exactly what happened. As you suspected, the issue had nothing to do with pg_upgrade or PG12, but rather the glibc upgrade that was seen in Debian Buster. The postgres:10 and postgres:11 images are based on Debian Stretch, whereas postgres:12 is based on Buster. When I kept the database on an older version of Postgres (10 or 11) but switched from the older Docker image to the postgres:12 or debian:buster(-slim) image, manually installing older Postgres packages inside those images, I saw index corruption there too. Thanks for the input! Alex
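For anyone else bitten by a glibc collation change: contrib/amcheck can be used to confirm whether a text btree index is still consistent with the (new) collation order before trusting it. Roughly something like this, checking every btree index in the affected schema (the schema name here is just the one from our case):

CREATE EXTENSION IF NOT EXISTS amcheck;

SELECT c.relname, bt_index_check(c.oid)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_am am ON am.oid = c.relam
WHERE n.nspname = 'loader'
  AND c.relkind = 'i'
  AND am.amname = 'btree';

Indexes that report ordering violations would need a REINDEX once the duplicate rows are cleaned up.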
Help to review the with X cursor option.
A cursor means something like: declare c cursor for select * from t;
A holdable cursor means: declare c cursor WITH HOLD for select * from t;

A holdable cursor is friendly to transactions: the user can still access it after the transaction is committed. But the downside is that it has to save all the records into a tuplestore before we fetch even one row.

What I want is:
1. The cursor is still usable after the transaction is committed.
2. The cursor does not fetch the data until a fetch statement is issued (just like a non-holdable cursor).

I call this a "with X" cursor.

I checked the current implementation and think it would be possible with the following changes:
1. Allocate the memory in a longer-lived memory context, like the EState's, to prevent it from being released at commit.
2. Allocate a longer-lived resource owner to prevent the LockReleaseAll during CommitTransaction.
3. Add the "with X" option to the cursor so that PreCommit_Portals will not drop it during CommitTransaction.

Before I implement it, could you give some suggestions? Thanks!
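To illustrate the cost I'm talking about: with a holdable cursor, the whole result set is materialized into a tuplestore at commit time, before the first FETCH (sketch, reusing the same cursor and table names as above):

begin;
declare c cursor with hold for select * from t;
commit;            -- the entire query result is run and stored in a tuplestore here
fetch 1 from c;    -- served from the tuplestore, not from the live table
close c;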
Re: Help to review the with X cursor option.
On Wed, Apr 24, 2019 at 11:30 PM Tom Lane wrote:
> alex lock writes:
> > A cursor means something like: declare c cursor for select * from t;
> > A holdable cursor means: declare c cursor WITH HOLD for select * from t;
> >
> > A holdable cursor is friendly to transactions: the user can still access it after the transaction is committed. But the downside is that it has to save all the records into a tuplestore before we fetch even one row.
> >
> > What I want is:
> > 1. The cursor is still usable after the transaction is committed.
> > 2. The cursor does not fetch the data until a fetch statement is issued (just like a non-holdable cursor).
> >
> > I call this a "with X" cursor.
> >
> > I checked the current implementation and think it would be possible with the following changes:
> > 1. Allocate the memory in a longer-lived memory context, like the EState's, to prevent it from being released at commit.
> > 2. Allocate a longer-lived resource owner to prevent the LockReleaseAll during CommitTransaction.
> > 3. Add the "with X" option to the cursor so that PreCommit_Portals will not drop it during CommitTransaction.
> >
> > Before I implement it, could you give some suggestions?
>
> You don't actually understand the problem.

Thanks, Tom. I know that, and that's just something I want to change.

> The reason a holdable cursor forcibly reads all the data before commit is that the data might not be there to read any later than that.

I think this can be handled with a snapshot read: we want the data as of time 1, and even if the data is gone at time 2, as long as we keep the snapshot we can still read it. Oracle has a similar feature called flashback query: https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm#1008580 .

> Once we end the transaction and release its snapshot (specifically, advance the backend's advertised global xmin), it's possible and indeed desirable for obsoleted row versions to be vacuumed.

That's something I want to change, as I said at the beginning, including avoiding some memory releases (the EState and so on) and the snapshot release.

> The only way to avoid that would be to not advance xmin, which is pretty much just as bad as not committing the transaction.

There is a difference between "not advancing xmin" and "not committing the transaction" for me: not committing the transaction ties up the connection, while not advancing xmin does not. If not for that, a non-holdable cursor would be good enough for me.

> Not releasing the transaction's locks is also bad.

Assuming that if the table were dropped between fetches we could just raise an error, could we release the lock? I am still not sure about this part, but keeping the lock is still acceptable for me since it does not tie up the connection (my purpose); releasing the lock would be better, though.

> So it doesn't seem like there's anything to be gained here that you don't have today by just not committing yet.

What is gained is the connection :) I want to run DML or other stuff on the current connection.

> If you're concerned about not losing work due to possible errors later in the transaction, you could prevent those from causing problems through subtransactions (savepoints).

Thanks for the tip. I had considered that possibility and can think about it more; the business model is a bit complex and I don't want to go into it here.

> regards, tom lane
Re: Help to review the with X cursor option.
On Thu, Apr 25, 2019 at 9:53 AM alex lock wrote:
> That's something I want to change, as I said at the beginning, including avoiding some memory releases (the EState and so on) and the snapshot release.

I checked my original statement and found that "snapshot release" was missing from it; that is obviously a key point.
RE: Direct converting numeric types to bool
-----Original Message-----
From: n.zhuch...@postgrespro.ru [mailto:n.zhuch...@postgrespro.ru]
Sent: Wednesday, February 28, 2018 6:04 PM
To: pgsql-hackers
Subject: Direct converting numeric types to bool

The attached patch allows direct conversion of numeric types to bool, like integer::bool. Supported types:
- smallint;
- bigint;
- real;
- double precision;
- decimal(numeric).

This functionality helps with migration from Oracle.

--
Nikita Zhuchkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Hello!

What prevents us from:

postgres=# select 1::bigint::int::boolean;
 bool
------
 t
(1 row)

It is just one additional cast and requires no additional patching.

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
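For context, the casts the patch adds are the ones that currently fail outright; today only integer has a built-in cast to boolean, which is why the intermediate ::int is needed. From memory (the exact message text may differ slightly):

postgres=# select 1::bigint::boolean;
ERROR:  cannot cast type bigint to boolean
LINE 1: select 1::bigint::boolean;
                        ^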
Re: All Taxi Services need Index Clustered Heap Append
https://aws.amazon.com/dms/ DMS might be helpful if you need to move off of RDS Alex Kane On Mon, Mar 5, 2018 at 11:48 AM, Ants Aasma wrote: > On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski > wrote: > >> This approach mixes well with hash > >> partitioning. It would be neat indeed if PostgreSQL do something > >> equivalent on its own, and pluggable storage work being done could > >> enable index organized tables that would help. But you probably need > >> something right now. > > > > > > Fixing glaring issues (no vacuum and thus no Index-Only Scan on > append-only > > tables, vacuum processing all of the eternity of btree) by 11 will get > most > > of spike-nails out of the microservice code, and we can probably live > with > > them until 11 gets to RDS. > > > > I also don't see why a pluggable storage is a must for the clustered > write. > > Postgres does have a mechanism for selecting the next page to write tuple > > to, right now it's just looking at FSM - but what if it just peeked at > > existing index that already has enough the data to route tuple to correct > > page on write? > > The mechanism you outlined would likely work for your use case, but it > has many issues that prevent it from being universally useful. From > the top of my head: > > * One extra index descent per insertion (I/O for this is necessary > anyway, but CPU work is duplicated). > * We don't currently track the amount of bloat. A mechanism that does > this needs to be added. > * If table hits the bloat limit there will be a sudden change in > behavior. This is pretty nasty from an operations point of view. > * With your (id,ts) clustering and data coming in mostly ordered by > timestamp, after initial warmup, each page will contain rows from a > single id, but different ids are arbitrarily interleaved. This is > better than current state, but people might want to have an > interleaving step bigger than 8kB to better utilize storage hardware. > * It seems that with a common (ts) clustering and age of timestamp > coming from an exponential distribution, this will quickly bloat to > threshold and then insert data in a rather arbitrary order. This is > much worse than the default behavior. > > At least in my opinion these problems make it a special case > optimization that is hard to justify in core. A decent alternative > would be a plugin mechanism for locating free space for a tuple where > you can write your extension to find a suitable location for the row. > > >> I guess I don't have to tell you that it looks like your needs have > >> outgrown what RDS works well with and you are in for a painful move > >> sooner or later. > > > > > > Painful move where to? If we just run a Postgres instance without RDS > we'll > > get the pain of setting up Postgres and replication and backups and > > autofailover, with no visible gain except if we get some private / > > unaccepted patches applied to it. If we can get these things right > upstream > > why would we want to switch? > > EC2 for example. Mainly because I3 instances and ephemeral provide an > order of magnitude or two of performance improvement while costing > less. Being able to run custom extensions and patches if necessary is > a nice bonus. Yes, setting up replication, autofailover and backups is > extra work that you have to weigh against the benefits. But don't > overestimate the effort - there are some pretty nice tools available > that make a proper cluster relatively simple to set up. 
> > > Per my colleagues, MySQL offers clustered index, also MySQL is available > on > > RDS without the need of "painful move", which is doable by writing to two > > locations for a day and then pointing readers to new DB. But if we can > > instead do no move and be sure the issues are gone upstream before we hit > > the limit of spike-nails we're running on currently, wouldn't that be > > better? :) > > The move off of RDS is painful because getting data out of RDS > involves either downtime or building an ad-hoc logical replication > solution. You need to solve that regardless of where you move to. > > Providing an out-of-the-box solution in core PostgreSQL would of > course be best, but realistically you will be waiting at least 2 years > to get it on RDS. In the meanwhile either the buffer partition > approach I described, or a buffering microservice in front of > PostgreSQL like Aleksander recommended should fix data locality for > you. If you weren't running on RDS I would even propose using Redis as > the buffer with one key per driver and redis_fdw to make the data > accessible from within PostgreSQL. > > Regards, > Ants Aasma > -- > +43-670-6056265 > Cybertec Schönig & Schönig GmbH > Gröhrmühlgasse 26, A-2700 Wiener Neustadt > Web: https://www.cybertec-postgresql.com > >
Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs
First of all, thank you for looking into this.

At the moment we work around the problem by altering the `acc` ROLE into a SUPERUSER on PostgreSQL 16 instances. It sidesteps the problem and has the lowest implementation cost for us. While at first this felt like opening a security hole, it does not introduce side effects for **our use case**, given the way our application makes use of this `acc` ROLE. Of course we cannot recommend the workaround we took to others in a similar situation.

On Fri, Mar 22, 2024 at 7:59 AM Tom Lane wrote:
>
> Nathan Bossart writes:
> > On Thu, Mar 21, 2024 at 03:40:12PM -0500, Nathan Bossart wrote:
> >> On Thu, Mar 21, 2024 at 04:31:45PM -0400, Tom Lane wrote:
> >>> I don't think we have any really cheap way to de-duplicate the role OIDs, especially seeing that it has to be done on-the-fly within the collection loop, and the order of roles_list is at least potentially interesting. Not sure how to make further progress without a lot of work.
>
> >> Assuming these are larger lists, this might benefit from optimizations involving SIMD intrinsics.
>
> > Never mind. With the reproduction script, I'm only seeing a ~2% improvement with my patches.
>
> Yeah, you cannot beat an O(N^2) problem by throwing SIMD at it.
>
> However ... I just remembered that we have a Bloom filter implementation in core now (src/backend/lib/bloomfilter.c). How about using that to quickly reject (hopefully) most role OIDs, and only do the list_member_oid check if the filter passes?
>
> regards, tom lane
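For anyone trying to reproduce the slowdown without the original reproduction script, the shape of our setup is roughly the following. This is only a sketch: the role names and counts are made up, and I haven't re-verified that this minimal form shows the same degradation. The essential part is that `acc` is a non-superuser that is a member of thousands of roles, and the slow statement is a GRANT executed as `acc`.

CREATE ROLE acc NOSUPERUSER CREATEROLE LOGIN;
DO $$
BEGIN
    FOR i IN 1..10000 LOOP
        EXECUTE format('CREATE ROLE r_%s', i);
        EXECUTE format('GRANT r_%s TO acc WITH ADMIN OPTION', i);
    END LOOP;
END $$;

-- connected as acc:
CREATE ROLE target_role;
GRANT r_1 TO target_role;   -- this kind of statement is what became slow on 16 for us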
Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.
cpp#L235>, so there is no abi issue. Big end or little end is encoded in target triple like ppc64 (big endian), ppc64le (little endian), and a recent riscv64be patch <https://reviews.llvm.org/D128612>. I guess that is why there are no endian issues. -- *From:* Thomas Munro *Sent:* Thursday, December 15, 2022 9:59:39 AM *To:* David Rowley *Cc:* Alex Fan ; pgsql-hack...@postgresql.org < pgsql-hack...@postgresql.org>; and...@anarazel.de ; geidav...@gmail.com ; l...@swarm64.com *Subject:* Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld. On Thu, Nov 24, 2022 at 12:08 AM David Rowley wrote: > On Wed, 23 Nov 2022 at 23:13, Alex Fan wrote: > > I am new to the postgres community and apologise for resending this as the previous one didn't include patch properly and didn't cc reviewers (maybe the reason it has been buried in mailing list for months) > > Welcome to the community! +1 I don't know enough about LLVM or RISCV to have any strong opinions here, but I have a couple of questions... It looks like we have two different things in this patch: 1. Optionally use JITLink instead of RuntimeDyld for relocation. >From what I can tell from some quick googling, that is necessary for RISCV because they haven't got around to doing this yet: https://reviews.llvm.org/D127842 Independently of that, it seems that https://llvm.org/docs/JITLink.html is the future and RuntimeDyld will eventually be obsolete, so one question I have is: why should we do this only for riscv? You mentioned that this change might be necessary to support COFF and thus Windows. I'm not a Windows user and I think it would be beyond my pain threshold to try to get this working there by using CI alone, but I'm just curious... wouldn't https://github.com/llvm/llvm-project/blob/main/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldCOFF.cpp work for that already? (I haven't heard about anyone successfully using PostgreSQL/LLVM on Windows; it would certainly be cool to hear some more about what would be needed for that.) 2. Manually adjust the CPU features and ABI/subtarget. +#if defined(__riscv) +/* getHostCPUName returns "generic-rv[32|64]", which lacks all features */ +Features.AddFeature("m", true); +Features.AddFeature("a", true); +Features.AddFeature("c", true); +# if defined(__riscv_float_abi_single) +Features.AddFeature("f", true); +# endif +# if defined(__riscv_float_abi_double) +Features.AddFeature("d", true); +# endif +#endif I'm trying to understand this, and the ABI name selection logic. Maybe there are two categories of features here? The ABI bits, "f" and "d" are not just "which instructions can I used", but they also affect the ABI (I guess something like: where floats go in the calling convention), and they have to match the ABI of the main executable to allow linking to succeed, right? Probably a stupid question: wouldn't the subtarget/ABI be the same as the one that the LLVM library itself was compiled for (which must also match the postgres executable), and doesn't it know that somewhere? I guess I'm confused about why we don't need to deal with this kind of manual subtarget selection on any other architecture: for PPC it automatically knows whether to be big endian/little endian, 32 or 64 bit, etc. Then for "m", "a", "c", I guess these are code generation options -- I think "c" is compressed instructions for example? Can we get a comment to say what they are? Why do you think that all RISCV chips have these features? 
Perhaps these are features that are part of some kind of server chip profile (ie features not present in a tiny microcontroller chip found in a toaster, but expected in any system that would actually run PostgreSQL) -- in which case can we get a reference to explain that? I remembered the specific reason why we have that LLVMGethostCPUFeatures() call: it's because the list of default features that would apply otherwise based on CPU "name" alone turned out to assume that all x86 chips had AVX, but some low end parts don't, so we have to check for AVX etc presence that way. But your patch seems to imply that LLVM is not able to get features reliably for RISCV -- why not, immaturity or technical reason why it can't? +assert(ES && "ES must not be null"); We use our own Assert() macro (capital A). From 4e7a65d868de016cde7a8719c64eedcf66cc3405 Mon Sep 17 00:00:00 2001 From: Alex Fan Date: Mon, 29 Aug 2022 15:24:16 +0800 Subject: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld. This brings the bonus of support jitting on riscv64 (included in this patch) Assume gc(imafdac) extension since generally it is
Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.
There is discussion in https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/13 about changing the ABI default, but it hasn't received much attention for some time. The consensus seems to be to set the ABI and extensions explicitly.

> I recommend proposing a patch for adding such an API to LLVM.

I would like to try that some time later. JITLink allows a lot of flexibility to inspect each linking process, and I don't feel I know enough use cases yet to propose a good enough C ABI for it.

The thing I am thinking about is that these patches to llvm will take some time to land, especially the ABI and extension defaults. But jitlink and orc for riscv have been very mature since llvm-15, and even llvm-14 with two minor patches. It would be good to have these bits, though ugly, so that postgresql jit can work with llvm-15, as most distros are still moving to it.

cheers,
Alex Fan

On Sun, Dec 25, 2022 at 11:02 PM Andres Freund wrote:
> Hi,
>
> On 2022-11-23 21:13:04 +1100, Alex Fan wrote:
> > > @@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context)
> > >             context->module = LLVMModuleCreateWithName("pg");
> > >             LLVMSetTarget(context->module, llvm_triple);
> > >             LLVMSetDataLayout(context->module, llvm_layout);
> > > +#ifdef __riscv
> > > +#if __riscv_xlen == 64
> > > +#ifdef __riscv_float_abi_double
> > > +           abiname = "lp64d";
> > > +#elif defined(__riscv_float_abi_single)
> > > +           abiname = "lp64f";
> > > +#else
> > > +           abiname = "lp64";
> > > +#endif
> > > +#elif __riscv_xlen == 32
> > > +#ifdef __riscv_float_abi_double
> > > +           abiname = "ilp32d";
> > > +#elif defined(__riscv_float_abi_single)
> > > +           abiname = "ilp32f";
> > > +#else
> > > +           abiname = "ilp32";
> > > +#endif
> > > +#else
> > > +           elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen);
> > > +#endif
> > > +           /*
> > > +            * set this manually to avoid llvm defaulting to soft float and
> > > +            * resulting in linker error: `can't link double-float modules
> > > +            * with soft-float modules`
> > > +            * we could set this for TargetMachine via MCOptions, but there
> > > +            * is no C API for it
> > > +            * ref:
>
> I think this is something that should go into the llvm code, rather than postgres.
>
> > > @@ -820,16 +861,21 @@ llvm_session_initialize(void)
> > >     elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features \"%s\"",
> > >          cpu, features);
> > >
> > > +#ifdef __riscv
> > > +   reloc=LLVMRelocPIC;
> > > +   codemodel=LLVMCodeModelMedium;
> > > +#endif
>
> Same.
>
> > > +#ifdef USE_JITLINK
> > > +/*
> > > + * There is no public C API to create ObjectLinkingLayer for JITLINK, create our own
> > > + */
> > > +DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ExecutionSession, LLVMOrcExecutionSessionRef)
> > > +DEFINE_SIMPLE_CONVERSION_FUNCTIONS(llvm::orc::ObjectLayer, LLVMOrcObjectLayerRef)
>
> I recommend proposing a patch for adding such an API to LLVM.
>
> Greetings,
>
> Andres Freund
Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
On Wed, Feb 19, 2020 at 4:35 PM Alex Malek wrote: > > Hello Postgres Hackers - > > We are having a reoccurring issue on 2 of our replicas where replication > stops due to this message: > "incorrect resource manager data checksum in record at ..." > This has been occurring on average once every 1 to 2 weeks during large > data imports (100s of GBs being written) > on one of two replicas. > Fixing the issue has been relatively straight forward: shutdown replica, > remove the bad wal file, restart replica and > the good wal file is retrieved from the master. > We are doing streaming replication using replication slots. > However twice now, the master had already removed the WAL file so the file > had to retrieved from the wal archive. > > The WAL log directories on the master and the replicas are on ZFS file > systems. > All servers are running RHEL 7.7 (Maipo) > PostgreSQL 10.11 > ZFS v0.7.13-1 > > The issue seems similar to > https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com > and to https://github.com/timescale/timescaledb/issues/1443 > > One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS > sees our array as a single device. > > > An update in case someone else encounters the same issue. About 5 weeks ago, on the master database server, we turned off ZFS compression for the volume where the WAL log resides. The error has not occurred on any replica since. Best, Alex
Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
On Thu, Apr 2, 2020 at 2:10 PM Andres Freund wrote: > Hi, > > On 2020-02-19 16:35:53 -0500, Alex Malek wrote: > > We are having a reoccurring issue on 2 of our replicas where replication > > stops due to this message: > > "incorrect resource manager data checksum in record at ..." > > Could you show the *exact* log output please? Because this could > temporarily occur without signalling anything bad, if e.g. the > replication connection goes down. > Feb 23 00:02:02 wrds-pgdata10-2-w postgres[68329]: [12491-1] 5e4aac44.10ae9 (@) LOG: incorrect resource manager data checksum in record at 39002/57AC0338 When it occurred replication stopped. The only way to resume replication was to stop server and remove bad WAL file. > > > > Right before the issue started we did some upgrades and altered some > > postgres configs and ZFS settings. > > We have been slowly rolling back changes but so far the the issue > continues. > > > > Some interesting data points while debugging: > > We had lowered the ZFS recordsize from 128K to 32K and for that week the > > issue started happening every other day. > > Using xxd and diff we compared "good" and "bad" wal files and the > > differences were not random bad bytes. > > > > The bad file either had a block of zeros that were not in the good file > at > > that position or other data. Occasionally the bad data has contained > > legible strings not in the good file at that position. At least one of > > those exact strings has existed elsewhere in the files. > > However I am not sure if that is the case for all of them. > > > > This made me think that maybe there was an issue w/ wal file recycling > and > > ZFS under heavy load, so we tried lowering > > min_wal_size in order to "discourage" wal file recycling but my > > understanding is a low value discourages recycling but it will still > > happen (unless setting wal_recycle in psql 12). > > This sounds a lot more like a broken filesystem than anythingon the PG > level. > Probably. In my recent updated comment turning off ZFS compression on master seems to have fixed the issue. However I will note that the WAL file stored on the master was always fine upon inspection. > > > > When using replication slots, what circumstances would cause the master > to > > not save the WAL file? > > What do you mean by "save the WAL file"? > Typically, when using replication slots, when replication stops the master will save the next needed WAL file. However once or twice when this error occurred the master recycled/removed the WAL file needed. I suspect perhaps b/c the replica had started to read the WAL file it sent some signal to the master that the WAL file was already consumed. I am guessing, not knowing exactly what is happening and w/ the caveat that this situation was rare and not the norm. It is also possible caused by a different error. Thanks. Alex
Re: [PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.
Hi, I am new to the postgres community and apologise for resending this as the previous one didn't include patch properly and didn't cc reviewers (maybe the reason it has been buried in mailing list for months) Adding to previous email, this patch exposes its own C API for creating ObjectLinkingLayer in a similar fashion as LLVMOrcCreateRTDyldObjectLinkingLayerWithSectionMemoryManager since orc doesn't expose it yet. Thanks and really appreciate if someone can offer a review to this and help get it merged. Cheers, Alex On Mon, Aug 29, 2022 at 5:46 PM Alex Fan wrote: > This brings the bonus of support jitting on riscv64 (included in this > patch) > and other platforms Rtdyld doesn't support, e.g. windows COFF. > > Currently, llvm doesn't expose jitlink (ObjectLinkingLayer) via C API, so > a wrapper is added. This also adds minor llvm 15 compat fix that is needed > --- > config/llvm.m4| 1 + > src/backend/jit/llvm/llvmjit.c| 67 +-- > src/backend/jit/llvm/llvmjit_wrap.cpp | 35 ++ > src/include/jit/llvmjit.h | 9 > 4 files changed, 108 insertions(+), 4 deletions(-) > > diff --git a/config/llvm.m4 b/config/llvm.m4 > index 3a75cd8b4d..a31b8b304a 100644 > --- a/config/llvm.m4 > +++ b/config/llvm.m4 > @@ -75,6 +75,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], >engine) pgac_components="$pgac_components $pgac_component";; >debuginfodwarf) pgac_components="$pgac_components $pgac_component";; >orcjit) pgac_components="$pgac_components $pgac_component";; > + jitlink) pgac_components="$pgac_components $pgac_component";; >passes) pgac_components="$pgac_components $pgac_component";; >native) pgac_components="$pgac_components $pgac_component";; >perfjitevents) pgac_components="$pgac_components $pgac_component";; > diff --git a/src/backend/jit/llvm/llvmjit.c > b/src/backend/jit/llvm/llvmjit.c > index 6c72d43beb..d8b840da8c 100644 > --- a/src/backend/jit/llvm/llvmjit.c > +++ b/src/backend/jit/llvm/llvmjit.c > @@ -229,6 +229,11 @@ llvm_release_context(JitContext *context) > LLVMModuleRef > llvm_mutable_module(LLVMJitContext *context) > { > +#ifdef __riscv > + const char* abiname; > + const char* target_abi = "target-abi"; > + LLVMMetadataRef abi_metadata; > +#endif > llvm_assert_in_fatal_section(); > > /* > @@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context) > context->module = LLVMModuleCreateWithName("pg"); > LLVMSetTarget(context->module, llvm_triple); > LLVMSetDataLayout(context->module, llvm_layout); > +#ifdef __riscv > +#if __riscv_xlen == 64 > +#ifdef __riscv_float_abi_double > + abiname = "lp64d"; > +#elif defined(__riscv_float_abi_single) > + abiname = "lp64f"; > +#else > + abiname = "lp64"; > +#endif > +#elif __riscv_xlen == 32 > +#ifdef __riscv_float_abi_double > + abiname = "ilp32d"; > +#elif defined(__riscv_float_abi_single) > + abiname = "ilp32f"; > +#else > + abiname = "ilp32"; > +#endif > +#else > + elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen); > +#endif > + /* > +* set this manually to avoid llvm defaulting to soft > float and > +* resulting in linker error: `can't link double-float > modules > +* with soft-float modules` > +* we could set this for TargetMachine via MCOptions, but > there > +* is no C API for it > +* ref: > https://github.com/llvm/llvm-project/blob/afa520ab34803c82587ea6759bfd352579f741b4/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp#L90 > +*/ > + abi_metadata = LLVMMDStringInContext2( > + LLVMGetModuleContext(context->module), > + abiname, strlen(abiname)); > + LLVMAddModuleFlag(context->module, > LLVMModuleFlagBehaviorOverride, > + target_abi, strlen(target_abi), 
abi_metadata); > +#endif > } > > return context->module; > @@ -786,6 +825,8 @@ llvm_session_initialize(void) > char *error = NULL; > char *cpu = NULL; > char *features = NULL; > + LLVMRelocMode reloc=LLVMRelocDefault; > + LLVMCodeModel codemodel=LLVMCodeModelJITDefault; > LLVMTargetMachineRef opt0_tm; >
[PATCH] Enable using llvm jitlink as an alternative llvm jit linker of old Rtdyld.
This brings the bonus of support jitting on riscv64 (included in this patch) and other platforms Rtdyld doesn't support, e.g. windows COFF. Currently, llvm doesn't expose jitlink (ObjectLinkingLayer) via C API, so a wrapper is added. This also adds minor llvm 15 compat fix that is needed --- config/llvm.m4| 1 + src/backend/jit/llvm/llvmjit.c| 67 +-- src/backend/jit/llvm/llvmjit_wrap.cpp | 35 ++ src/include/jit/llvmjit.h | 9 4 files changed, 108 insertions(+), 4 deletions(-) diff --git a/config/llvm.m4 b/config/llvm.m4 index 3a75cd8b4d..a31b8b304a 100644 --- a/config/llvm.m4 +++ b/config/llvm.m4 @@ -75,6 +75,7 @@ AC_DEFUN([PGAC_LLVM_SUPPORT], engine) pgac_components="$pgac_components $pgac_component";; debuginfodwarf) pgac_components="$pgac_components $pgac_component";; orcjit) pgac_components="$pgac_components $pgac_component";; + jitlink) pgac_components="$pgac_components $pgac_component";; passes) pgac_components="$pgac_components $pgac_component";; native) pgac_components="$pgac_components $pgac_component";; perfjitevents) pgac_components="$pgac_components $pgac_component";; diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c index 6c72d43beb..d8b840da8c 100644 --- a/src/backend/jit/llvm/llvmjit.c +++ b/src/backend/jit/llvm/llvmjit.c @@ -229,6 +229,11 @@ llvm_release_context(JitContext *context) LLVMModuleRef llvm_mutable_module(LLVMJitContext *context) { +#ifdef __riscv + const char* abiname; + const char* target_abi = "target-abi"; + LLVMMetadataRef abi_metadata; +#endif llvm_assert_in_fatal_section(); /* @@ -241,6 +246,40 @@ llvm_mutable_module(LLVMJitContext *context) context->module = LLVMModuleCreateWithName("pg"); LLVMSetTarget(context->module, llvm_triple); LLVMSetDataLayout(context->module, llvm_layout); +#ifdef __riscv +#if __riscv_xlen == 64 +#ifdef __riscv_float_abi_double + abiname = "lp64d"; +#elif defined(__riscv_float_abi_single) + abiname = "lp64f"; +#else + abiname = "lp64"; +#endif +#elif __riscv_xlen == 32 +#ifdef __riscv_float_abi_double + abiname = "ilp32d"; +#elif defined(__riscv_float_abi_single) + abiname = "ilp32f"; +#else + abiname = "ilp32"; +#endif +#else + elog(ERROR, "unsupported riscv xlen %d", __riscv_xlen); +#endif + /* +* set this manually to avoid llvm defaulting to soft float and +* resulting in linker error: `can't link double-float modules +* with soft-float modules` +* we could set this for TargetMachine via MCOptions, but there +* is no C API for it +* ref: https://github.com/llvm/llvm-project/blob/afa520ab34803c82587ea6759bfd352579f741b4/llvm/lib/Target/RISCV/RISCVTargetMachine.cpp#L90 +*/ + abi_metadata = LLVMMDStringInContext2( + LLVMGetModuleContext(context->module), + abiname, strlen(abiname)); + LLVMAddModuleFlag(context->module, LLVMModuleFlagBehaviorOverride, + target_abi, strlen(target_abi), abi_metadata); +#endif } return context->module; @@ -786,6 +825,8 @@ llvm_session_initialize(void) char *error = NULL; char *cpu = NULL; char *features = NULL; + LLVMRelocMode reloc=LLVMRelocDefault; + LLVMCodeModel codemodel=LLVMCodeModelJITDefault; LLVMTargetMachineRef opt0_tm; LLVMTargetMachineRef opt3_tm; @@ -820,16 +861,21 @@ llvm_session_initialize(void) elog(DEBUG2, "LLVMJIT detected CPU \"%s\", with features \"%s\"", cpu, features); +#ifdef __riscv + reloc=LLVMRelocPIC; + codemodel=LLVMCodeModelMedium; +#endif + opt0_tm = LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, features, LLVMCodeGenLevelNone, - LLVMRelocDefault, - LLVMCodeModelJITDefault); + reloc, + codemodel); opt3_tm = 
LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, features, LLVMCodeGenLevelAggressive, - LLVMRelocDefault, - LLVMCodeModelJITDefault); + reloc, +
Re: pg_basebackup, walreceiver and wal_sender_timeout
On Mon, Jan 28, 2019, at 10:25, Michael Paquier wrote: > On Mon, Jan 28, 2019 at 09:05:26AM +0100, Magnus Hagander wrote: > > And for plain format, we'd do the same -- sync after each file segment, and > > then a final one of the directory when done, right? > > Well, the code is doing a double amount of work in its current shape > as we call fsync_pgdata() for the plain format, which cascades to > pg_wal and all its files, so it seems to me that there is little point > in issuing a sync when each segment is finished streaming if that's > what you mean. Agreed. While reading the doc page for the pg_basebackup, I've been confused by the fact that it says WAL files will be written to .tarballs (either base.tar or pg_wal.tar) when pg_basebackup is instructed to stream WALs alongside the backup itself. I think it makes sense to elaborate that it only happens when tar format is specified (doc patch is attached). Cheers, Oleksii diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index 57dc83b620..c4f3950e5b 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -306,8 +306,8 @@ PostgreSQL documentation backup will fail and be unusable. -The write-ahead log files will be written to -the base.tar file. +When tar format mode is used, the write-ahead log files will be +written to the base.tar file. @@ -326,9 +326,10 @@ PostgreSQL documentation requires no extra write-ahead logs to be saved on the master. -The write-ahead log files are written to a separate file -named pg_wal.tar (if the server is a version -earlier than 10, the file will be named pg_xlog.tar). +When tar format mode is used, the write-ahead log files will be +written to a separate file named pg_wal.tar +(if the server is a version earlier than 10, the file will be named +pg_xlog.tar). This value is the default.
bad wal on replica / incorrect resource manager data checksum in record / zfs
Hello Postgres Hackers - We are having a reoccurring issue on 2 of our replicas where replication stops due to this message: "incorrect resource manager data checksum in record at ..." This has been occurring on average once every 1 to 2 weeks during large data imports (100s of GBs being written) on one of two replicas. Fixing the issue has been relatively straight forward: shutdown replica, remove the bad wal file, restart replica and the good wal file is retrieved from the master. We are doing streaming replication using replication slots. However twice now, the master had already removed the WAL file so the file had to retrieved from the wal archive. The WAL log directories on the master and the replicas are on ZFS file systems. All servers are running RHEL 7.7 (Maipo) PostgreSQL 10.11 ZFS v0.7.13-1 The issue seems similar to https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com and to https://github.com/timescale/timescaledb/issues/1443 One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS sees our array as a single device. Right before the issue started we did some upgrades and altered some postgres configs and ZFS settings. We have been slowly rolling back changes but so far the the issue continues. Some interesting data points while debugging: We had lowered the ZFS recordsize from 128K to 32K and for that week the issue started happening every other day. Using xxd and diff we compared "good" and "bad" wal files and the differences were not random bad bytes. The bad file either had a block of zeros that were not in the good file at that position or other data. Occasionally the bad data has contained legible strings not in the good file at that position. At least one of those exact strings has existed elsewhere in the files. However I am not sure if that is the case for all of them. This made me think that maybe there was an issue w/ wal file recycling and ZFS under heavy load, so we tried lowering min_wal_size in order to "discourage" wal file recycling but my understanding is a low value discourages recycling but it will still happen (unless setting wal_recycle in psql 12). There is a third replica where this bug has not (yet?) surfaced. This leads me to guess the bad data does not originate on the master. This replica is older than the other replicas, slower CPUs, less RAM, and the WAL disk array is spinning disks. The OS, version of Postgres, and version of ZFS are the same as the other replicas. This replica is not using a replication slot. This replica does not serve users so load/contention is much lower than the others. The other replicas often have 100% utilization of the disk array that houses the (non-wal) data. Any insight into the source of this bug or how to address it? Since the master has a good copy of the WAL file, can the replica re-request the file from the master? Or from the archive? When using replication slots, what circumstances would cause the master to not save the WAL file? (I can't remember if it always had the next wal file or the one after that) Thanks in advance, Alex Malek
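A small aside for readers hitting the same symptom (this is not part of the original report): the stall point can be located from SQL before touching any files. The LSN used below is the one from the log line quoted earlier in this thread; substitute the LSN from your own error message.

  -- On the stalled replica: how far WAL has been received vs. replayed.
  -- A replay position that stops advancing while the receive position
  -- keeps moving points at a record the startup process cannot apply.
  SELECT pg_last_wal_receive_lsn() AS received,
         pg_last_wal_replay_lsn()  AS replayed;

  -- On the primary (pg_walfile_name() cannot run during recovery): map
  -- the LSN from the error message to the WAL segment file holding it.
  SELECT pg_walfile_name('39002/57AC0338');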
Fwd: bad wal on replica / incorrect resource manager data checksum in record / zfs
On Thu, Feb 20, 2020, 6:16 AM Amit Kapila wrote: > On Thu, Feb 20, 2020 at 3:06 AM Alex Malek wrote: > > > > > > Hello Postgres Hackers - > > > > We are having a reoccurring issue on 2 of our replicas where replication > stops due to this message: > > "incorrect resource manager data checksum in record at ..." > > This has been occurring on average once every 1 to 2 weeks during large > data imports (100s of GBs being written) > > on one of two replicas. > > Fixing the issue has been relatively straight forward: shutdown replica, > remove the bad wal file, restart replica and > > the good wal file is retrieved from the master. > > We are doing streaming replication using replication slots. > > However twice now, the master had already removed the WAL file so the > file had to retrieved from the wal archive. > > > > The WAL log directories on the master and the replicas are on ZFS file > systems. > > All servers are running RHEL 7.7 (Maipo) > > PostgreSQL 10.11 > > ZFS v0.7.13-1 > > > > The issue seems similar to > https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com > and to https://github.com/timescale/timescaledb/issues/1443 > > > > One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS > sees our array as a single device. > > > > Right before the issue started we did some upgrades and altered some > postgres configs and ZFS settings. > > We have been slowly rolling back changes but so far the the issue > continues. > > > > Some interesting data points while debugging: > > We had lowered the ZFS recordsize from 128K to 32K and for that week the > issue started happening every other day. > > Using xxd and diff we compared "good" and "bad" wal files and the > differences were not random bad bytes. > > > > The bad file either had a block of zeros that were not in the good file > at that position or other data. Occasionally the bad data has contained > legible strings not in the good file at that position. At least one of > those exact strings has existed elsewhere in the files. > > However I am not sure if that is the case for all of them. > > > > This made me think that maybe there was an issue w/ wal file recycling > and ZFS under heavy load, so we tried lowering > > min_wal_size in order to "discourage" wal file recycling but my > understanding is a low value discourages recycling but it will still > > happen (unless setting wal_recycle in psql 12). > > > > We do print a message "recycled write-ahead log file .." in DEBUG2 > mode. You either want to run the server with DEBUG2 or maybe change > the code to make it LOG and see if that is printed. If you do that, > you can verify if the corrupted WAL is the same as a recycled one. > Are you suggesting having the master, the replicas or all in debug mode? How much extra logging would this generate? A replica typically consumes over 1 TB of WAL files before a bad wal file is encountered. > > There is a third replica where this bug has not (yet?) surfaced. > > This leads me to guess the bad data does not originate on the master. > > This replica is older than the other replicas, slower CPUs, less RAM, > and the WAL disk array is spinning disks. > > The OS, version of Postgres, and version of ZFS are the same as the > other replicas. > > This replica is not using a replication slot. > > This replica does not serve users so load/contention is much lower than > the others. > > The other replicas often have 100% utilization of the disk array that > houses the (non-wal) data. 
> > > > Any insight into the source of this bug or how to address it? > > > > Since the master has a good copy of the WAL file, can the replica > re-request the file from the master? Or from the archive? > > > > I think we do check in the archive if we get the error during > streaming, but archive might also have the same data due to which this > problem happens. Have you checked that the archive WAL file, is it > different from the bad WAL? See the Typically the master, the archive and the other replicas all have a good copy of the WAL file. relevant bits of code in > WaitForWALToBecomeAvailable especially the code near below comment: > > "Failure while streaming. Most likely, we got here because streaming > replication was terminated, or promotion was triggered. But we also > get here if we find an invalid record in the WAL streamed from master, > in which case something is seriously wrong. There's little chance that > the problem will jus
Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
On Thu, Feb 20, 2020 at 12:01 PM Alex Malek wrote: > On Thu, Feb 20, 2020, 6:16 AM Amit Kapila wrote: > >> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek wrote: >> > >> > >> > Hello Postgres Hackers - >> > >> > We are having a reoccurring issue on 2 of our replicas where >> replication stops due to this message: >> > "incorrect resource manager data checksum in record at ..." >> > This has been occurring on average once every 1 to 2 weeks during large >> data imports (100s of GBs being written) >> > on one of two replicas. >> > Fixing the issue has been relatively straight forward: shutdown >> replica, remove the bad wal file, restart replica and >> > the good wal file is retrieved from the master. >> > We are doing streaming replication using replication slots. >> > However twice now, the master had already removed the WAL file so the >> file had to retrieved from the wal archive. >> > >> > The WAL log directories on the master and the replicas are on ZFS file >> systems. >> > All servers are running RHEL 7.7 (Maipo) >> > PostgreSQL 10.11 >> > ZFS v0.7.13-1 >> > >> > The issue seems similar to >> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com >> and to https://github.com/timescale/timescaledb/issues/1443 >> > >> > One quirk in our ZFS setup is ZFS is not handling our RAID array, so >> ZFS sees our array as a single device. >> > >> > Right before the issue started we did some upgrades and altered some >> postgres configs and ZFS settings. >> > We have been slowly rolling back changes but so far the the issue >> continues. >> > >> > Some interesting data points while debugging: >> > We had lowered the ZFS recordsize from 128K to 32K and for that week >> the issue started happening every other day. >> > Using xxd and diff we compared "good" and "bad" wal files and the >> differences were not random bad bytes. >> > >> > The bad file either had a block of zeros that were not in the good file >> at that position or other data. Occasionally the bad data has contained >> legible strings not in the good file at that position. At least one of >> those exact strings has existed elsewhere in the files. >> > However I am not sure if that is the case for all of them. >> > >> > This made me think that maybe there was an issue w/ wal file recycling >> and ZFS under heavy load, so we tried lowering >> > min_wal_size in order to "discourage" wal file recycling but my >> understanding is a low value discourages recycling but it will still >> > happen (unless setting wal_recycle in psql 12). >> > >> >> We do print a message "recycled write-ahead log file .." in DEBUG2 >> mode. You either want to run the server with DEBUG2 or maybe change >> the code to make it LOG and see if that is printed. If you do that, >> you can verify if the corrupted WAL is the same as a recycled one. >> > > Are you suggesting having the master, the replicas or all in debug mode? > How much extra logging would this generate? > A replica typically consumes over 1 TB of WAL files before a bad wal file > is encountered. > > > >> > There is a third replica where this bug has not (yet?) surfaced. >> > This leads me to guess the bad data does not originate on the master. >> > This replica is older than the other replicas, slower CPUs, less RAM, >> and the WAL disk array is spinning disks. >> > The OS, version of Postgres, and version of ZFS are the same as the >> other replicas. >> > This replica is not using a replication slot. 
>> > This replica does not serve users so load/contention is much lower than >> the others. >> > The other replicas often have 100% utilization of the disk array that >> houses the (non-wal) data. >> > >> > Any insight into the source of this bug or how to address it? >> > >> > Since the master has a good copy of the WAL file, can the replica >> re-request the file from the master? Or from the archive? >> > >> >> I think we do check in the archive if we get the error during >> streaming, but archive might also have the same data due to which this >> problem happens. Have you checked that the archive WAL file, is it >> different from the bad WAL? See the > > > Typically the master, the archive and the other replicas all have a good > copy of
set parameter for all existing session
I checked the "ALTER DATABASE", "ALTER ROLE" and "SET" commands, but none of them can set a parameter for all the existing sessions. Do we have a way to do that? It looks like the "assign_hook" can be used to customize this; is that the right way to do it?
Re: set parameter for all existing session
On Wed, Jun 12, 2019 at 4:25 PM Pavel Stehule wrote: > Hi > > st 12. 6. 2019 v 9:58 odesílatel alex lock napsal: > >> I check the “alter database, alter role " and "set " command, but none of >> them can set the parameters to all the existing sessions. do we have a >> way to do that? looks the "assign_hook" can be used to customize this, is >> it a right way to do that? >> >> > Maybe you miss to call pg_reload_conf(); > > example: > > alter system set work_mem to '10MB'; > select pg_reload_conf(); > Thanks, it works! > > in other session you can: > > show work_mem; > > Regards > > Pavel >
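For completeness (this is not part of the original exchange), a quick way to confirm from an already-open session that the reloaded value took effect, and that the parameter is not one that would instead require a restart:

  -- In any already-open session, after pg_reload_conf():
  SHOW work_mem;

  -- Check the value the server is using and whether a restart is still
  -- pending for it (pending_restart stays false for reloadable
  -- parameters such as work_mem).
  SELECT name, setting, unit, source, pending_restart
  FROM   pg_settings
  WHERE  name = 'work_mem';

Note that a session which has explicitly run SET work_mem keeps its session-local value; the reload only affects sessions that are still relying on the default.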
'tuple concurrently updated' error w/o visible catalog updates
Hello, Earlier this week we have split our Postgres 9.6.8 shards, each having two databases, into one database per shard setup. This was done by promoting replicas and subsequently removing unused databases. Immediately afterwards we have discovered repeated 'tuple concurrently updated' errors on most of those new shards. The error is always shown at the same statement: ERROR,XX000,"tuple concurrently updated","SQL statement ""UPDATE config_content SET cc_content = l_config_content WHERE cc_config_content_id = l_ccm_content_id"" By searching the archives (i.e. https://www.postgresql.org/messageid/flat/CAB7nPqSZCkVfibTvx9TYmHYhVtV_vOMNwOpLHnRU85qeiimUaQ%40mail.gmail.com#cab7npqszckvfibtvx9tymhyhvtv_vomnwoplhnru85qeiim...@mail.gmail.com) I’ve got an impression that this error manifests itself when system catalog tuples are updated concurrently, however I see none of that in the query that leads to an ERROR. There are no triggers on 'config_content' table, neither there are any views referring to it. The errors stopped when we disabled a call to the 'upsert_foo_content' function (here and below I obfuscated real names). This is a fairly simple pl/pgsql function that does a few selects and an upsert. The block inside that function that contains the statement at fault is: -- SELECT ccm_content_id, ccm_simple_update_received_at INTO l_ccm_content_id, l_ccm_simple_update_received_at FROM config_content_metadata WHERE ccm_config_id = l_c_id AND ccm_sales_channel_id = l_sales_channel_id; IF (l_ccm_content_id IS NULL) THEN -- insert config content -- INSERT INTO config_content_metadata(ccm_config_id, ccm_sales_channel_id, ccm_update_caused_by, ccm_simple_update_eid, ccm_simple_update_received_at) VALUES(l_c_id, l_sales_channel_id, l_rp_id, l_content_update_eid, l_content_update_received_at) RETURNING ccm_content_id INTO l_ccm_content_id; INSERT INTO config_content(cc_config_content_id, cc_content) VALUES (l_ccm_content_id, l_config_content); ELSIF (l_ccm_simple_update_received_at < l_content_update_received_at) THEN UPDATE config_content_metadata SET ccm_update_caused_by = l_rp_id, ccm_simple_update_eid = l_content_update_eid, ccm_simple_update_received_at = l_content_update_received_at, ccm_updated_at = now() WHERE ccm_content_id = l_ccm_content_id; -- XXX problematic statement XXX UPDATE config_content SET cc_content = l_config_content WHERE cc_config_content_id = l_ccm_content_id; END IF; -- Note that config_content references config_metdata with a foreign key, however, the referenced column is not updated. That 'upsert_foo_content' is called by another one, upsert_foo_content_batch, in a loop over the elements of a JSON array, something like: -- CREATE OR REPLACE FUNCTION upsert_foo_content_batch(p_batch jsonb) RETURN void LANGUAGE plpgpsql AS $function$ DECLARE ... BEGIN FOR item IN SELECT * FROM jsonb_array_elements(p_batch) LOOP -- some unpacking of fields from json into the local variables PERFORM upsert_foo_content(..) -- called with the unpacked variables END LOOP; END; $function$ -- 'upsert_foo_content_batch' is called, in order, at the end of a long pl/pgsql function 'upsert_foo_event_batch', which consists of a very long CTE that extracts individual fields from a JSON argument, and then performs a number of inserts into some tables, doing on conflict do nothing, afterwards performing more inserts into the tables that reference the previous ones, doing on conflict do update. However, it modifies neither 'config_content' or 'config_content_metadata' tables. 
So the chain of calls is 'upsert_foo_event_batch' -> 'upsert_foo_content_batch' -> 'upsert_foo_content' (the last one contains the statement that leads to the "tuple concurrently updated" error). It is possible that the 'upsert_foo_content' function is called with the same data multiple times in different processes; however, I'd expect it to either complete successfully or throw an error because the PK already exists (this is running in read committed mode, so ISTM it is not immune to the case where the row in the metadata table is inserted after another session does the check but before the insert), but not the error mentioned at the beginning of this message. Are there any signs in this description that the queries might be doing something unexpected to PostgreSQL, or that something went wrong during the split? I am running out of ideas about what could cause the issue, so any pointers or help in debugging it are appreciated (note that this is a production database, I cannot just stop it at will). Cheers, Oleksii
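For illustration only (this is not from the original report, and it is not claimed to explain the "tuple concurrently updated" error itself): the SELECT-then-INSERT-or-UPDATE block quoted above could in principle be collapsed into a single atomic statement, so that two concurrent callers cannot both decide to insert. The table and column names are the obfuscated ones from the message, the l_* identifiers are the plpgsql function's local variables, and the existence of a unique constraint on (ccm_config_id, ccm_sales_channel_id) is an assumption.

  -- Sketch of an atomic insert-or-update of the metadata row (plpgsql,
  -- like the quoted function). RETURNING yields no row when the WHERE
  -- clause suppresses the update, so l_ccm_content_id may stay NULL.
  INSERT INTO config_content_metadata AS m
         (ccm_config_id, ccm_sales_channel_id, ccm_update_caused_by,
          ccm_simple_update_eid, ccm_simple_update_received_at)
  VALUES (l_c_id, l_sales_channel_id, l_rp_id,
          l_content_update_eid, l_content_update_received_at)
  ON CONFLICT (ccm_config_id, ccm_sales_channel_id) DO UPDATE
     SET ccm_update_caused_by          = EXCLUDED.ccm_update_caused_by,
         ccm_simple_update_eid         = EXCLUDED.ccm_simple_update_eid,
         ccm_simple_update_received_at = EXCLUDED.ccm_simple_update_received_at,
         ccm_updated_at                = now()
   WHERE m.ccm_simple_update_received_at < EXCLUDED.ccm_simple_update_received_at
  RETURNING ccm_content_id INTO l_ccm_content_id;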
RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager
-Original Message- From: Robert Haas Sent: Thursday, April 26, 2018 10:25 PM To: Andres Freund Cc: Masahiko Sawada ; Michael Paquier ; Mithun Cy ; Tom Lane ; Thomas Munro ; Amit Kapila ; PostgreSQL-development Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund wrote: >> I think the real question is whether the scenario is common enough to >> worry about. In practice, you'd have to be extremely unlucky to be >> doing many bulk loads at the same time that all happened to hash to >> the same bucket. > > With a bunch of parallel bulkloads into partitioned tables that really > doesn't seem that unlikely? It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad. For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time. It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Hello! I want to try to test this patch on 302(704 ht) core machine. Patching on master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is ok but got some error while compiling : gistvacuum.c: In function ‘gistvacuumcleanup’: gistvacuum.c:92:3: error: too many arguments to function ‘LockRelationForExtension’ LockRelationForExtension(rel, ExclusiveLock); ^ In file included from gistvacuum.c:21:0: ../../../../src/include/storage/extension_lock.h:30:13: note: declared here extern void LockRelationForExtension(Relation relation); ^ gistvacuum.c:95:3: error: too many arguments to function ‘UnlockRelationForExtension’ UnlockRelationForExtension(rel, ExclusiveLock); ^ In file included from gistvacuum.c:21:0: ../../../../src/include/storage/extension_lock.h:31:13: note: declared here extern void UnlockRelationForExtension(Relation relation); -- Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager
-- Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -Original Message- From: Alex Ignatov Sent: Monday, May 21, 2018 6:00 PM To: 'Robert Haas' ; 'Andres Freund' Cc: 'Masahiko Sawada' ; 'Michael Paquier' ; 'Mithun Cy' ; 'Tom Lane' ; 'Thomas Munro' ; 'Amit Kapila' ; 'PostgreSQL-development' Subject: RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager -Original Message- From: Robert Haas Sent: Thursday, April 26, 2018 10:25 PM To: Andres Freund Cc: Masahiko Sawada ; Michael Paquier ; Mithun Cy ; Tom Lane ; Thomas Munro ; Amit Kapila ; PostgreSQL-development Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund wrote: >> I think the real question is whether the scenario is common enough to >> worry about. In practice, you'd have to be extremely unlucky to be >> doing many bulk loads at the same time that all happened to hash to >> the same bucket. > > With a bunch of parallel bulkloads into partitioned tables that really > doesn't seem that unlikely? It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad. For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time. It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company Hello! I want to try to test this patch on 302(704 ht) core machine. Patching on master (commit 81256cd05f0745353c6572362155b57250a0d2a0) is ok but got some error while compiling : gistvacuum.c: In function ‘gistvacuumcleanup’: gistvacuum.c:92:3: error: too many arguments to function ‘LockRelationForExtension’ LockRelationForExtension(rel, ExclusiveLock); ^ In file included from gistvacuum.c:21:0: ../../../../src/include/storage/extension_lock.h:30:13: note: declared here extern void LockRelationForExtension(Relation relation); ^ gistvacuum.c:95:3: error: too many arguments to function ‘UnlockRelationForExtension’ UnlockRelationForExtension(rel, ExclusiveLock); ^ In file included from gistvacuum.c:21:0: ../../../../src/include/storage/extension_lock.h:31:13: note: declared here extern void UnlockRelationForExtension(Relation relation); Sorry, forgot to mention that patch version is extension-lock-v12.patch -- Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: GSoC 2018
Hello, On Fri, Dec 15, 2017, at 14:30, Stephen Frost wrote: > Aleksander, > > * Aleksander Alekseev (a.aleks...@postgrespro.ru) wrote: > > > Regarding the difficulty of the project - in fact it's not that > > difficult. Particularly this project can rely on external tools, e.g. > > use Consul for service discovery and leader election based on > > leader-lease approach (implementation [1]). Having this the only thing > > is missing is automatic replica promoting and (optionally) > > re-configuring of HAProxy / pgbouncer / whatever. Yes, and lots of > > Jepsen-like test of course. I believe it's not such a complicated > > project. Does it make sense to address the limitations of the logical replication first, i.e. inability to replicate DDL, sequences and so on? > > What you're talking about is rebuilding Patroni, but adding more into it > than even Patroni tries to do, building it on Logical Replication > instead of physical replication, and calling it simple and something > that could get into core PostgreSQL over a 12 week GSoC project. I've > certainly got doubts about that, even if we decide that it'd be an > external-to-PG project (like Patroni). > > What might be interesting is seeing if Logical Replication could be > added to Patroni as an option and then building on that.. Having > someone involved in the Patroni project would be the right way to go > about proposing that though to see what they think of it. That would > also be much more sensible as a GSoC project, since it'd be an addition > to an existing project and not more-or-less starting a whole new > project. Right now logical replication and physical replication-based HA tools don't work together nicely, since logical replication position is not propagated to the promoted replica (I think Craig Ringer has been tackling this issue for a few releases already, the latest set of patches I could find is https://commitfest.postgresql.org/15/788/). Perhaps there is opportunity for a GSoC student to help fixing it. Until then we cannot use logical replication for HA, and even doing something simpler like automating creation of logical replicas in Patroni makes little sense, as they are doomed to be reinitialized on every failover. -- Sincerely, Alex
Re: GSoC 2018
On Fri, Dec 15, 2017, at 14:52, Aleksander Alekseev wrote: > Completely agree, this project can be an improvement for Stolon (or > Patroni, but I personally never tested or used it, also I got a feeling > that Google guys will prefer a project that is written in Go). This > would make much more sense. I don't believe Google will reject a project based on the fact that it is written in Python (in fact, Python Software Foundation has successfully participated in GSoC for many years). Based on the github statistics, Patroni has started earlier and has more contributors than Stolon (including those contributed more than one patch/pull-request.) -- Sincerely, Alex
Re: Estimate maintenance_work_mem for CREATE INDEX
On Tue, Dec 19, 2017 at 3:15 PM Greg Stark wrote: > On 19 December 2017 at 10:00, Oleksandr Shulgin > wrote: > > > If there would be an option in the database itself to provide those > > estimation, we wouldn't even need to figure out estimation queries. > > "EXPLAIN CREATE INDEX" anyone? > > You're not the first to propose something like that. I think an > EXPLAIN ALTER TABLE would also be very handy -- it's currently > impossible to tell without carefully reading the source code whether a > given DDL change will require a full table scan, a full table rewrite, > or just a quick meta data update (and even in that case what strength > lock will be required). I think there are other utility statements > that make interesting heuristic decisions that would be nice to be > able to have some visibility into -- CLUSTER comes to mind. > Yes, that would be pretty handy. > I'm not clear how you would determine how much memory is needed to > sort a table without actually doing the sort though. So that would be > more of an EXPLAIN ANALYZE wouldn't it? > My idea would be to use statistics, so that EXPLAIN CREATE INDEX (or whatever the actual interface could look like) would benefit from up-to-date statistics produced by ANALYZE. Based on the estimated number of rows in the table, the average width of the column(s) to index, and taking into account the bookkeeping structures, one should be able to arrive at a good guess for the amount of memory the backend would end up allocating (assuming it is available). Having done that, as the first step, and using statistics again, we could also infer (though probably with less accuracy) the memory requirements for building partial indexes. Functional indexes would be harder to tackle; I would think this is only possible if the return type(s) of the function(s) all have a fixed width. I didn't look in the code, but I imagine the procedure to read -> sort -> spill to tapes, if needed -> merge sort the tapes is generic to all index types, so this shouldn't be a breaking change for any user-defined indexes (is this already a thing?). OK, maybe it's only generic for B-Tree and BRIN, but not for GIN and GiST, to name a few. Damn, I gotta look in the code at some point. ;-) To let me fantasize a little more, what I would also love to see is the estimated on-disk size for the resulting index, before starting to create it. This is obviously dependent on the actual index type and options, such as fill-factor, etc. Cheers, -- Alex
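To make the idea concrete, here is the kind of back-of-the-envelope estimate being described, expressed against the existing planner statistics. This is purely illustrative: the table name t, the column name a, and the 24-byte per-tuple sorting overhead are assumptions, it ignores schemas, expression and partial indexes, and it requires the table to have been ANALYZEd.

  -- Rough guess at the memory needed to sort column "a" of table "t"
  -- for CREATE INDEX: estimated row count times (average column width
  -- plus an assumed per-tuple sorting overhead).
  SELECT c.reltuples::bigint AS est_rows,
         s.avg_width         AS avg_width,
         pg_size_pretty((c.reltuples * (s.avg_width + 24))::bigint)
                             AS est_sort_memory
  FROM   pg_class c
  JOIN   pg_stats s ON s.tablename = c.relname
  WHERE  c.relname = 't'
    AND  s.attname = 'a';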
Re: Doc fix of aggressive vacuum threshold for multixact members storage
Hi Sami, Thanks for the feedback. > 1/ Remove this as > "(50% of the maximum, which is about 20GB)," > > [1] tried to avoid explaining this level of detail, and I > agree with that. I feel it is critical for users to know what the hard limit on multixact members is. As PG doesn't (yet) expose how many multixact members are in use, the only way for users to know the distance to members wraparound is by monitoring the members directory space usage. So it seems to me that the 20 GB number is very important to have in the docs. > 2/ c/"about 10GB"/"10GB" the "about" does not seem necessary here. The threshold is actually ~10.015 GiB (due to the 12 bytes wasted per 8KB page), or ~10.75 GB, so to avoid confusing users when aggressive autovacuum doesn't trigger exactly at 10GB, I believe we should either be exact, or say that we are not being exact. Being exact is difficult, as it depends on the block size. And as I looked through the doc page in question, I noticed there are already several cases using the "about" wording, e.g. "about 50MB of pg_xact storage" and "about 2GB of pg_commit_ts storage", so here I went for consistency with the rest of the doc. Thanks, Alex
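For anyone wanting to track that distance from SQL rather than from the filesystem, something along these lines should work. This is a sketch: it simply sums the on-disk size of the members segment files, and pg_ls_dir()/pg_stat_file() are superuser-only unless access has been explicitly granted.

  -- Approximate multixact members storage in use, to compare against
  -- the ~10GB aggressive-vacuum threshold and ~20GB wraparound limit
  -- discussed in this thread.
  SELECT pg_size_pretty(sum((pg_stat_file('pg_multixact/members/' || f)).size))
         AS members_storage_in_use
  FROM   pg_ls_dir('pg_multixact/members') AS f;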
Re: Doc fix of aggressive vacuum threshold for multixact members storage
A few paragraphs up the docs, there is this mention: ". There is a separate storage area which holds the list of members in each multixact, which also uses a 32-bit counter and which must also be managed." Maybe we can add more to this paragraph, such as: "also be managed. This member can grow to 20GB" And then in the proposed correction: " Also, if the storage occupied by multixacts members area exceeds 10GB (50% of the maximum the members area can grow), aggressive vacuum scans will occur more often for all tables " What do you think? Looks good to me, attached a v2 patch with small adjustments. Thanks, AlexFrom 3deda711bb4219089b32204c567e735b3d7a152b Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Thu, 30 Jan 2025 17:19:07 +0200 Subject: [PATCH v2] Doc fix of aggressive vacuum threshold for multixact members storage. --- doc/src/sgml/maintenance.sgml | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 0be90bdc7ef..f4f560bccc1 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -761,7 +761,8 @@ HINT: Execute a database-wide VACUUM in that database. careful aging management, storage cleanup, and wraparound handling. There is a separate storage area which holds the list of members in each multixact, which also uses a 32-bit counter and which must also - be managed. + be managed. This members storage area can grow up to about 20GB before + reaching wraparound. @@ -792,9 +793,9 @@ HINT: Execute a database-wide VACUUM in that database. As a safety device, an aggressive vacuum scan will occur for any table whose multixact-age is greater than . Also, if the - storage occupied by multixacts members exceeds 2GB, aggressive vacuum - scans will occur more often for all tables, starting with those that + linkend="guc-autovacuum-multixact-freeze-max-age"/>. Also, if the storage occupied + by multixacts members exceeds about 10GB (50% of the maximum the members area can grow), + aggressive vacuum scans will occur more often for all tables, starting with those that have the oldest multixact-age. Both of these kinds of aggressive scans will occur even if autovacuum is nominally disabled. -- 2.41.0
Doc fix of aggressive vacuum threshold for multixact members storage
Hi, This patch suggests a correction to the doc page dealing with multixact vacuuming, which, starting with PG 14, says that the multixact members storage threshold for aggressive vacuum is 2 GB. However, I believe the threshold is actually about 10 GB. MultiXactMemberFreezeThreshold() defines the threshold as 2^32 (0xFFFFFFFF) / 2, or 2^31 multixact members. However, as discussed in multixact.c, multixact members are stored in groups of 4, each group taking up 20 bytes, meaning 5 bytes per member. (This is not quite exact, as 12 bytes per 8 KB page are wasted, but I believe it is close enough for the docs.) This makes the threshold in bytes 2^31 multixact members * 5 bytes per member = 10 GiB. It was also confirmed by observing a live system (with an admittedly unfortunate workload pattern). Also, the maximum storage size for multixact members is 20 GiB (2^32 * 5), and it would be useful to call this out in the doc as well. For reference, the original commit which introduced the current wording is c552e17, and the discussion was here: https://www.postgresql.org/message-id/flat/162395467510.686.11947486273299446208%40wrigleys.postgresql.org The attached patch is against master, but it should probably be backpatched all the way through 14. Best regards, Alex Friedman v1-0001-Doc-fix-of-aggressive-vacuum-threshold-for-multix.patch Description: Binary data
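The arithmetic from the paragraph above, spelled out so it is easy to re-check (this is just the calculation from the message expressed in SQL, not part of the patch):

  -- 5 bytes per multixact member (groups of 4 members per 20-byte
  -- group), 2^31 members at the aggressive-vacuum threshold and 2^32
  -- at wraparound.
  SELECT pg_size_pretty((2::numeric ^ 31 * 5)::bigint) AS aggressive_vacuum_threshold,
         pg_size_pretty((2::numeric ^ 32 * 5)::bigint) AS members_storage_maximum;
  -- aggressive_vacuum_threshold | members_storage_maximum
  -- ----------------------------+------------------------
  --  10 GB                      | 20 GB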
Re: Doc fix of aggressive vacuum threshold for multixact members storage
Hi John, Thanks for reviewing. It seems at a minimum this one-line patch is sufficient for the correction: - storage occupied by multixacts members exceeds 2GB, aggressive vacuum + storage occupied by multixacts members exceeds about 10GB, aggressive vacuum Commit c552e171d16e removed the percentage as part of a judgment call on clarity, and I'm not sure that was wrong. We could add the proposed language on "can grow up to about 20GB" at the end of this paragraph, which seems more natural -- first mention the amount that triggers aggressive vacuum, then the maximum size. Yes, I believe this can work. I'm on the fence about putting a hint in the C file, but the computation has changed in the past, see commit b4d4ce1d50bbdf , so it's a reasonable idea. That's a good find about the change. Taken together with Bertrand's comments, I've added two reminders to multixact.c to update the docs, one for the threshold and another for the multixact storage scheme. Please see if it makes sense. v4 patch attached. Best regards, Alex Friedman From 0965413dbb0b85e4dd78f87a6ca3847dccdc78c7 Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Thu, 30 Jan 2025 17:19:07 +0200 Subject: [PATCH v4] Doc fix of aggressive vacuum threshold for multixact members storage. --- doc/src/sgml/maintenance.sgml | 5 +++-- src/backend/access/transam/multixact.c | 7 +-- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml index 0be90bdc7ef..89040942be2 100644 --- a/doc/src/sgml/maintenance.sgml +++ b/doc/src/sgml/maintenance.sgml @@ -793,10 +793,11 @@ HINT: Execute a database-wide VACUUM in that database. As a safety device, an aggressive vacuum scan will occur for any table whose multixact-age is greater than . Also, if the - storage occupied by multixacts members exceeds 2GB, aggressive vacuum + storage occupied by multixacts members exceeds about 10GB, aggressive vacuum scans will occur more often for all tables, starting with those that have the oldest multixact-age. Both of these kinds of aggressive - scans will occur even if autovacuum is nominally disabled. + scans will occur even if autovacuum is nominally disabled. The members storage + area can grow up to about 20GB before reaching wraparound. diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 27ccdf9500f..66adb9995d0 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -134,7 +134,8 @@ MultiXactIdToOffsetSegment(MultiXactId multi) * corresponding 4 Xids. Each such 5-word (20-byte) set we call a "group", and * are stored as a whole in pages. Thus, with 8kB BLCKSZ, we keep 409 groups * per page. This wastes 12 bytes per page, but that's OK -- simplicity (and - * performance) trumps space efficiency here. + * performance) trumps space efficiency here. If this computation changes, make + * sure to update the documentation. * * Note that the "offset" macros work with byte offset, not array indexes, so * arithmetic must be done using "char *" pointers. @@ -212,7 +213,9 @@ MXOffsetToMemberOffset(MultiXactOffset offset) member_in_group * sizeof(TransactionId); } -/* Multixact members wraparound thresholds. */ +/* Multixact members wraparound thresholds. + * When changing the thresholds, make sure to update the documentation. + */ #define MULTIXACT_MEMBER_SAFE_THRESHOLD(MaxMultiXactOffset / 2) #define MULTIXACT_MEMBER_DANGER_THRESHOLD \ (MaxMultiXactOffset - MaxMultiXactOffset / 4) -- 2.41.0
Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity
Hi, This small doc change patch is following up on a past discussion about discrepancies between state and wait_event in pg_stat_activity: https://www.postgresql.org/message-id/flat/ab1c0a7d-e789-5ef5-1180-42708ac6fe2d%40postgrespro.ru As this kind of question is raised by PG users from time to time, the goal is to clarify that such discrepancies are to be expected. The attached patch reuses Robert Haas's eloquent wording from his response in the above thread. I've tried to keep it short and to the point, but it can be made more verbose if needed. Best regards, Alex FriedmanFrom 3cab620d67d200ff4ccb1870f63cbf75a50d0df6 Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Wed, 26 Feb 2025 19:59:59 +0200 Subject: [PATCH v1] Clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity. --- doc/src/sgml/monitoring.sgml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9178f1d34ef..57fcd8ab52b 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1016,7 +1016,9 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser it may or may not be waiting on some event. If the state is active and wait_event is non-null, it means that a query is being executed, but is being blocked somewhere -in the system. +in the system. To keep the reporting low-overhead, the system uses very lightweight +synchronization. As a result, ephemeral discrepancies between wait_event +and state are possible by nature. -- 2.41.0
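As an illustration of what readers of that paragraph are looking at (a hypothetical query, not part of the patch): the columns in question can be inspected side by side, and an occasional row in which they look mutually inconsistent is exactly the ephemeral discrepancy being documented.

  -- Point-in-time snapshot of backend activity: because the columns are
  -- reported without heavy synchronization, a row can occasionally show,
  -- for example, state = 'active' together with a stale or null
  -- wait_event.
  SELECT pid, state, wait_event_type, wait_event, query_start,
         left(query, 60) AS query
  FROM   pg_stat_activity
  WHERE  backend_type = 'client backend'
  ORDER  BY state, pid;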
Re: Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity
On 26/02/2025 22:00, Sami Imseih wrote: If we do need to document anything, which I am not convinced we should, it should be more generic. Thanks for the feedback, I've attached a v2 patch which has wording that's a bit more generic. It's also worth noting that pg_locks already has a full paragraph explaining inconsistencies, so in my opinion it's worth it at least mentioning something similar here for pg_stat_activity. Best regards, Alex FriedmanFrom fbbfc623e16ed97176c0ccf0ebc534d118e9f252 Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Wed, 26 Feb 2025 19:59:59 +0200 Subject: [PATCH v2] Clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity. --- doc/src/sgml/monitoring.sgml | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9178f1d34ef..de49769d407 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1016,7 +1016,11 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser it may or may not be waiting on some event. If the state is active and wait_event is non-null, it means that a query is being executed, but is being blocked somewhere -in the system. +in the system. To keep the reporting low-overhead, the system uses very lightweight +synchronization. As a result, ephemeral discrepancies between the view's columns, +for example between wait_event and +state, or between state and +query_id, are possible by nature. -- 2.41.0
A small correction to doc and comment of FSM for indexes
Hi, This patch fixes a couple of small inaccuracies in the doc and the comment for FSM about index handling. 1. In the doc for pg_freespacemap, it currently says: For indexes, what is tracked is entirely-unused pages, rather than free space within pages. Therefore, the values are not meaningful, just whether a page is full or empty. However, as what is tracked is entirely-unused pages, the values mean whether a page is "in-use or empty", rather than "full or empty". 2. In indexfsm.c the header comment says: * This is similar to the FSM used for heap, in freespace.c, but instead * of tracking the amount of free space on pages, we only track whether * pages are completely free or in-use. We use the same FSM implementation * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused. However, in the code we see that used pages are marked with 0: /* * RecordUsedIndexPage - mark a page as used in the FSM */ void RecordUsedIndexPage(Relation rel, BlockNumber usedBlock) { RecordPageWithFreeSpace(rel, usedBlock, 0); } And free pages are marked with BLCKSZ - 1: /* * RecordFreeIndexPage - mark a page as free in the FSM */ void RecordFreeIndexPage(Relation rel, BlockNumber freeBlock) { RecordPageWithFreeSpace(rel, freeBlock, BLCKSZ - 1); } And so, this patch also fixes the comment's "using BLCKSZ - 1 to denote used pages, and 0 for unused" to be "using 0 to denote used pages, and BLCKSZ - 1 for unused". While these changes are minor, I've seen how this can cause a bit of confusion, and it would be good to clarify it. Best regards, Alex FriedmanFrom a1b78438343fca053aa0014687eaba34d5e160e0 Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Tue, 25 Feb 2025 19:12:53 +0200 Subject: [PATCH v1] A small correction to doc and comment of FSM for indexes. --- doc/src/sgml/pgfreespacemap.sgml | 2 +- src/backend/storage/freespace/indexfsm.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/pgfreespacemap.sgml b/doc/src/sgml/pgfreespacemap.sgml index 829ad60f32f..3774a9f8c6b 100644 --- a/doc/src/sgml/pgfreespacemap.sgml +++ b/doc/src/sgml/pgfreespacemap.sgml @@ -67,7 +67,7 @@ For indexes, what is tracked is entirely-unused pages, rather than free space within pages. Therefore, the values are not meaningful, just - whether a page is full or empty. + whether a page is in-use or empty. diff --git a/src/backend/storage/freespace/indexfsm.c b/src/backend/storage/freespace/indexfsm.c index 1fc263892a7..3cd2437599d 100644 --- a/src/backend/storage/freespace/indexfsm.c +++ b/src/backend/storage/freespace/indexfsm.c @@ -16,7 +16,7 @@ * This is similar to the FSM used for heap, in freespace.c, but instead * of tracking the amount of free space on pages, we only track whether * pages are completely free or in-use. We use the same FSM implementation - * as for heaps, using BLCKSZ - 1 to denote used pages, and 0 for unused. + * as for heaps, using 0 to denote used pages, and BLCKSZ - 1 for unused. * *- */ -- 2.41.0
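For anyone who wants to see this behavior directly, something like the following can be used (illustrative only; the index name some_index is made up):

  CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

  -- For an index, avail is not a meaningful amount of free space: pages
  -- that are in use report 0, and completely unused (recyclable) pages
  -- report a value close to the block size, i.e. the map only tells you
  -- whether a page is in-use or empty.
  SELECT *
  FROM   pg_freespace('some_index'::regclass)
  ORDER  BY blkno
  LIMIT  10;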
Re: Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity
discrepancy will look like. What about we do something much more simplified, such as the below: """ To keep the reporting overhead low, the system does not attempt to synchronize activity data for a backend. As a result, ephemeral discrepancies may exist between the view’s columns. """ Yes, I believe it makes sense to make it more generic. Attached v3 with a slight tweak: +in the system. To keep the reporting overhead low, the system does not attempt to +synchronize different aspects of activity data for a backend. As a result, ephemeral +discrepancies may exist between the view's columns. Best regards, Alex FriedmanFrom 58de88469f6201ae698ee34debcdec028526a72a Mon Sep 17 00:00:00 2001 From: Alex Friedman Date: Wed, 26 Feb 2025 19:59:59 +0200 Subject: [PATCH v3] Clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity. --- doc/src/sgml/monitoring.sgml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 9178f1d34ef..0e34b3509b8 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1016,7 +1016,9 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser it may or may not be waiting on some event. If the state is active and wait_event is non-null, it means that a query is being executed, but is being blocked somewhere -in the system. +in the system. To keep the reporting overhead low, the system does not attempt to +synchronize different aspects of activity data for a backend. As a result, ephemeral +discrepancies may exist between the view's columns. -- 2.41.0
Re: Doc fix of aggressive vacuum threshold for multixact members storage
>I decided to leave this out, since I just remembered that the most > likely change is actually to move to 64-bit offsets, as was proposed > here and has some enthusiastic support: > > https://www.postgresql.org/message-id/CACG=ezawg7_nt-8ey4akv2w9lculthhknwcawmbgeetnjrj...@mail.gmail.com Thanks for the review and the draft, looks good to me, and I'm okay with doing this without the code comments. However, it seems like that thread is just the beginning of wider changes (if they indeed happen), which may impact these calculations as well, and then maybe a doc update reminder may come in useful? Best regards, Alex Friedman
Re: Doc fix of aggressive vacuum threshold for multixact members storage
Good points, thank you. I'm good with going ahead as you've suggested. Best regards, Alex Friedman