Re: Allow escape in application_name
On 2021/12/24 13:49, Kyotaro Horiguchi wrote: I'm ok to remove the check "values[i] != NULL", but think that it's better to keep the other check "*(values[i]) != '\0'" as it is. Because *(values[i]) can be null character and it's a waste of cycles to call process_pgfdw_appname() in that case. Right. I removed too much. Thanks for the check! So I kept the check "*(values[i]) != '\0'" as it is and pushed the patch. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
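For readers following along, the guard being discussed can be sketched in isolation like this. process_pgfdw_appname() is stubbed out here; the real function lives in contrib/postgres_fdw and substitutes escape sequences in application_name:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for postgres_fdw's process_pgfdw_appname(); the real function
 * walks the string and substitutes escape sequences. */
static const char *
process_pgfdw_appname_stub(const char *raw)
{
    return raw;
}

/* Sketch of the kept check: an empty application_name has nothing to
 * substitute, so calling the processing function would waste cycles. */
static const char *
maybe_process_appname(const char *value)
{
    if (*value != '\0')
        return process_pgfdw_appname_stub(value);
    return value;
}
```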
Re: sequences vs. synchronous replication
At Fri, 24 Dec 2021 08:23:13 +0100, Tomas Vondra wrote in > > > On 12/24/21 06:37, Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 19:50:22 +0100, Tomas Vondra > > wrote in > >> On 12/23/21 15:42, Fujii Masao wrote: > >>> On 2021/12/23 3:49, Tomas Vondra wrote: > Attached is a patch tweaking WAL logging - in wal_level=minimal we do > the same thing as now, in higher levels we log every sequence fetch. > >>> Thanks for the patch! > >>> With the patch, I found that the regression test for sequences failed. > >>> + fetch = log = fetch; > >>> This should be "log = fetch"? > >>> On second thought, originally a sequence doesn't guarantee that the > >>> value already returned by nextval() will never be returned by > >>> subsequent nextval() after the server crash recovery. That is, > >>> nextval() may return the same value across crash recovery. Is this > >>> understanding right? For example, this case can happen if the server > >>> crashes after nextval() returned the value but before WAL for the > >>> sequence was flushed to the permanent storage. > >> > >> I think the important step is commit. We don't guarantee anything for > >> changes in uncommitted transactions. If you do nextval in a > >> transaction and the server crashes before the WAL gets flushed before > >> COMMIT, then yes, nextval may generate the same nextval again. But > >> after commit that is not OK - it must not happen. > > I don't mean to stand on Fujii-san's side particularly, but it seems > > to me sequences of RDBSs are not rolled back generally. Some googling > > told me that at least Oracle (documented), MySQL, DB2 and MS-SQL > > server doesn't rewind sequences at rollback, that is, sequences are > > incremented independtly from transaction control. It seems common to > > think that two nextval() calls for the same sequence must not return > > the same value in any context. > > > > Yes, sequences are not rolled back on abort generally. 
> That would require much stricter locking, and that'd go against using sequences in concurrent sessions. I think so. > But we're not talking about sequence rollback - we're talking about data loss, caused by failure to flush WAL for a sequence. But that affects the *current* code too, and to much greater extent. Ah, yes, I don't object to that aspect. > Consider this: > > BEGIN; > SELECT nextval('s') FROM generate_series(1,1000) s(i); > ROLLBACK; -- or crash of a different backend > > BEGIN; > SELECT nextval('s'); > COMMIT; > > With the current code, this may easily lose the WAL, and we'll generate duplicate values from the sequence. We pretty much ignore the COMMIT. > With the proposed change to WAL logging, that is not possible. The COMMIT flushes enough WAL to prevent this issue. > So this actually makes this issue less severe. > Maybe I'm missing some important detail, though. Can you show an example where the proposed changes make the issue worse? No. It seems to me an improvement, at least over the current state, for the reason you mentioned. > >>> So it's not a bug that sync standby may return the same value as already returned in the primary because the corresponding WAL has not been replicated yet, isn't it? > >> No, I don't think so. Once the COMMIT happens (and gets confirmed by the sync standby), it should be possible to failover to the sync replica without losing any data in committed transaction. Generating duplicate values is a clear violation of that. > > So, strictly speaking, that is a violation of the constraint I mentioned regardless whether the transaction is committed or not. However we have technical limitations as below. > > I don't follow. What violates what? > If the transaction commits (and gets a confirmation from sync replica), the modified WAL logging prevents duplicate values. It does nothing for uncommitted transactions. Seems like an improvement to me.
Sorry for the noise. I misunderstood that ROLLBACK was being changed to roll back sequences. > No idea. IMHO from the correctness / behavior point of view, the modified logging is an improvement. The only issue is the additional overhead, and I think the cache addresses that quite well. Now I understand the story here. I agree that the patch is an improvement over the current behavior. I agree that the overhead is effectively nothing for WAL-emitting workloads. Still, echoing Fujii-san's concern, I'm afraid that some people may suffer from the degradation the patch causes. I wonder whether it would be acceptable to restore the previous behavior by exposing SEQ_LOG_VALS itself, or a boolean to that effect, as a 'not-recommended-to-use' variable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
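A toy model of the WAL-logging behavior under discussion may help; SEQ_LOG_VALS and the pre-logging scheme are simplified here (the real logic lives in nextval_internal()). With batched pre-logging, a WAL record is emitted only when the cache of pre-logged values runs out, while the proposed behavior at higher wal_level logs every fetch:

```c
#include <assert.h>

#define SEQ_LOG_VALS 32          /* values pre-logged per WAL record */

/* Count the WAL records emitted for nfetches nextval() calls.
 * log_every_fetch models the proposed behavior at wal_level > minimal. */
static int
wal_records_for(int nfetches, int log_every_fetch)
{
    int logs = 0;
    int log_cnt = 0;             /* values still covered by the last record */

    for (int i = 0; i < nfetches; i++)
    {
        if (log_every_fetch || log_cnt == 0)
        {
            logs++;
            log_cnt = SEQ_LOG_VALS;
        }
        log_cnt--;
    }
    return logs;
}
```

This also illustrates why the regression is mostly visible for nextval-only workloads: the extra records are tiny, and it is the WAL flush at commit, not the record volume, that dominates.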
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 1:42 PM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat > wrote in > > On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi > > wrote: > > > > LOG: invalidating slot "s1" > > > > DETAIL: The slot's restart_lsn 0/1D68 is behind the limit > > > > 0/1100 defined by max_slot_wal_keep_size. > > > > > > The second line could be changed like the following or anything other. > > > > > > > DETAIL: The slot's restart_lsn 0/1D68 got behind the limit > > > > 0/1100 determined by max_slot_wal_keep_size. > > > . > > > > > > > The second version looks better as it gives more details. I am fine > > with either of the above wordings. > > > > I would prefer everything in the same message though since > > "invalidating slot ..." is too short a LOG message. Not everybody > > enabled details always. > > Mmm. Right. I have gone too much to the same way with the > process-termination message. > > I rearranged the meesages as follows in the attached version. (at master) Thank you for the patch! +1 for improving the messages. > > > LOG: terminating process %d to release replication slot \"%s\" because its > > restart_lsn %X/%X exceeds max_slot_wal_keep_size > > DETAIL: The slot got behind the limit %X/%X determined by > > max_slot_wal_keep_size. > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > max_slot_wal_keep_size > c> DETAIL: The slot got behind the limit %X/%X determined by > max_slot_wal_keep_size. - LSN_FORMAT_ARGS(restart_lsn; + LSN_FORMAT_ARGS(restart_lsn)), +errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(oldestLSN; Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also by wal_keep_size? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Thank you for the comment. At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada wrote in > Thank you for the patch! +1 for improving the messages. > > > > > > LOG: terminating process %d to release replication slot \"%s\" because > > > its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > DETAIL: The slot got behind the limit %X/%X determined by > > > max_slot_wal_keep_size. > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > > max_slot_wal_keep_size > > c> DETAIL: The slot got behind the limit %X/%X determined by > > max_slot_wal_keep_size. > > - > LSN_FORMAT_ARGS(restart_lsn; > + > LSN_FORMAT_ARGS(restart_lsn)), > +errdetail("The slot > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > + > LSN_FORMAT_ARGS(oldestLSN; > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > by wal_keep_size? Right. But I believe the two are not assumed to be used at once. One can set wal_keep_size larger than max_slot_wal_keep_size, but that is effectively a misconfiguration. LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size. Mmm. I don't like this. I feel we don't need such detail in the message. I'd like to hear opinions from others, please. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
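To Sawada-san's point, a simplified (assumed) model of how KeepLogSeg-style logic combines the two GUCs: the slots' requirement alone is capped by max_slot_wal_keep_size, but wal_keep_size always applies as a floor, so it can still end up determining the cutoff. Names and structure here are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: how many segments behind the current insert position the
 * WAL-removal cutoff sits. slot_need is what the slots ask for; the slot
 * contribution is capped by max_slot_keep, while wal_keep always applies. */
static uint64_t
keep_segments(uint64_t slot_need, uint64_t wal_keep, uint64_t max_slot_keep)
{
    uint64_t slot_part = slot_need < max_slot_keep ? slot_need : max_slot_keep;

    return wal_keep > slot_part ? wal_keep : slot_part;
}
```

So in the case where wal_keep_size wins, a message blaming only max_slot_wal_keep_size would indeed be misleading, which argues for neutral wording.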
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Just a complaint. At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi wrote in > "" on Japanese (CP-932) environment. I didn't actually > tested it on Windows and msys environment ...yet. ActivePerl cannot be installed because of (perhaps) a PowerShell version issue... Annoying. https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: sequences vs. synchronous replication
On 12/24/21 09:04, Kyotaro Horiguchi wrote: ... > > So, strictly speaking, that is a violation of the constraint I mentioned regardless whether the transaction is committed or not. However we have technical limitations as below. > > I don't follow. What violates what? > > If the transaction commits (and gets a confirmation from sync replica), the modified WAL logging prevents duplicate values. It does nothing for uncommitted transactions. Seems like an improvement to me. > Sorry for the noise. I misunderstand that ROLLBACK is being changed to rollback sequences. No problem, this part of the code is certainly rather confusing due to several layers of caching and these WAL-logging optimizations. > > No idea. IMHO from the correctness / behavior point of view, the modified logging is an improvement. The only issue is the additional overhead, and I think the cache addresses that quite well. > Now I understand the story here. I agree that the patch is improvment from the current behavior. I agree that the overhead is eventually-nothing for WAL-emitting workloads. OK, thanks. > Still, as Fujii-san concerns, I'm afraid that some people may suffer the degradation the patch causes. I wonder it is acceptable to get back the previous behavior by exposing SEQ_LOG_VALS itself or a boolean to do that, as a 'not-recommended-to-use' variable. Maybe, but what would such workload look like? Based on the tests I did, such workload probably can't generate any WAL. The amount of WAL added by the change is tiny, the regression is caused by having to flush WAL. The only plausible workload I can think of is just calling nextval, and the cache pretty much fixes that. FWIW I plan to explore the idea of looking at sequence page LSN, and flushing up to that position. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM < satyanarlapu...@gmail.com> wrote: > >> > XLogInsert in my opinion is the best place to call it and the hook can be > something like this "void xlog_insert_hook(NULL)" as all the throttling > logic required is the current flush position which can be obtained > from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch. > IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi wrote: > > Thank you for the comment. > > At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada > wrote in > > Thank you for the patch! +1 for improving the messages. > > > > > > > > > LOG: terminating process %d to release replication slot \"%s\" because > > > > its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > > DETAIL: The slot got behind the limit %X/%X determined by > > > > max_slot_wal_keep_size. > > > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > > > max_slot_wal_keep_size > > > c> DETAIL: The slot got behind the limit %X/%X determined by > > > max_slot_wal_keep_size. > > > > - > > LSN_FORMAT_ARGS(restart_lsn; > > + > > LSN_FORMAT_ARGS(restart_lsn)), > > +errdetail("The slot > > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > > + > > LSN_FORMAT_ARGS(oldestLSN; > > > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > > by wal_keep_size? > > Right. But I believe the two are not assumed to be used at once. One > can set wal_keep_size larger than max_slot_wal_keep_size but it is > actually a kind of ill setting. > > LOG: terminating process %d to release replication slot \"%s\" because its > restart_lsn %X/%X exceeds max_slot_wal_keep_size > DETAIL: The slot got behind the limit %X/%X determined by > max_slot_wal_keep_size and wal_keep_size. > > Mmm. I don't like this. I feel we don't need such detail in the > message. How about something like: LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit DETAIL: The slot got behind the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Fri, Dec 24, 2021 at 4:43 PM Dilip Kumar wrote: > > On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM > wrote: >> >> XLogInsert in my opinion is the best place to call it and the hook can be >> something like this "void xlog_insert_hook(NULL)" as all the throttling >> logic required is the current flush position which can be obtained from >> GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch. > > IMHO, it is not a good idea to call an external hook function inside a > critical section. Generally, we ensure that we do not call any code path > within a critical section which can throw an error and if we start calling > the external hook then we lose that control. It should be blocked at the > operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a > little higher level. Yeah, good point. It's not advisable to give control to an external module inside a critical section. For instance, memory allocation isn't allowed there (see [1]), and an ereport(ERROR) would be escalated to PANIC inside the critical section (see [2], [3]). Moreover, a critical section is meant to be short-spanned, i.e., to execute as little code as possible. There's no guarantee that an external module would follow these rules. I suggest we do it at the level of transaction start, i.e., when a txnid is getting allocated in AssignTransactionId(). If we do this, when the throttling limit is exceeded, the current txn (even if it is a long-running txn) continues to do WAL insertions, while the next txns would get blocked. But this is okay and can be conveyed to users via documentation if need be. We already block txnid assignment for parallel workers in this function, so this is a good choice IMO. Thoughts? [1] /* * You should not do memory allocations within a critical section, because * an out-of-memory error will be escalated to a PANIC. To enforce that * rule, the allocation functions Assert that.
*/ #define AssertNotInCriticalSection(context) \ Assert(CritSectionCount == 0 || (context)->allowInCritSection) [2] /* * If we are inside a critical section, all errors become PANIC * errors. See miscadmin.h. */ if (CritSectionCount > 0) elevel = PANIC; [3] * A related, but conceptually distinct, mechanism is the "critical section" * mechanism. A critical section not only holds off cancel/die interrupts, * but causes any ereport(ERROR) or ereport(FATAL) to become ereport(PANIC) * --- that is, a system-wide reset is forced. Needless to say, only really * *critical* code should be marked as a critical section! Currently, this * mechanism is only used for XLOG-related code. Regards, Bharath Rupireddy.
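The placement concern can be made concrete with a sketch: the throttling decision itself only needs the local flush position and the confirmed replica position, so it can run before any critical section is entered (for instance around AssignTransactionId()). All names here are illustrative, not the actual hook API:

```c
#include <assert.h>
#include <stdint.h>

/* Decide whether the backend should wait before generating more WAL.
 * flush_lsn: local WAL flush position; replica_lsn: position confirmed
 * by the standby; max_lag_bytes: the proposed replica_lag_in_bytes. */
static int
should_throttle(uint64_t flush_lsn, uint64_t replica_lsn,
                uint64_t max_lag_bytes)
{
    return flush_lsn - replica_lsn > max_lag_bytes;
}
```

A caller would loop on this (with a latch wait) outside any critical section, so an interrupt or error raised while waiting cannot escalate to PANIC.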
Re: An obsolete comment of pg_stat_statements
On Fri, Dec 24, 2021 at 03:32:10PM +0900, Kyotaro Horiguchi wrote: > Thanks! Looks better. It is used as-is in the attached. > > And I will register this to the next CF. Do we really need to have this comment in the function header? The same is explained a couple of lines down so this feels like a duplicate, and it is hard to miss it with the code shaped as-is (aka the relationship between compute_query_id and queryId and the consequences on what's stored in this case). -- Michael
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 11:21 AM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 20:35:54 +0530, Bharath Rupireddy > wrote in > > Hi, > > > > It is useful (for debugging purposes) if the checkpoint end message > > has the checkpoint LSN and REDO LSN [1]. It gives more context while > > analyzing checkpoint-related issues. The pg_controldata gives the last > > checkpoint LSN and REDO LSN, but having this info alongside the log > > message helps analyze issues that happened previously, connect the > > dots and identify the root cause. > > > > Attaching a small patch herewith. Thoughts? > > A big +1 from me. I thought about proposing the same in the past. Thanks for the review. I've added a CF entry to not lose track - https://commitfest.postgresql.org/36/3474/. > > [1] > > 2021-12-23 14:58:54.694 UTC [1965649] LOG: checkpoint starting: > > shutdown immediate > > 2021-12-23 14:58:54.714 UTC [1965649] LOG: checkpoint complete: wrote > > 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; > > write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0, > > longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB; > > LSN=0/14D0AD0, REDO LSN=0/14D0AD0 > > I thougt about something like the following, but your proposal may be > clearer. > > > WAL range=[0/14D0340, 0/14D0AD0] Yeah, the wording proposed in v1 is clearer, since it explicitly says checkpoint/restartpoint LSN and REDO LSN. Regards, Bharath Rupireddy.
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote: > I thougt about something like the following, but your proposal may be > clearer. +"LSN=%X/%X, REDO LSN=%X/%X", This could be rather confusing for the average user, even if I agree that this is some information that only an advanced user could understand. Could it be possible to define those fields in a more deterministic way? For one, it is hard to understand the relationship between both fields without looking at the code, particularly if both share the same value. It is usually at least somewhat easy to guess what each field of a log message means in context, which is not the case with what you are proposing here. One idea could be to also use "start point" for REDO, for example. -- Michael
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 5:42 PM Michael Paquier wrote: > > On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote: > > I thougt about something like the following, but your proposal may be > > clearer. > > +"LSN=%X/%X, REDO LSN=%X/%X", > This could be rather confusing for the average user, even if I agree > that this is some information that only an advanced user could > understand. Could it be possible to define those fields in a more > deterministic way? For one, it is hard to understand the relationship > between both fields without looking at the code, particulary if both > share the same value. It is at least rather.. Well, mostly, easy to > guess what each other field means in this context, which is not the > case of what you are proposing here. One idea could be use also > "start point" for REDO, for example. How about "location=%X/%X, REDO start location=%X/%X"? The entire log message can look like below: 2021-12-24 12:20:19.140 UTC [1977834] LOG: checkpoint complete:location=%X/%X, REDO start location=%X/%X; wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB Another variant: 2021-12-24 12:20:19.140 UTC [1977834] LOG: checkpoint completed at location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB 2021-12-24 12:20:19.140 UTC [1977834] LOG: restartpoint completed at location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB Thoughts? Regards, Bharath Rupireddy.
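For concreteness, here is a standalone sketch of the proposed message fragment. The LSN_FORMAT_ARGS macro is reproduced with its usual meaning (high and low 32 bits printed as %X/%X); the helper function itself is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Split a 64-bit LSN into the two halves conventionally printed as %X/%X. */
#define LSN_FORMAT_ARGS(lsn) (unsigned int) ((lsn) >> 32), (unsigned int) (lsn)

/* Hypothetical helper producing the proposed fragment of the
 * "checkpoint complete" message. */
static void
format_ckpt_lsns(char *buf, size_t buflen, XLogRecPtr ckpt, XLogRecPtr redo)
{
    snprintf(buf, buflen, "location=%X/%X, REDO start location=%X/%X",
             LSN_FORMAT_ARGS(ckpt), LSN_FORMAT_ARGS(redo));
}
```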
Re: Be clear about what log_checkpoints emits in the documentation
On Fri, Dec 24, 2021 at 11:30 AM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 20:56:22 +0530, Bharath Rupireddy > wrote in > > Hi, > > > > Currently the documentation of the log_checkpoints GUC says the following: > > > > Some statistics are included in the log messages, including the > > number > > of buffers written and the time spent writing them. > > > > Usage of the word "Some" makes it a vague statement. Why can't we just > > be clear about what statistics the log_checkpoints GUC can emit, like > > the attached patch? > > > > Thoughts? > > It seems like simply a maintenance burden of documentation since it > doesn't add any further detail of any item in a checkpoint log > message. But I'm not sure we want detailed explanations for them and > it seems to me we deliberately never explained the detail of any log > messages. Agreed. "Some statistics" can cover whatever existing and newly added stuff, if any. Let it be that way. Regards, Bharath Rupireddy.
Re: correct the sizes of values and nulls arrays in pg_control_checkpoint
On Thu, Dec 23, 2021 at 9:16 PM Bharath Rupireddy wrote: > > On Thu, Dec 23, 2021 at 9:13 PM Euler Taveira wrote: > > > > On Thu, Dec 23, 2021, at 8:39 AM, Bharath Rupireddy wrote: > > > > pg_control_checkpoint emits 18 columns whereas the values and nulls > > arrays are defined to be of size 19. Although it's not critical, > > attaching a tiny patch to fix this. > > > > Good catch! I'm wondering if a constant wouldn't be useful for such case. > > Thanks. I thought of having a macro, but it creates a lot of diff with > the previous versions as we have to change for other pg_control_XXX > functions. I've added a CF entry to not lose track - https://commitfest.postgresql.org/36/3475/ Regards, Bharath Rupireddy.
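Euler's constant suggestion could look like the sketch below; PG_CONTROL_CHECKPOINT_COLS is a hypothetical name, and the arrays stand in for the real function's Datum values[] and bool nulls[]:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant: a single definition of the column count keeps
 * the array sizes and the tuple descriptor from drifting apart again. */
enum { PG_CONTROL_CHECKPOINT_COLS = 18 };

static int
checkpoint_array_len(void)
{
    /* stand-ins for Datum values[...] and bool nulls[...] */
    long values[PG_CONTROL_CHECKPOINT_COLS];
    bool nulls[PG_CONTROL_CHECKPOINT_COLS];

    (void) values;
    (void) nulls;
    return (int) (sizeof(nulls) / sizeof(nulls[0]));
}
```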
Re: Add new function to convert 32-bit XID to 64-bit
On Thu, Dec 23, 2021 at 8:33 PM Fujii Masao wrote: > > Hi, > > When periodically collecting and accumulating statistics or status > information like pg_locks, pg_stat_activity, pg_prepared_xacts, etc for > future troubleshooting or some reasons, I'd like to store a transaction ID of > such information as 64-bit version so that the information of specified > transaction easily can be identified and picked up by transaction ID. > Otherwise it's not easy to distinguish transactions with the same 32-bit XID > but different epoch, only by 32-bit XID. > > But since pg_locks or pg_stat_activity etc returns 32-bit XID, we could not > store it as 64-bit version. To improve this situation, I'd like to propose to > add new function that converts the given 32-bit XID (i.e., xid data type) to > 64-bit one (xid8 data type). If we do this, for example we can easily get > 64-bit XID from pg_stat_activity by "SELECT convert_xid_to_xid8(backend_xid) > FROM pg_stat_activity", if necessary. Thought? What will this function do? From your earlier description it looks like you want this function to add the epoch to the xid when making it a 64-bit value. Is that right? > > As another approach, we can change the data type of > pg_stat_activity.backend_xid or pg_locks.transaction, etc from xid to xid8. > But this idea looks overkill to me, and it may break the existing > applications accessing pg_locks etc. So IMO it's better to just provide the > convertion function. Obviously we will break backward compatibility for applications upgrading to a newer PostgreSQL version. The downside is that applications will need to be changed to use the 64-bit xid. If we want to change the datatype anyway, better to create a new type LongTransactionId or something similar to represent the 64-bit transaction id and then change these functions to use that. -- Best Wishes, Ashutosh Bapat
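Assuming the function is meant to attach the current epoch (as Fujii-san's description suggests), the widening would presumably resemble the epoch inference PostgreSQL already uses elsewhere: an xid that appears "ahead of" the next 32-bit xid must belong to the previous epoch. A standalone sketch, with no claim that it matches the eventual patch:

```c
#include <assert.h>
#include <stdint.h>

/* Widen a 32-bit xid to 64 bits given the current next full (64-bit)
 * transaction id. If the xid looks like it is in the future relative to
 * the next 32-bit xid, it must come from the previous epoch. */
static uint64_t
xid_to_xid8(uint32_t xid, uint64_t next_full_xid)
{
    uint32_t next_xid = (uint32_t) next_full_xid;
    uint32_t epoch = (uint32_t) (next_full_xid >> 32);

    if (xid > next_xid)
        epoch--;
    return ((uint64_t) epoch << 32) | xid;
}
```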
RE: Delay the variable initialization in get_rel_sync_entry
On Friday, December 24, 2021 8:13 AM Michael Paquier wrote: > > On Thu, Dec 23, 2021 at 12:54:41PM -0300, Euler Taveira wrote: > > On Wed, Dec 22, 2021, at 10:11 AM, houzj.f...@fujitsu.com wrote: > >> The extra cost could sometimes be noticeable because get_rel_sync_entry is a > >> hot function which is executed for each change. And the 'am_partition' and > >> 'relkind' are necessary only when we need to rebuild the RelationSyncEntry. > >> > >> Here is the perf result for the case when inserted large amounts of data into a > >> un-published table in which case the cost is noticeable. > >> > >> --12.83%--pgoutput_change > >> |--11.84%--get_rel_sync_entry > >> |--4.76%--get_rel_relispartition > >> |--4.70%--get_rel_relkind > > How does the perf balance change once you apply the patch? Do we have > anything else that stands out? Getting rid of this bottleneck is fine > by itself, but I am wondering if there are more things to worry about > or not. Thanks for the response. Here is the perf result of pgoutput_change after applying the patch. I didn't notice anything else that stands out. |--2.99%--pgoutput_change |--1.80%--get_rel_sync_entry |--1.56%--hash_search Also attached the complete profiles. > > Good catch. WFM. Deferring variable initialization close to its first use is > > good practice. > > Yeah, it is usually a good practice to have the declaration within > the code block that uses it rather than piling everything at the > beginning of the function. Being able to see that in profiles is > annoying, and the change is simple, so I'd like to backpatch it. +1 > This is a period of vacations for a lot of people, so I'll wait until > the beginning-ish of January before doing anything. Thanks, added it to CF. https://commitfest.postgresql.org/36/3471/ Best regards, Hou zj
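A toy model of the patch's effect, with the catalog lookups stubbed by a counter: on the hot path where the cached RelationSyncEntry is valid, no lookup runs at all; only the rebuild path pays for am_partition and relkind:

```c
#include <assert.h>

static int expensive_lookups = 0;   /* stands in for syscache accesses */

static void
lookup_stub(void)
{
    expensive_lookups++;
}

/* Sketch of get_rel_sync_entry() after the patch: the lookups happen
 * only when the entry must be (re)built. */
static void
get_rel_sync_entry_toy(int entry_is_valid)
{
    if (entry_is_valid)
        return;                     /* hot path: no catalog access */
    lookup_stub();                  /* get_rel_relispartition() */
    lookup_stub();                  /* get_rel_relkind() */
}
```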
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Sun, Dec 12, 2021 at 06:08:05PM +0530, Bharath Rupireddy wrote: > Another idea could be to use the infrastructure laid out by the commit > 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of > current recovering WAL file) for every log_startup_progress_interval > seconds/milliseconds. > > One problem is that ereport_startup_progress doesn't work on > StandbyMode, maybe we can remove this restriction unless we have a > major reason for not allowing it on the standby. > /* Prepare to report progress of the redo phase. */ > if (!StandbyMode) > begin_startup_progress_phase(); The relevant conversation starts here. https://www.postgresql.org/message-id/20210814214700.GO10479%40telsasoft.com There was a lot of confusion in that thread, though. The understanding was that it didn't make sense for that feature to continuously log messages on a standby (every 10sec by default). That seems like too much - the issue of a checkpoint logged every 5min was enough of a hurdle. If you're talking about a new feature that uses the infrastructure from 9ce346e, but is controlled by a separate GUC like log_wal_traffic, that could be okay. -- Justin
Proposal: sslmode=tls-only
From 53.2.9. SSL Session Encryption: > When SSL encryption can be performed, the server is expected to send only > the single S byte and then wait for the frontend to initiate an SSL > handshake. If additional bytes are available to read at this point, it > likely means that a man-in-the-middle is attempting to perform a > buffer-stuffing attack (CVE-2021-23222). Frontends should be coded either > to read exactly one byte from the socket before turning the socket over to > their SSL library, or to treat it as a protocol violation if they find they > have read additional bytes. > An initial SSLRequest can also be used in a connection that is being > opened to send a CancelRequest message. > While the protocol itself does not provide a way for the server to force > SSL encryption, the administrator can configure the server to reject > unencrypted sessions as a byproduct of authentication checking. Has consideration been given to having something like sslmode=tls-only, where the SSLRequest message is skipped and the TLS handshake starts immediately, with the protocol continuing after that? This has several advantages: 1) One less round-trip when establishing the connection, speeding this up. TLS 1.3 removed a round-trip and this was significant, so it could help. 2) No possibility of downgrading to non-TLS. (Not sure this is an issue though.) 3) Connections work through normal TLS proxies, and SNI can be used for routing. This final advantage is the main reason I'd like to see this implemented. Postgres is increasingly being run in multi-tenant Kubernetes clusters where load-balancers and node ports are not available, are too costly, or don't scale, and it is currently difficult to connect to databases running there. If it was possible to use normal ingress TLS proxies, running Postgres in Kubernetes would be much easier, and there are other use cases where just using TLS would be beneficial as well.
Questions about TLS SNI support have been asked for several years now and this would be a big usability improvement. SNI support was recently added to the client-side and this seems like the next logical step. Thoughts? - Keith
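For reference, the round-trip being saved: today a client first sends the 8-byte SSLRequest packet (an int32 length of 8, then the magic request code 80877103) and waits for the server's single 'S' byte before starting TLS; a hypothetical sslmode=tls-only would skip this exchange and send the TLS ClientHello immediately. The packet itself is easy to reconstruct from the protocol documentation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v to p in network byte order (big-endian). */
static void
put_u32_be(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) (v >> 24);
    p[1] = (unsigned char) (v >> 16);
    p[2] = (unsigned char) (v >> 8);
    p[3] = (unsigned char) v;
}

/* Build the 8-byte SSLRequest: int32 length (8), then the request code
 * 80877103 (0x04D2162F, i.e. 1234 << 16 | 5679). Under a tls-only mode
 * this message would simply never be sent. */
static size_t
build_sslrequest(unsigned char *buf)
{
    put_u32_be(buf, 8);
    put_u32_be(buf + 4, 80877103);
    return 8;
}
```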
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby wrote: > > On Sun, Dec 12, 2021 at 06:08:05PM +0530, Bharath Rupireddy wrote: > > Another idea could be to use the infrastructure laid out by the commit > > 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of > > current recovering WAL file) for every log_startup_progress_interval > > seconds/milliseconds. > > > > One problem is that ereport_startup_progress doesn't work on > > StandbyMode, maybe we can remove this restriction unless we have a > > major reason for not allowing it on the standby. > > /* Prepare to report progress of the redo phase. */ > > if (!StandbyMode) > > begin_startup_progress_phase(); > > The relevant conversation starts here. > https://www.postgresql.org/message-id/20210814214700.GO10479%40telsasoft.com > > There was a lot of confusion in that thread, though. > > The understanding was that it didn't make sense for that feature to > continuously log messages on a standby (every 10sec by default). That seems > like too much - the issue of a checkpointed logged every 5min was enough of a > hurdle. > > If you're talking about a new feature that uses the infrastructre from 9ce3, > but is controlled by a separate GUC like log_wal_traffic, that could be okay. Do you see any problems using the same GUC log_startup_progress_interval and ereport_startup_progress API introduced by 9ce346e? I prefer this instead of a new GUC. Thoughts? Regards, Bharath Rupireddy.
Re: 回复:Re: Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?
On Thu, Dec 23, 2021 at 3:52 AM 曾文旌(义从) wrote: > > Fixed a bug found during testing. > > > Wenjing > > >>> Hi, + if (condition_is_safe_pushdown_to_sublink(rinfo, expr_info->outer)) + { + /* replace qual expr from outer var = const to var = const and push down to sublink query */ + sublink_query_push_qual(subquery, (Node *)copyObject(rinfo->clause), expr_info->outer, expr_info->inner); Since sublink_query_push_qual() is always guarded by condition_is_safe_pushdown_to_sublink(), it seems sublink_query_push_qual() can be folded into condition_is_safe_pushdown_to_sublink(). For generate_base_implied_equalities(): + if (ec->ec_processed) + { + ec_index++; + continue; + } + else if (list_length(ec->ec_members) > 1) Minor comment: the keyword `else` can be omitted (due to `continue` above). +* Since there may be an unexpanded sublink in the targetList, +* we'll skip it for now. Since there may be an -> If there is an + {"lazy_process_sublink", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("enable lazy process sublink."), Looking at existing examples from src/backend/utils/misc/guc.c, enable_lazy_sublink_processing seems to be consistent with existing guc variable naming. +lazy_process_sublinks(PlannerInfo *root, bool single_result_rte) lazy_process_sublinks -> lazily_process_sublinks + else + { /* There shouldn't be any OJ info to translate, as yet */ Assert(subroot->join_info_list == NIL); Indentation for the else block is off. + if (istop) + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, false); + else + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, true); The above can be written as: + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, !istop); For find_equal_conditions_contain_uplevelvar_in_sublink_query(): + context.has_unexpected_expr == false && `!context.has_unexpected_expr` should suffice equal_expr_safety_check -> is_equal_expr_safe Cheers
Re: Proposal: sslmode=tls-only
On 12/24/21 09:08, Keith Burdis wrote: > From 53.2.9. SSL Session Encryption: > > > When SSL encryption can be performed, the server is expected to > send only the single S byte and then wait for the frontend to > initiate an SSL handshake. If additional bytes are available to > read at this point, it likely means that a man-in-the-middle is > attempting to perform a buffer-stuffing attack (CVE-2021-23222). > Frontends should be coded either to read exactly one byte from the > socket before turning the socket over to their SSL library, or to > treat it as a protocol violation if they find they have read > additional bytes. > > > An initial SSLRequest can also be used in a connection that is > being opened to send a CancelRequest message. > > > While the protocol itself does not provide a way for the server to > force SSL encryption, the administrator can configure the server > to reject unencrypted sessions as a byproduct of authentication > checking. > > > Has consideration been given to having something like > ssl-mode=tls-only where the SSLRequest message is skipped and the TLS > handshake starts immediately with the protocol continuing after that? > > This has several advantages: > 1) One less round-trip when establishing the connection, speeding this > up. TLS 1.3 removed a round-trip and this was significant so it could > help. > 2) No possibility of downgrading to non-TLS. (Not sure this is an > issue though.) > 3) Connections work through normal TLS proxies and SNI can be used for > routing. > > This final advantage is the main reason I'd like to see this > implemented. Postgres is increasingly being run in multi-tenant > Kubernetes clusters where load-balancers and node ports are not > available, cost or don't scale and it is currently difficult to > connect to databases running there. 
If it was possible to use normal > ingress TLS proxies, running Postgres in Kubernetes would be much > easier and there are other use cases where just using TLS would be > beneficial as well. > > Questions about TLS SNI support have been asked for several years now > and this would be a big usability improvement. SNI support was > recently added to the client-side and this seems like the next logical > step. > > Thoughts? > > Isn't that going to break every existing client? How is a client supposed to know which protocol to follow? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
[PATCH] reduce page overlap of GiST indexes built using sorted method
Hey! Postgres 14 introduces an option to create a GiST index using a sort method. It allows indexes to be built much faster, but, as mentioned in the sort support patch discussion, the faster build comes at the cost of a higher degree of overlap between pages than for indexes built with the regular method. Sort support was implemented for the GiST opclass in PostGIS but was eventually removed as the default behaviour in the latest 3.2 release because, as discovered by Paul Ramsey https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html, query performance could degrade by 50%. Together with Darafei Praliaskouski and Andrey Borodin, we tried several approaches to solve the query performance degradation:

- The first attempt was to decide whether to make a split depending on the direction of the curve (Z-curve for the Postgres geometry type, Hilbert curve for PostGIS). It was implemented by filling a page up to fillfactor / 2 and then checking the penalty for every next item, continuing to insert into the current page if the penalty is 0 or starting a new page otherwise. It turned out that with this approach the index becomes significantly larger while page overlap still remains high.

- Andrey Borodin implemented LRU + split: a fixed number of pages is kept in memory and the best candidate page to insert the next item into is selected by minimum penalty among these pages. If the best page for insertion is full, it gets split into multiple pages, and if the number of candidate pages after the split exceeds the limit, the pages that have not been inserted into recently are flushed. https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73. We ran some tests, and while query performance using an index built with this approach is fine, the index size is extremely large.
Eventually we implemented an idea outlined in the sort support patch discussion here https://www.postgresql.org/message-id/flat/08173bd0-488d-da76-a904-912c35da4...@iki.fi#09ac9751a4cde897c99b99b2170faf3a: several pages can be collected and then divided into actual index pages by calling picksplit. My benchmarks with the data provided in postgis-devel show that query performance using an index built with the patched sort support is comparable with that of an index built with the regular method. The index size also matches that of an index built with the non-sorting method. It should be noted that with the current implementation of the sorting build method, pages are always filled up to fillfactor. This patch changes that behavior to match the non-sorting method, so pages are not always filled to fillfactor, for the sake of query performance. I'm interested in improving this and I wonder if there are any ideas on it.

Benchmark summary:

create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch  / CREATE INDEX 225.238 ms
without sort support           / CREATE INDEX 446.071 ms

select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch  / SELECT 2646.554 ms
without sort support           / SELECT 2721.718 ms

index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch  / IDXSIZE 4956160 bytes
without sort support           / IDXSIZE 5447680 bytes

More detailed:

Before patch, using sorted method:

postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using gist(geom);
CREATE INDEX
Time: 76.709 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5766.526 ms (00:05.767)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5880.142 ms
(00:05.880)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5778.437 ms (00:05.778)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
                  gist_stat
---------------------------------------------
 Number of levels:          3               +
 Number of pages:           359             +
 Number of leaf pages:      356             +
 Number of tuples:          93034           +
 Number of invalid tuples:  0               +
 Number of leaf tuples:     92676           +
 Total size of tuples:      2609260 bytes   +
 Total size of leaf tuples: 2599200 bytes   +
 Total size of index:       2940928 bytes
(1 row)

After patch, using sorted method:

postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using gist(geom);
CREATE INDEX
Time: 225.238 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 2646.554 ms (00:02.647)
postgres=# select count(*) from roads_rdr a, roads_rdr b whe
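The batching idea the patch is based on can be sketched roughly as follows (a toy model, not the actual GiST code): instead of packing sorted tuples into pages strictly up to fillfactor, collect a batch worth of several pages and let a picksplit-style callback divide the batch into pages, trading some fill for less page overlap. `toy_picksplit` here is an invented stand-in for the opclass picksplit function:

```python
def build_pages(sorted_items, batch_size, picksplit):
    """Collect sorted items into batches spanning several pages, then let
    a picksplit-style callback divide each batch into actual pages."""
    pages = []
    for start in range(0, len(sorted_items), batch_size):
        pages.extend(picksplit(sorted_items[start:start + batch_size]))
    return pages

def toy_picksplit(batch):
    # stand-in for the opclass picksplit: divide a batch into two clusters
    mid = len(batch) // 2
    return [batch[:mid], batch[mid:]] if mid else [batch]

print(build_pages(list(range(10)), 4, toy_picksplit))
```

The real code would of course use the opclass picksplit, which clusters by spatial proximity rather than by position in the batch; the point is only that pages are cut by the split function instead of by fillfactor alone.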
Re: Proposal: sslmode=tls-only
Servers that use sslmode=tls-only would not be compatible with clients that do not yet support it, but that is the same for any similar server-side change, for example if the server requires a minimum of TLS 1.3 but the client only supports TLS 1.2.

IIUC, with the default sslmode=prefer a client currently first tries to connect using SSL by sending an SSLRequest, and if the response is not an S then it tries with a plain StartupMessage. If we wanted this new mode to work by default as well, this could be changed to try a TLS handshake after the SSLRequest and StartupMessage. Similarly, for the other existing sslmode values an SSLRequest could be sent and, if that fails, a TLS handshake could be initiated. But sslmode=prefer is "only provided as the default for backwards compatibility, and is not recommended in secure deployments". I'd expect that most secure deployments set an explicit sslmode=verify-full, so having them set an explicit sslmode=tls-only would not be an issue once the client and server support it.

There is already precedent for having clients handle something new [1]: "The frontend should also be prepared to handle an ErrorMessage response to SSLRequest from the server. This would only occur if the server predates the addition of SSL support to PostgreSQL. (Such servers are now very ancient, and likely do not exist in the wild anymore.) In this case the connection must be closed, but the frontend might choose to open a fresh connection and proceed without requesting SSL."

[1] https://www.postgresql.org/docs/14/protocol-flow.html#id-1.10.5.7.11

On Fri, 24 Dec 2021 at 19:16, Andrew Dunstan wrote: > > On 12/24/21 09:08, Keith Burdis wrote: > > From 53.2.9. SSL Session Encryption: > > > > > > When SSL encryption can be performed, the server is expected to > > send only the single S byte and then wait for the frontend to > > initiate an SSL handshake. 
If additional bytes are available to > > read at this point, it likely means that a man-in-the-middle is > > attempting to perform a buffer-stuffing attack (CVE-2021-23222). > > Frontends should be coded either to read exactly one byte from the > > socket before turning the socket over to their SSL library, or to > > treat it as a protocol violation if they find they have read > > additional bytes. > > > > > > An initial SSLRequest can also be used in a connection that is > > being opened to send a CancelRequest message. > > > > > > While the protocol itself does not provide a way for the server to > > force SSL encryption, the administrator can configure the server > > to reject unencrypted sessions as a byproduct of authentication > > checking. > > > > > > Has consideration been given to having something like > > ssl-mode=tls-only where the SSLRequest message is skipped and the TLS > > handshake starts immediately with the protocol continuing after that? > > > > This has several advantages: > > 1) One less round-trip when establishing the connection, speeding this > > up. TLS 1.3 removed a round-trip and this was significant so it could > > help. > > 2) No possibility of downgrading to non-TLS. (Not sure this is an > > issue though.) > > 3) Connections work through normal TLS proxies and SNI can be used for > > routing. > > > > This final advantage is the main reason I'd like to see this > > implemented. Postgres is increasingly being run in multi-tenant > > Kubernetes clusters where load-balancers and node ports are not > > available, cost or don't scale and it is currently difficult to > > connect to databases running there. If it was possible to use normal > > ingress TLS proxies, running Postgres in Kubernetes would be much > > easier and there are other use cases where just using TLS would be > > beneficial as well. > > > > Questions about TLS SNI support have been asked for several years now > > and this would be a big usability improvement. 
SNI support was > > recently added to the client-side and this seems like the next logical > > step. > > > > Thoughts? > > > > > > Isn't that going to break every existing client? How is a client > supposed to know which protocol to follow? > > > cheers > > > andrew > > > -- > Andrew Dunstan > EDB: https://www.enterprisedb.com > >
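The fallback scheme described earlier in the thread could be modelled like this (a hypothetical sketch, not libpq; the mode names and attempt ordering follow the email's proposal, not any agreed design):

```python
def connection_attempts(sslmode: str):
    """Ordered connection strategies a tls-only-aware client might try."""
    if sslmode == "tls-only":
        return ["direct-tls"]                       # no downgrade permitted
    if sslmode == "prefer":
        # current behaviour (SSLRequest, then plain startup), with a
        # direct TLS handshake appended as a last resort
        return ["sslrequest", "plain", "direct-tls"]
    if sslmode in ("require", "verify-ca", "verify-full"):
        return ["sslrequest", "direct-tls"]
    return ["plain"]                                # sslmode=disable
```

Under this scheme only tls-only clients ever skip the SSLRequest on the first attempt, so existing servers keep working with every mode except the new one.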
Re: minor gripe about lax reloptions parsing for views
On 2021-Dec-21, Mark Dilger wrote: > Rebased patch attached: These tests are boringly repetitive. Can't we have something like a nested loop, with AMs on one and reloptions on the other, where each reloption is tried on each AM and an exception block to report the failure or success for each case? Maybe have the list of AMs queried from pg_am with hardcoded additions if needed? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)
Re: Proposal: sslmode=tls-only
Keith Burdis writes: > Has consideration been given to having something like ssl-mode=tls-only > where the SSLRequest message is skipped and the TLS handshake starts > immediately with the protocol continuing after that? https://www.postgresql.org/message-id/flat/fcc3ebeb7f05775b63f3207ed52a54ea5d17fb42.camel%40vmware.com regards, tom lane
Re: How to retain lesser paths at add_path()?
Hi, While researching the viability of changing Citus' planner to take more advantage of the postgres planner, I ran into the same limitations in add_path as described on this thread. For Citus we are investigating the possibility of introducing Path nodes that describe the operation of transferring tuples over the network. Due to the pruning rules in add_path we currently lose many paths that have great opportunities for future optimization. Having looked at the patches provided, I don't think that adding a new List to the Path structure where we retain 'lesser' paths is the right approach. First of all, this would require extension developers to interpret all these paths, where many would dominate others on cost, sorting, parameterization, etc. Instead I like the approach suggested by both Robert Haas and Tom Lane. > have some kind of figure of merit or other special marking for paths > that will have some possible special advantage in later planning > steps. This is well in line with how the current logic works in add_path, where cost, sorting, parameterization, row count and parallel safety are dimensions on which paths are compared. IMHO extensions should be able to add dimensions of their own interest to this list of currently used dimensions. The thoughts Tom Lane expressed earlier struck a chord with me. > [ thinks a bit... ] Maybe that could be improved if we can express > the result as a bitmask, defined in such a way that OR'ing (or maybe > AND'ing? haven't worked it out) the results from different comparisons > does the right thing. Attached you will find 3 patches that implement a way for extensions to introduce 'a figure of merit' by means of path comparisons. - The first patch refactors the decision logic out of the for-loop into its own function to make it easier to reason about what is being compared. This refactor also changes the direction of some if-statements to provide clear early decision points.
- The second patch rewrites the original if/else/switch tree into 5 logical OR operations on a bitmask expressing the comparison between paths, together with early escapes once we know the paths are different. To keep the original logic there is a slight deviation from simply OR-ing 5 comparisons: after comparing cost, sorting and parameterization, we only continue OR-ing rows and parallelism if the paths are leaning either to path1 (new_path) or path2 (old_path). If the first 3 comparisons result in the paths being equal, the original code prefers parallel paths over paths with fewer rows.
- The third and last patch builds on the logical OR operations introduced in the middle patch. After the first three dimensions postgres compares paths on, we allow extensions to compare paths in the same manner, and their result gets merged into the compounded comparison.
To come to the three patches above I decomposed the original behaviour into 3 possible decisions add_path can take per iteration of the loop: it either REJECTS the new path, DOMINATES the old path, or CONTINUES with the next path in the list. The decision for any of the 3 actions is based on 6 different input parameters:
- cost (standard fuzz factor)
- sort order / pathkeys
- parameterization / bitmap subset of required outer relids
- row count
- parallel safety
- cost (smaller fuzz factor)
To verify that the correct decisions are made in the refactor of the second patch, I modelled both implementations in a scripting language and passed in all possible comparisons for the six dimensions above. With the change of behaviour after the first 3 dimensions, I came to an exact one-to-one mapping of the decisions made before and after patch 2. Throughout this thread an emphasis on performance has been expressed by many participants, and I want to acknowledge their stance. Due to the nature of the current planner, the code in add_path gets invoked on an exponential scale as the number of joins increases and extra paths are retained.
Especially since my proposed patch is called inside the loop where paths are compared, the hook gets invoked even more often than with the patches proposed earlier on this thread. To reason about the performance impact of a patch in this area, I think we need to reach a mutual understanding and agreement on how to evaluate the performance. Given that many if not most installations will run without any hook installed in this area, we should aim for minimal to no measurable overhead when no hook is installed. This is also why I made sure, via modelling the decision logic in a scripting language, that the behaviour of which paths are retained is not changed by these patches when no hook is installed. When an extension needs to add a dimension to the comparison of paths, the impact of this patch is twofold:
- A dynamic function gets invoked to compare every path to every other path. Both the dynamic function call and the logic to compare the paths will introduce extra time spent in add_pat
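The bitmask comparison described above can be sketched as a toy model (constant and function names invented for illustration, not the patch's actual identifiers): each dimension reports which path it favors, the per-dimension results are OR'ed together, and a path can be pruned only when the other path dominates it on every dimension:

```python
PATH1_BETTER = 0x1
PATH2_BETTER = 0x2

def compare_dimension(a, b):
    """Report which path a single dimension favors (lower is better)."""
    if a < b:
        return PATH1_BETTER
    if b < a:
        return PATH2_BETTER
    return 0  # equal on this dimension

def compare_paths(dims1, dims2):
    """OR per-dimension results; once both bits are set, each path wins
    on some dimension, so neither dominates and both must be kept."""
    mask = 0
    for a, b in zip(dims1, dims2):
        mask |= compare_dimension(a, b)
        if mask == (PATH1_BETTER | PATH2_BETTER):
            break  # early escape: paths are incomparable
    return mask
```

A result of PATH1_BETTER alone means the old path can be dominated, PATH2_BETTER alone means the new path can be rejected, 0 means a tie-break is needed, and both bits set means CONTINUE with both paths retained.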
generic plans and "initial" pruning
Executing generic plans involving partitions is known to become slower as partition count grows, due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles. A previous attempt at solving that problem was made by David Rowley [1], who proposed delaying the locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach, which he described in [2], is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and the plan actually being executed. However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock, and to consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch. Note that "initial" pruning steps are now performed twice when executing generic plans: once in AcquireExecutorLocks() to find the partitions to be locked, and a second time in ExecInit[Merge]Append() to determine the set of partition sub-nodes to be initialized for execution, though I wasn't able to come up with a good idea to avoid this duplication.
Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

And plan_cache_mode = force_generic_plan, I get the following numbers:

HEAD:
32   tps = 20561.776403 (without initial connection time)
64   tps = 12553.131423 (without initial connection time)
128  tps = 13330.365696 (without initial connection time)
256  tps = 8605.723120 (without initial connection time)
512  tps = 4435.951139 (without initial connection time)
1024 tps = 2346.902973 (without initial connection time)
2048 tps = 1334.680971 (without initial connection time)

Patched:
32   tps = 27554.156077 (without initial connection time)
64   tps = 27531.161310 (without initial connection time)
128  tps = 27138.305677 (without initial connection time)
256  tps = 25825.467724 (without initial connection time)
512  tps = 19864.386305 (without initial connection time)
1024 tps = 18742.668944 (without initial connection time)
2048 tps = 16312.412704 (without initial connection time)

-- Amit Langote EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=pk9vrhxihbqfz+hc0jcwwnrkkf...@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com

v1-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch Description: Binary data
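The core idea above can be modelled very simply (a simplified sketch, not the actual executor code; names are invented for illustration): AcquireExecutorLocks runs the Append node's initial pruning steps first and locks only the surviving child relations, alongside the plan's ordinary non-partition relations:

```python
def relations_to_lock(plain_rels, append_children, is_pruned):
    """Return the relation ids to lock: all plain relations plus only
    those Append/MergeAppend children that survive initial pruning."""
    survivors = [child for child in append_children if not is_pruned(child)]
    return sorted(set(plain_rels) | set(survivors))

# e.g. a plan over one plain table (rel 100) and four partitions,
# where initial pruning eliminates partitions 2 and 3
print(relations_to_lock([100], [1, 2, 3, 4], lambda rel: rel in {2, 3}))
```

With thousands of partitions and a pruning clause that keeps only a handful, this turns thousands of lock acquisitions per execution into a few, which matches the tps improvement shown in the benchmark.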
Re: row filtering for logical replication
On Fri, Dec 24, 2021 at 11:04 AM Peter Smith wrote: > > The current PG docs text for CREATE PUBLICATION (in the v54-0001 > patch) has a part that says > > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false; avoid using columns without not-null > + constraints in the WHERE clause. > > I felt that the caution to "avoid using" nullable columns is too > strongly worded. AFAIK nullable columns will work perfectly fine so > long as you take due care of them in the WHERE clause. In fact, it > might be very useful sometimes to filter on nullable columns. > > Here is a small test example: > > // publisher > test_pub=# create table t1 (id int primary key, msg text null); > test_pub=# create publication p1 for table t1 where (msg != 'three'); > // subscriber > test_sub=# create table t1 (id int primary key, msg text null); > test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost > dbname=test_pub application_name=sub1' PUBLICATION p1; > > // insert some data > test_pub=# insert into t1 values (1, 'one'), (2, 'two'), (3, 'three'), > (4, null), (5, 'five'); > test_pub=# select * from t1; > id | msg > +--- > 1 | one > 2 | two > 3 | three > 4 | > 5 | five > (5 rows) > > // data at sub > test_sub=# select * from t1; > id | msg > +-- > 1 | one > 2 | two > 5 | five > (3 rows) > > Notice the row 4 with the NULL is also not replicated. But, perhaps we > were expecting it to be replicated (because NULL is not 'three'). To > do this, simply rewrite the WHERE clause to properly account for > nulls. 
> > // truncate both sides > test_pub=# truncate table t1; > test_sub=# truncate table t1; > > // alter the WHERE clause > test_pub=# alter publication p1 set table t1 where (msg is null or msg > != 'three'); > > // insert data at pub > test_pub=# insert into t1 values (1, 'one'), (2, 'two'), (3, 'three'), > (4, null), (5, 'five'); > INSERT 0 5 > test_pub=# select * from t1; > id | msg > +--- > 1 | one > 2 | two > 3 | three > 4 | > 5 | five > (5 rows) > > // data at sub (not it includes the row 4) > test_sub=# select * from t1; > id | msg > +-- > 1 | one > 2 | two > 4 | > 5 | five > (4 rows) > > ~~ > > So, IMO the PG docs wording for this part should be relaxed a bit. > > e.g. > BEFORE: > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false; avoid using columns without not-null > + constraints in the WHERE clause. > AFTER: > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false. To avoid unexpected results, any possible > + null values should be accounted for. > Your suggested wording sounds reasonable to me. Euler, others, any thoughts? -- With Regards, Amit Kapila.
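The behaviour under discussion follows from SQL's three-valued logic, which can be modelled in a few lines (an illustrative model, not the replication code): `NULL != 'three'` evaluates to NULL, and a row filter publishes a row only when the expression is TRUE, so NULL rows are silently skipped unless the clause handles them with IS NULL:

```python
NULL = None

def sql_neq(a, b):
    """SQL '!=': any comparison involving NULL yields NULL."""
    if a is NULL or b is NULL:
        return NULL
    return a != b

def replicate(msg):
    # WHERE (msg != 'three'): replicate only when the expression is TRUE
    return sql_neq(msg, "three") is True

def replicate_fixed(msg):
    # WHERE (msg IS NULL OR msg != 'three')
    return msg is NULL or sql_neq(msg, "three") is True
```

This matches the example in the thread: with the original clause the `(4, null)` row is not replicated, while the rewritten clause replicates it.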