Re: Allow escape in application_name
On 2021/12/24 13:49, Kyotaro Horiguchi wrote: I'm ok to remove the check "values[i] != NULL", but think that it's better to keep the other check "*(values[i]) != '\0'" as it is. Because *(values[i]) can be null character and it's a waste of cycles to call process_pgfdw_appname() in that case. Right. I removed too much. Thanks for the check! So I kept the check "*(values[i]) != '\0'" as it is and pushed the patch. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
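For readers following along, the guard being discussed can be sketched in isolation like this. process_pgfdw_appname() is stubbed out here; the real function lives in contrib/postgres_fdw and substitutes escape sequences in application_name:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for postgres_fdw's process_pgfdw_appname(); the real function
 * walks the string and substitutes escape sequences. */
static const char *
process_pgfdw_appname_stub(const char *raw)
{
    return raw;
}

/* Sketch of the kept check: an empty application_name has nothing to
 * substitute, so calling the processing function would waste cycles. */
static const char *
maybe_process_appname(const char *value)
{
    if (*value != '\0')
        return process_pgfdw_appname_stub(value);
    return value;
}
```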
Re: sequences vs. synchronous replication
At Fri, 24 Dec 2021 08:23:13 +0100, Tomas Vondra wrote in > > > On 12/24/21 06:37, Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 19:50:22 +0100, Tomas Vondra > > wrote in > >> On 12/23/21 15:42, Fujii Masao wrote: > >>> On 2021/12/23 3:49, Tomas Vondra wrote: > Attached is a patch tweaking WAL logging - in wal_level=minimal we do > the same thing as now, in higher levels we log every sequence fetch. > >>> Thanks for the patch! > >>> With the patch, I found that the regression test for sequences failed. > >>> + fetch = log = fetch; > >>> This should be "log = fetch"? > >>> On second thought, originally a sequence doesn't guarantee that the > >>> value already returned by nextval() will never be returned by > >>> subsequent nextval() after the server crash recovery. That is, > >>> nextval() may return the same value across crash recovery. Is this > >>> understanding right? For example, this case can happen if the server > >>> crashes after nextval() returned the value but before WAL for the > >>> sequence was flushed to the permanent storage. > >> > >> I think the important step is commit. We don't guarantee anything for > >> changes in uncommitted transactions. If you do nextval in a > >> transaction and the server crashes before the WAL gets flushed before > >> COMMIT, then yes, nextval may generate the same nextval again. But > >> after commit that is not OK - it must not happen. > > I don't mean to stand on Fujii-san's side particularly, but it seems > > to me sequences of RDBSs are not rolled back generally. Some googling > > told me that at least Oracle (documented), MySQL, DB2 and MS-SQL > > server doesn't rewind sequences at rollback, that is, sequences are > > incremented independtly from transaction control. It seems common to > > think that two nextval() calls for the same sequence must not return > > the same value in any context. > > > > Yes, sequences are not rolled back on abort generally. 
> That would require much stricter locking, and that'd go against using sequences in concurrent sessions. I think so. > But we're not talking about sequence rollback - we're talking about data loss, caused by failure to flush WAL for a sequence. But that affects the *current* code too, and to much greater extent. Ah, yes, I don't object to that aspect. > Consider this: > > BEGIN; > SELECT nextval('s') FROM generate_series(1,1000) s(i); > ROLLBACK; -- or crash of a different backend > > BEGIN; > SELECT nextval('s'); > COMMIT; > > With the current code, this may easily lose the WAL, and we'll generate duplicate values from the sequence. We pretty much ignore the COMMIT. > With the proposed change to WAL logging, that is not possible. The COMMIT flushes enough WAL to prevent this issue. > So this actually makes this issue less severe. > Maybe I'm missing some important detail, though. Can you show an example where the proposed changes make the issue worse? No. It seems to me an improvement, at least over the current state, for the reason you mentioned. > >>> So it's not a bug that sync standby may return the same value as already returned in the primary because the corresponding WAL has not been replicated yet, isn't it? > >> No, I don't think so. Once the COMMIT happens (and gets confirmed by the sync standby), it should be possible to failover to the sync replica without losing any data in committed transaction. Generating duplicate values is a clear violation of that. > > So, strictly speaking, that is a violation of the constraint I mentioned regardless whether the transaction is committed or not. However we have technical limitations as below. > > I don't follow. What violates what? > If the transaction commits (and gets a confirmation from sync replica), the modified WAL logging prevents duplicate values. It does nothing for uncommitted transactions. Seems like an improvement to me.
Sorry for the noise. I misunderstood that ROLLBACK was being changed to roll back sequences. > No idea. IMHO from the correctness / behavior point of view, the modified logging is an improvement. The only issue is the additional overhead, and I think the cache addresses that quite well. Now I understand the story here. I agree that the patch is an improvement over the current behavior. I agree that the overhead is effectively nothing for WAL-emitting workloads. Still, echoing Fujii-san's concern, I'm afraid that some people may suffer from the degradation the patch causes. I wonder whether it would be acceptable to restore the previous behavior by exposing SEQ_LOG_VALS itself, or a boolean to that effect, as a 'not-recommended-to-use' variable. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
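A toy model of the WAL-logging behavior under discussion may help; SEQ_LOG_VALS and the pre-logging scheme are simplified here (the real logic lives in nextval_internal()). With batched pre-logging, a WAL record is emitted only when the cache of pre-logged values runs out, while the proposed behavior at higher wal_level logs every fetch:

```c
#include <assert.h>

#define SEQ_LOG_VALS 32          /* values pre-logged per WAL record */

/* Count the WAL records emitted for nfetches nextval() calls.
 * log_every_fetch models the proposed behavior at wal_level > minimal. */
static int
wal_records_for(int nfetches, int log_every_fetch)
{
    int logs = 0;
    int log_cnt = 0;             /* values still covered by the last record */

    for (int i = 0; i < nfetches; i++)
    {
        if (log_every_fetch || log_cnt == 0)
        {
            logs++;
            log_cnt = SEQ_LOG_VALS;
        }
        log_cnt--;
    }
    return logs;
}
```

This also illustrates why the regression is mostly visible for nextval-only workloads: the extra records are tiny, and it is the WAL flush at commit, not the record volume, that dominates.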
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 1:42 PM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 18:08:08 +0530, Ashutosh Bapat > wrote in > > On Wed, Dec 15, 2021 at 9:42 AM Kyotaro Horiguchi > > wrote: > > > > LOG: invalidating slot "s1" > > > > DETAIL: The slot's restart_lsn 0/1D68 is behind the limit > > > > 0/1100 defined by max_slot_wal_keep_size. > > > > > > The second line could be changed like the following or anything other. > > > > > > > DETAIL: The slot's restart_lsn 0/1D68 got behind the limit > > > > 0/1100 determined by max_slot_wal_keep_size. > > > . > > > > > > > The second version looks better as it gives more details. I am fine > > with either of the above wordings. > > > > I would prefer everything in the same message though since > > "invalidating slot ..." is too short a LOG message. Not everybody > > enabled details always. > > Mmm. Right. I have gone too much to the same way with the > process-termination message. > > I rearranged the meesages as follows in the attached version. (at master) Thank you for the patch! +1 for improving the messages. > > > LOG: terminating process %d to release replication slot \"%s\" because its > > restart_lsn %X/%X exceeds max_slot_wal_keep_size > > DETAIL: The slot got behind the limit %X/%X determined by > > max_slot_wal_keep_size. > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > max_slot_wal_keep_size > c> DETAIL: The slot got behind the limit %X/%X determined by > max_slot_wal_keep_size. - LSN_FORMAT_ARGS(restart_lsn; + LSN_FORMAT_ARGS(restart_lsn)), +errdetail("The slot got behind the limit %X/%X determined by max_slot_wal_keep_size.", + LSN_FORMAT_ARGS(oldestLSN; Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also by wal_keep_size? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: more descriptive message for process termination due to max_slot_wal_keep_size
Thank you for the comment. At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada wrote in > Thank you for the patch! +1 for improving the messages. > > > > > > LOG: terminating process %d to release replication slot \"%s\" because > > > its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > DETAIL: The slot got behind the limit %X/%X determined by > > > max_slot_wal_keep_size. > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > > max_slot_wal_keep_size > > c> DETAIL: The slot got behind the limit %X/%X determined by > > max_slot_wal_keep_size. > > - > LSN_FORMAT_ARGS(restart_lsn; > + > LSN_FORMAT_ARGS(restart_lsn)), > +errdetail("The slot > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > + > LSN_FORMAT_ARGS(oldestLSN; > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > by wal_keep_size? Right. But I believe the two are not assumed to be used at once. One can set wal_keep_size larger than max_slot_wal_keep_size, but that is effectively a misconfiguration. LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size DETAIL: The slot got behind the limit %X/%X determined by max_slot_wal_keep_size and wal_keep_size. Mmm. I don't like this. I feel we don't need such detail in the message. I'd like to hear opinions from others, please. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
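To Sawada-san's point, a simplified (assumed) model of how KeepLogSeg-style logic combines the two GUCs: the slots' requirement alone is capped by max_slot_wal_keep_size, but wal_keep_size always applies as a floor, so it can still end up determining the cutoff. Names and structure here are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: how many segments behind the current insert position the
 * WAL-removal cutoff sits. slot_need is what the slots ask for; the slot
 * contribution is capped by max_slot_keep, while wal_keep always applies. */
static uint64_t
keep_segments(uint64_t slot_need, uint64_t wal_keep, uint64_t max_slot_keep)
{
    uint64_t slot_part = slot_need < max_slot_keep ? slot_need : max_slot_keep;

    return wal_keep > slot_part ? wal_keep : slot_part;
}
```

So in the case where wal_keep_size wins, a message blaming only max_slot_wal_keep_size would indeed be misleading, which argues for neutral wording.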
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Just a complaint. At Fri, 12 Nov 2021 16:43:27 +0900 (JST), Kyotaro Horiguchi wrote in > "" on Japanese (CP-932) environment. I didn't actually > tested it on Windows and msys environment ...yet. ActivePerl cannot be installed because of (perhaps) a PowerShell version issue... Annoying. https://community.activestate.com/t/please-update-your-powershell-install-scripts/7897 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: sequences vs. synchronous replication
On 12/24/21 09:04, Kyotaro Horiguchi wrote: ... > > So, strictly speaking, that is a violation of the constraint I mentioned regardless whether the transaction is committed or not. However we have technical limitations as below. > > I don't follow. What violates what? > > If the transaction commits (and gets a confirmation from sync replica), the modified WAL logging prevents duplicate values. It does nothing for uncommitted transactions. Seems like an improvement to me. > Sorry for the noise. I misunderstand that ROLLBACK is being changed to rollback sequences. No problem, this part of the code is certainly rather confusing due to several layers of caching and these WAL-logging optimizations. > > No idea. IMHO from the correctness / behavior point of view, the modified logging is an improvement. The only issue is the additional overhead, and I think the cache addresses that quite well. > Now I understand the story here. I agree that the patch is improvment from the current behavior. I agree that the overhead is eventually-nothing for WAL-emitting workloads. OK, thanks. > Still, as Fujii-san concerns, I'm afraid that some people may suffer the degradation the patch causes. I wonder it is acceptable to get back the previous behavior by exposing SEQ_LOG_VALS itself or a boolean to do that, as a 'not-recommended-to-use' variable. Maybe, but what would such workload look like? Based on the tests I did, such workload probably can't generate any WAL. The amount of WAL added by the change is tiny, the regression is caused by having to flush WAL. The only plausible workload I can think of is just calling nextval, and the cache pretty much fixes that. FWIW I plan to explore the idea of looking at sequence page LSN, and flushing up to that position. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM < satyanarlapu...@gmail.com> wrote: > >> > XLogInsert in my opinion is the best place to call it and the hook can be > something like this "void xlog_insert_hook(NULL)" as all the throttling > logic required is the current flush position which can be obtained > from GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch. > IMHO, it is not a good idea to call an external hook function inside a critical section. Generally, we ensure that we do not call any code path within a critical section which can throw an error and if we start calling the external hook then we lose that control. It should be blocked at the operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a little higher level. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: more descriptive message for process termination due to max_slot_wal_keep_size
On Fri, Dec 24, 2021 at 5:30 PM Kyotaro Horiguchi wrote: > > Thank you for the comment. > > At Fri, 24 Dec 2021 17:06:57 +0900, Masahiko Sawada > wrote in > > Thank you for the patch! +1 for improving the messages. > > > > > > > > > LOG: terminating process %d to release replication slot \"%s\" because > > > > its restart_lsn %X/%X exceeds max_slot_wal_keep_size > > > > DETAIL: The slot got behind the limit %X/%X determined by > > > > max_slot_wal_keep_size. > > > > > > > LOG: invalidating slot \"%s\" because its restart_LSN %X/%X exceeds > > > > max_slot_wal_keep_size > > > c> DETAIL: The slot got behind the limit %X/%X determined by > > > max_slot_wal_keep_size. > > > > - > > LSN_FORMAT_ARGS(restart_lsn; > > + > > LSN_FORMAT_ARGS(restart_lsn)), > > +errdetail("The slot > > got behind the limit %X/%X determined by max_slot_wal_keep_size.", > > + > > LSN_FORMAT_ARGS(oldestLSN; > > > > Isn't oldestLSN calculated not only by max_slot_wal_keep_size but also > > by wal_keep_size? > > Right. But I believe the two are not assumed to be used at once. One > can set wal_keep_size larger than max_slot_wal_keep_size but it is > actually a kind of ill setting. > > LOG: terminating process %d to release replication slot \"%s\" because its > restart_lsn %X/%X exceeds max_slot_wal_keep_size > DETAIL: The slot got behind the limit %X/%X determined by > max_slot_wal_keep_size and wal_keep_size. > > Mmm. I don't like this. I feel we don't need such detail in the > message. How about something like: LOG: terminating process %d to release replication slot \"%s\" because its restart_lsn %X/%X exceeds the limit DETAIL: The slot got behind the limit %X/%X HINT: You might need to increase max_slot_wal_keep_size or wal_keep_size. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
On Fri, Dec 24, 2021 at 4:43 PM Dilip Kumar wrote: > > On Fri, Dec 24, 2021 at 3:27 AM SATYANARAYANA NARLAPURAM > wrote: >> >> XLogInsert in my opinion is the best place to call it and the hook can be >> something like this "void xlog_insert_hook(NULL)" as all the throttling >> logic required is the current flush position which can be obtained from >> GetFlushRecPtr and the ReplicationSlotCtl. Attached a draft patch. > > IMHO, it is not a good idea to call an external hook function inside a > critical section. Generally, we ensure that we do not call any code path > within a critical section which can throw an error and if we start calling > the external hook then we lose that control. It should be blocked at the > operation level itself e.g. ALTER TABLE READ ONLY, or by some other hook at a > little higher level. Yeah, good point. It's not advisable to give control to an external module inside a critical section. For instance, memory allocation isn't allowed there (see [1]), and an ereport(ERROR) would be escalated to PANIC inside the critical section (see [2], [3]). Moreover, a critical section is meant to be short-spanned, i.e., to execute as little code as possible. There's no guarantee that an external module would follow these rules. I suggest we do it at the level of transaction start, i.e., when a txnid is getting allocated in AssignTransactionId(). If we do this, when the throttling limit is exceeded, the current txn (even if it is a long-running txn) continues to do WAL insertions, while the next txns would get blocked. But this is okay and can be conveyed to users via documentation if need be. We already block txnid assignment for parallel workers in this function, so this is a good choice IMO. Thoughts? [1] /* * You should not do memory allocations within a critical section, because * an out-of-memory error will be escalated to a PANIC. To enforce that * rule, the allocation functions Assert that.
*/ #define AssertNotInCriticalSection(context) \ Assert(CritSectionCount == 0 || (context)->allowInCritSection) [2] /* * If we are inside a critical section, all errors become PANIC * errors. See miscadmin.h. */ if (CritSectionCount > 0) elevel = PANIC; [3] * A related, but conceptually distinct, mechanism is the "critical section" * mechanism. A critical section not only holds off cancel/die interrupts, * but causes any ereport(ERROR) or ereport(FATAL) to become ereport(PANIC) * --- that is, a system-wide reset is forced. Needless to say, only really * *critical* code should be marked as a critical section! Currently, this * mechanism is only used for XLOG-related code. Regards, Bharath Rupireddy.
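The placement concern can be made concrete with a sketch: the throttling decision itself only needs the local flush position and the confirmed replica position, so it can run before any critical section is entered (for instance around AssignTransactionId()). All names here are illustrative, not the actual hook API:

```c
#include <assert.h>
#include <stdint.h>

/* Decide whether the backend should wait before generating more WAL.
 * flush_lsn: local WAL flush position; replica_lsn: position confirmed
 * by the standby; max_lag_bytes: the proposed replica_lag_in_bytes. */
static int
should_throttle(uint64_t flush_lsn, uint64_t replica_lsn,
                uint64_t max_lag_bytes)
{
    return flush_lsn - replica_lsn > max_lag_bytes;
}
```

A caller would loop on this (with a latch wait) outside any critical section, so an interrupt or error raised while waiting cannot escalate to PANIC.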
Re: An obsolete comment of pg_stat_statements
On Fri, Dec 24, 2021 at 03:32:10PM +0900, Kyotaro Horiguchi wrote: > Thanks! Looks better. It is used as-is in the attached. > > And I will register this to the next CF. Do we really need to have this comment in the function header? The same is explained a couple of lines down so this feels like a duplicate, and it is hard to miss it with the code shaped as-is (aka the relationship between compute_query_id and queryId and the consequences on what's stored in this case). -- Michael
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 11:21 AM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 20:35:54 +0530, Bharath Rupireddy > wrote in > > Hi, > > > > It is useful (for debugging purposes) if the checkpoint end message > > has the checkpoint LSN and REDO LSN [1]. It gives more context while > > analyzing checkpoint-related issues. The pg_controldata gives the last > > checkpoint LSN and REDO LSN, but having this info alongside the log > > message helps analyze issues that happened previously, connect the > > dots and identify the root cause. > > > > Attaching a small patch herewith. Thoughts? > > A big +1 from me. I thought about proposing the same in the past. Thanks for the review. I've added a CF entry to not lose track - https://commitfest.postgresql.org/36/3474/. > > [1] > > 2021-12-23 14:58:54.694 UTC [1965649] LOG: checkpoint starting: > > shutdown immediate > > 2021-12-23 14:58:54.714 UTC [1965649] LOG: checkpoint complete: wrote > > 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; > > write=0.001 s, sync=0.001 s, total=0.025 s; sync files=0, > > longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB; > > LSN=0/14D0AD0, REDO LSN=0/14D0AD0 > > I thougt about something like the following, but your proposal may be > clearer. > > > WAL range=[0/14D0340, 0/14D0AD0] Yeah, the wording proposed in v1 is clearer, since it explicitly says checkpoint/restartpoint LSN and REDO LSN. Regards, Bharath Rupireddy.
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote: > I thougt about something like the following, but your proposal may be > clearer. +"LSN=%X/%X, REDO LSN=%X/%X", This could be rather confusing for the average user, even if I agree that this is some information that only an advanced user could understand. Could it be possible to define those fields in a more deterministic way? For one, it is hard to understand the relationship between both fields without looking at the code, particularly if both share the same value. It is usually at least somewhat easy to guess what each field of a log message means in context, which is not the case with what you are proposing here. One idea could be to also use "start point" for REDO, for example. -- Michael
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On Fri, Dec 24, 2021 at 5:42 PM Michael Paquier wrote: > > On Fri, Dec 24, 2021 at 02:51:34PM +0900, Kyotaro Horiguchi wrote: > > I thougt about something like the following, but your proposal may be > > clearer. > > +"LSN=%X/%X, REDO LSN=%X/%X", > This could be rather confusing for the average user, even if I agree > that this is some information that only an advanced user could > understand. Could it be possible to define those fields in a more > deterministic way? For one, it is hard to understand the relationship > between both fields without looking at the code, particulary if both > share the same value. It is at least rather.. Well, mostly, easy to > guess what each other field means in this context, which is not the > case of what you are proposing here. One idea could be use also > "start point" for REDO, for example. How about "location=%X/%X, REDO start location=%X/%X"? The entire log message can look like below: 2021-12-24 12:20:19.140 UTC [1977834] LOG: checkpoint complete:location=%X/%X, REDO start location=%X/%X; wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB Another variant: 2021-12-24 12:20:19.140 UTC [1977834] LOG: checkpoint completed at location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB 2021-12-24 12:20:19.140 UTC [1977834] LOG: restartpoint completed at location=%X/%X with REDO start location=%X/%X: wrote 7 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.005 s, sync=0.007 s, total=0.192 s; sync files=5, longest=0.005 s, average=0.002 s; distance=293 kB, estimate=56584 kB Thoughts? Regards, Bharath Rupireddy.
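For concreteness, here is a standalone sketch of the proposed message fragment. The LSN_FORMAT_ARGS macro is reproduced with its usual meaning (high and low 32 bits printed as %X/%X); the helper function itself is hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

/* Split a 64-bit LSN into the two halves conventionally printed as %X/%X. */
#define LSN_FORMAT_ARGS(lsn) (unsigned int) ((lsn) >> 32), (unsigned int) (lsn)

/* Hypothetical helper producing the proposed fragment of the
 * "checkpoint complete" message. */
static void
format_ckpt_lsns(char *buf, size_t buflen, XLogRecPtr ckpt, XLogRecPtr redo)
{
    snprintf(buf, buflen, "location=%X/%X, REDO start location=%X/%X",
             LSN_FORMAT_ARGS(ckpt), LSN_FORMAT_ARGS(redo));
}
```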
Re: Be clear about what log_checkpoints emits in the documentation
On Fri, Dec 24, 2021 at 11:30 AM Kyotaro Horiguchi wrote: > > At Thu, 23 Dec 2021 20:56:22 +0530, Bharath Rupireddy > wrote in > > Hi, > > > > Currently the documentation of the log_checkpoints GUC says the following: > > > > Some statistics are included in the log messages, including the > > number > > of buffers written and the time spent writing them. > > > > Usage of the word "Some" makes it a vague statement. Why can't we just > > be clear about what statistics the log_checkpoints GUC can emit, like > > the attached patch? > > > > Thoughts? > > It seems like simply a maintenance burden of documentation since it > doesn't add any further detail of any item in a checkpoint log > message. But I'm not sure we want detailed explanations for them and > it seems to me we deliberately never explained the detail of any log > messages. Agreed. "Some statistics" can cover whatever existing and newly added stuff, if any. Let it be that way. Regards, Bharath Rupireddy.
Re: correct the sizes of values and nulls arrays in pg_control_checkpoint
On Thu, Dec 23, 2021 at 9:16 PM Bharath Rupireddy wrote: > > On Thu, Dec 23, 2021 at 9:13 PM Euler Taveira wrote: > > > > On Thu, Dec 23, 2021, at 8:39 AM, Bharath Rupireddy wrote: > > > > pg_control_checkpoint emits 18 columns whereas the values and nulls > > arrays are defined to be of size 19. Although it's not critical, > > attaching a tiny patch to fix this. > > > > Good catch! I'm wondering if a constant wouldn't be useful for such case. > > Thanks. I thought of having a macro, but it creates a lot of diff with > the previous versions as we have to change for other pg_control_XXX > functions. I've added a CF entry to not lose track - https://commitfest.postgresql.org/36/3475/ Regards, Bharath Rupireddy.
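Euler's constant suggestion could look like the sketch below; PG_CONTROL_CHECKPOINT_COLS is a hypothetical name, and the arrays stand in for the real function's Datum values[] and bool nulls[]:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant: a single definition of the column count keeps
 * the array sizes and the tuple descriptor from drifting apart again. */
enum { PG_CONTROL_CHECKPOINT_COLS = 18 };

static int
checkpoint_array_len(void)
{
    /* stand-ins for Datum values[...] and bool nulls[...] */
    long values[PG_CONTROL_CHECKPOINT_COLS];
    bool nulls[PG_CONTROL_CHECKPOINT_COLS];

    (void) values;
    (void) nulls;
    return (int) (sizeof(nulls) / sizeof(nulls[0]));
}
```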
Re: Add new function to convert 32-bit XID to 64-bit
On Thu, Dec 23, 2021 at 8:33 PM Fujii Masao wrote: > > Hi, > > When periodically collecting and accumulating statistics or status > information like pg_locks, pg_stat_activity, pg_prepared_xacts, etc for > future troubleshooting or some reasons, I'd like to store a transaction ID of > such information as 64-bit version so that the information of specified > transaction easily can be identified and picked up by transaction ID. > Otherwise it's not easy to distinguish transactions with the same 32-bit XID > but different epoch, only by 32-bit XID. > > But since pg_locks or pg_stat_activity etc returns 32-bit XID, we could not > store it as 64-bit version. To improve this situation, I'd like to propose to > add new function that converts the given 32-bit XID (i.e., xid data type) to > 64-bit one (xid8 data type). If we do this, for example we can easily get > 64-bit XID from pg_stat_activity by "SELECT convert_xid_to_xid8(backend_xid) > FROM pg_stat_activity", if necessary. Thought? What will this function do? From your earlier description it looks like you want this function to add the epoch to the xid when making it a 64-bit value. Is that right? > > As another approach, we can change the data type of > pg_stat_activity.backend_xid or pg_locks.transaction, etc from xid to xid8. > But this idea looks overkill to me, and it may break the existing > applications accessing pg_locks etc. So IMO it's better to just provide the > convertion function. Obviously we will break backward compatibility for applications upgrading to a newer PostgreSQL version. The downside is that applications will need to be changed to use the 64-bit xid. If we want to change the datatype anyway, better to create a new type LongTransactionId or something similar to represent the 64-bit transaction id and then change these functions to use that. -- Best Wishes, Ashutosh Bapat
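Assuming the function is meant to attach the current epoch (as Fujii-san's description suggests), the widening would presumably resemble the epoch inference PostgreSQL already uses elsewhere: an xid that appears "ahead of" the next 32-bit xid must belong to the previous epoch. A standalone sketch, with no claim that it matches the eventual patch:

```c
#include <assert.h>
#include <stdint.h>

/* Widen a 32-bit xid to 64 bits given the current next full (64-bit)
 * transaction id. If the xid looks like it is in the future relative to
 * the next 32-bit xid, it must come from the previous epoch. */
static uint64_t
xid_to_xid8(uint32_t xid, uint64_t next_full_xid)
{
    uint32_t next_xid = (uint32_t) next_full_xid;
    uint32_t epoch = (uint32_t) (next_full_xid >> 32);

    if (xid > next_xid)
        epoch--;
    return ((uint64_t) epoch << 32) | xid;
}
```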
RE: Delay the variable initialization in get_rel_sync_entry
On Friday, December 24, 2021 8:13 AM Michael Paquier wrote: > > On Thu, Dec 23, 2021 at 12:54:41PM -0300, Euler Taveira wrote: > > On Wed, Dec 22, 2021, at 10:11 AM, houzj.f...@fujitsu.com wrote: > >> The extra cost could sometimes be noticeable because get_rel_sync_entry is a > >> hot function which is executed for each change. And the 'am_partition' and > >> 'relkind' are necessary only when we need to rebuild the RelationSyncEntry. > >> > >> Here is the perf result for the case when inserted large amounts of data into a > >> un-published table in which case the cost is noticeable. > >> > >> --12.83%--pgoutput_change > >> |--11.84%--get_rel_sync_entry > >> |--4.76%--get_rel_relispartition > >> |--4.70%--get_rel_relkind > > How does the perf balance change once you apply the patch? Do we have > anything else that stands out? Getting rid of this bottleneck is fine > by itself, but I am wondering if there are more things to worry about > or not. Thanks for the response. Here is the perf result of pgoutput_change after applying the patch. I didn't notice anything else that stands out. |--2.99%--pgoutput_change |--1.80%--get_rel_sync_entry |--1.56%--hash_search Also attached the complete profiles. > > Good catch. WFM. Deferring variable initialization close to its first use is > > good practice. > > Yeah, it is usually a good practice to have the declaration within > the code block that uses it rather than piling everything at the > beginning of the function. Being able to see that in profiles is > annoying, and the change is simple, so I'd like to backpatch it. +1 > This is a period of vacations for a lot of people, so I'll wait until > the beginning-ish of January before doing anything. Thanks, added it to CF. https://commitfest.postgresql.org/36/3471/ Best regards, Hou zj
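A toy model of the patch's effect, with the catalog lookups stubbed by a counter: on the hot path where the cached RelationSyncEntry is valid, no lookup runs at all; only the rebuild path pays for am_partition and relkind:

```c
#include <assert.h>

static int expensive_lookups = 0;   /* stands in for syscache accesses */

static void
lookup_stub(void)
{
    expensive_lookups++;
}

/* Sketch of get_rel_sync_entry() after the patch: the lookups happen
 * only when the entry must be (re)built. */
static void
get_rel_sync_entry_toy(int entry_is_valid)
{
    if (entry_is_valid)
        return;                     /* hot path: no catalog access */
    lookup_stub();                  /* get_rel_relispartition() */
    lookup_stub();                  /* get_rel_relkind() */
}
```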
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Sun, Dec 12, 2021 at 06:08:05PM +0530, Bharath Rupireddy wrote: > Another idea could be to use the infrastructure laid out by the commit > 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of > current recovering WAL file) for every log_startup_progress_interval > seconds/milliseconds. > > One problem is that ereport_startup_progress doesn't work on > StandbyMode, maybe we can remove this restriction unless we have a > major reason for not allowing it on the standby. > /* Prepare to report progress of the redo phase. */ > if (!StandbyMode) > begin_startup_progress_phase(); The relevant conversation starts here. https://www.postgresql.org/message-id/20210814214700.GO10479%40telsasoft.com There was a lot of confusion in that thread, though. The understanding was that it didn't make sense for that feature to continuously log messages on a standby (every 10sec by default). That seems like too much - the issue of a checkpoint logged every 5min was enough of a hurdle. If you're talking about a new feature that uses the infrastructure from 9ce346e, but is controlled by a separate GUC like log_wal_traffic, that could be okay. -- Justin
Proposal: sslmode=tls-only
From 53.2.9. SSL Session Encryption: > When SSL encryption can be performed, the server is expected to send only > the single S byte and then wait for the frontend to initiate an SSL > handshake. If additional bytes are available to read at this point, it > likely means that a man-in-the-middle is attempting to perform a > buffer-stuffing attack (CVE-2021-23222). Frontends should be coded either > to read exactly one byte from the socket before turning the socket over to > their SSL library, or to treat it as a protocol violation if they find they > have read additional bytes. > An initial SSLRequest can also be used in a connection that is being > opened to send a CancelRequest message. > While the protocol itself does not provide a way for the server to force > SSL encryption, the administrator can configure the server to reject > unencrypted sessions as a byproduct of authentication checking. Has consideration been given to having something like sslmode=tls-only, where the SSLRequest message is skipped and the TLS handshake starts immediately, with the protocol continuing after that? This has several advantages: 1) One less round-trip when establishing the connection, speeding this up. TLS 1.3 removed a round-trip and this was significant, so it could help. 2) No possibility of downgrading to non-TLS. (Not sure this is an issue though.) 3) Connections work through normal TLS proxies, and SNI can be used for routing. This final advantage is the main reason I'd like to see this implemented. Postgres is increasingly being run in multi-tenant Kubernetes clusters where load-balancers and node ports are not available, are too costly, or don't scale, and it is currently difficult to connect to databases running there. If it was possible to use normal ingress TLS proxies, running Postgres in Kubernetes would be much easier, and there are other use cases where just using TLS would be beneficial as well.
Questions about TLS SNI support have been asked for several years now and this would be a big usability improvement. SNI support was recently added to the client-side and this seems like the next logical step. Thoughts? - Keith
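For reference, the round-trip being saved: today a client first sends the 8-byte SSLRequest packet (an int32 length of 8, then the magic request code 80877103) and waits for the server's single 'S' byte before starting TLS; a hypothetical sslmode=tls-only would skip this exchange and send the TLS ClientHello immediately. The packet itself is easy to reconstruct from the protocol documentation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v to p in network byte order (big-endian). */
static void
put_u32_be(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char) (v >> 24);
    p[1] = (unsigned char) (v >> 16);
    p[2] = (unsigned char) (v >> 8);
    p[3] = (unsigned char) v;
}

/* Build the 8-byte SSLRequest: int32 length (8), then the request code
 * 80877103 (0x04D2162F, i.e. 1234 << 16 | 5679). Under a tls-only mode
 * this message would simply never be sent. */
static size_t
build_sslrequest(unsigned char *buf)
{
    put_u32_be(buf, 8);
    put_u32_be(buf + 4, 80877103);
    return 8;
}
```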
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby wrote: > > On Sun, Dec 12, 2021 at 06:08:05PM +0530, Bharath Rupireddy wrote: > > Another idea could be to use the infrastructure laid out by the commit > > 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of > > current recovering WAL file) for every log_startup_progress_interval > > seconds/milliseconds. > > > > One problem is that ereport_startup_progress doesn't work on > > StandbyMode, maybe we can remove this restriction unless we have a > > major reason for not allowing it on the standby. > > /* Prepare to report progress of the redo phase. */ > > if (!StandbyMode) > > begin_startup_progress_phase(); > > The relevant conversation starts here. > https://www.postgresql.org/message-id/20210814214700.GO10479%40telsasoft.com > > There was a lot of confusion in that thread, though. > > The understanding was that it didn't make sense for that feature to > continuously log messages on a standby (every 10sec by default). That seems > like too much - the issue of a checkpointed logged every 5min was enough of a > hurdle. > > If you're talking about a new feature that uses the infrastructre from 9ce3, > but is controlled by a separate GUC like log_wal_traffic, that could be okay. Do you see any problems using the same GUC log_startup_progress_interval and ereport_startup_progress API introduced by 9ce346e? I prefer this instead of a new GUC. Thoughts? Regards, Bharath Rupireddy.
Re: 回复:Re: Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?
On Thu, Dec 23, 2021 at 3:52 AM 曾文旌(义从) wrote: > > Fixed a bug found during testing. > > > Wenjing > > >>> Hi, + if (condition_is_safe_pushdown_to_sublink(rinfo, expr_info->outer)) + { + /* replace qual expr from outer var = const to var = const and push down to sublink query */ + sublink_query_push_qual(subquery, (Node *)copyObject(rinfo->clause), expr_info->outer, expr_info->inner); Since sublink_query_push_qual() is always guarded by condition_is_safe_pushdown_to_sublink(), it seems sublink_query_push_qual() can be folded into condition_is_safe_pushdown_to_sublink(). For generate_base_implied_equalities(): + if (ec->ec_processed) + { + ec_index++; + continue; + } + else if (list_length(ec->ec_members) > 1) Minor comment: the keyword `else` can be omitted (due to `continue` above). +* Since there may be an unexpanded sublink in the targetList, +* we'll skip it for now. Since there may be an -> If there is an + {"lazy_process_sublink", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("enable lazy process sublink."), Looking at existing examples from src/backend/utils/misc/guc.c, enable_lazy_sublink_processing seems to be consistent with existing guc variable naming. +lazy_process_sublinks(PlannerInfo *root, bool single_result_rte) lazy_process_sublinks -> lazily_process_sublinks + else + { /* There shouldn't be any OJ info to translate, as yet */ Assert(subroot->join_info_list == NIL); Indentation for the else block is off. + if (istop) + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, false); + else + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, true); The above can be written as: + f->quals = preprocess_expression_ext(root, f->quals, EXPRKIND_QUAL, !istop); For find_equal_conditions_contain_uplevelvar_in_sublink_query(): + context.has_unexpected_expr == false && `!context.has_unexpected_expr` should suffice equal_expr_safety_check -> is_equal_expr_safe Cheers
Re: Proposal: sslmode=tls-only
On 12/24/21 09:08, Keith Burdis wrote: > From 53.2.9. SSL Session Encryption: > > > When SSL encryption can be performed, the server is expected to > send only the single S byte and then wait for the frontend to > initiate an SSL handshake. If additional bytes are available to > read at this point, it likely means that a man-in-the-middle is > attempting to perform a buffer-stuffing attack (CVE-2021-23222). > Frontends should be coded either to read exactly one byte from the > socket before turning the socket over to their SSL library, or to > treat it as a protocol violation if they find they have read > additional bytes. > > > An initial SSLRequest can also be used in a connection that is > being opened to send a CancelRequest message. > > > While the protocol itself does not provide a way for the server to > force SSL encryption, the administrator can configure the server > to reject unencrypted sessions as a byproduct of authentication > checking. > > > Has consideration been given to having something like > ssl-mode=tls-only where the SSLRequest message is skipped and the TLS > handshake starts immediately with the protocol continuing after that? > > This has several advantages: > 1) One less round-trip when establishing the connection, speeding this > up. TLS 1.3 removed a round-trip and this was significant so it could > help. > 2) No possibility of downgrading to non-TLS. (Not sure this is an > issue though.) > 3) Connections work through normal TLS proxies and SNI can be used for > routing. > > This final advantage is the main reason I'd like to see this > implemented. Postgres is increasingly being run in multi-tenant > Kubernetes clusters where load-balancers and node ports are not > available, cost or don't scale and it is currently difficult to > connect to databases running there. 
If it was possible to use normal > ingress TLS proxies, running Postgres in Kubernetes would be much > easier and there are other use cases where just using TLS would be > beneficial as well. > > Questions about TLS SNI support have been asked for several years now > and this would be a big usability improvement. SNI support was > recently added to the client-side and this seems like the next logical > step. > > Thoughts? > > Isn't that going to break every existing client? How is a client supposed to know which protocol to follow? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
[PATCH] reduce page overlap of GiST indexes built using sorted method
Hey! Postgres 14 introduces an option to create a GiST index using a sort method. It allows indexes to be built much faster, but, as mentioned in the sort support patch discussion, the faster build comes at the cost of a higher degree of overlap between pages than for indexes built with the regular method. Sort support was implemented for the GiST opclass in PostGIS but was eventually removed as the default behaviour in the latest 3.2 release because, as discovered by Paul Ramsey https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html, query performance could degrade by 50%. Together with Darafei Praliaskouski and Andrey Borodin, we tried several approaches to solve the query performance degradation:

- The first attempt was to decide whether to make a split depending on the direction of the curve (Z-curve for the Postgres geometry type, Hilbert curve for PostGIS). It was implemented by filling a page up to fillfactor / 2 and then checking the penalty for every next item, continuing to insert into the current page if the penalty is 0 or starting a new page otherwise. It turned out that with this approach the index becomes significantly larger while page overlap still remains high.

- Andrey Borodin implemented LRU + split: a fixed number of pages is kept in memory and the best candidate page to insert the next item into is selected by minimum penalty among these pages. If the best page for insertion is full, it gets split into multiple pages, and if the number of candidate pages after the split exceeds the limit, the pages that have not been inserted into recently are flushed. https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73. We ran some tests, and while query performance using an index built with this approach is fine, the index size is extremely large.
Eventually we implemented an idea outlined in the sort support patch discussion here https://www.postgresql.org/message-id/flat/08173bd0-488d-da76-a904-912c35da4...@iki.fi#09ac9751a4cde897c99b99b2170faf3a: several pages can be collected and then divided into actual index pages by calling picksplit. My benchmarks with the data provided in postgis-devel show that query performance using an index built with the patched sort support is comparable with that of an index built with the regular method. The index size also matches that of an index built with the non-sorting method. It should be noted that with the current implementation of the sorting build method, pages are always filled up to fillfactor. This patch changes that behavior to match the non-sorting method, so pages are not always filled to fillfactor, for the sake of query performance. I'm interested in improving this and I wonder if there are any ideas on it.

Benchmark summary:

create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch  / CREATE INDEX 225.238 ms
without sort support           / CREATE INDEX 446.071 ms

select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch  / SELECT 2646.554 ms
without sort support           / SELECT 2721.718 ms

index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch  / IDXSIZE 4956160 bytes
without sort support           / IDXSIZE 5447680 bytes

More detailed:

Before patch, using sorted method:

postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using gist(geom);
CREATE INDEX
Time: 76.709 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5766.526 ms (00:05.767)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5880.142 ms
(00:05.880)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 5778.437 ms (00:05.778)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
                  gist_stat
---------------------------------------------
 Number of levels:          3               +
 Number of pages:           359             +
 Number of leaf pages:      356             +
 Number of tuples:          93034           +
 Number of invalid tuples:  0               +
 Number of leaf tuples:     92676           +
 Total size of tuples:      2609260 bytes   +
 Total size of leaf tuples: 2599200 bytes   +
 Total size of index:       2940928 bytes
(1 row)

After patch, using sorted method:

postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using gist(geom);
CREATE INDEX
Time: 225.238 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
 count
--------
 505806
(1 row)

Time: 2646.554 ms (00:02.647)
postgres=# select count(*) from roads_rdr a, roads_rdr b whe
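The batching idea the patch is based on can be sketched roughly as follows (a toy model, not the actual GiST code): instead of packing sorted tuples into pages strictly up to fillfactor, collect a batch worth of several pages and let a picksplit-style callback divide the batch into pages, trading some fill for less page overlap. `toy_picksplit` here is an invented stand-in for the opclass picksplit function:

```python
def build_pages(sorted_items, batch_size, picksplit):
    """Collect sorted items into batches spanning several pages, then let
    a picksplit-style callback divide each batch into actual pages."""
    pages = []
    for start in range(0, len(sorted_items), batch_size):
        pages.extend(picksplit(sorted_items[start:start + batch_size]))
    return pages

def toy_picksplit(batch):
    # stand-in for the opclass picksplit: divide a batch into two clusters
    mid = len(batch) // 2
    return [batch[:mid], batch[mid:]] if mid else [batch]

print(build_pages(list(range(10)), 4, toy_picksplit))
```

The real code would of course use the opclass picksplit, which clusters by spatial proximity rather than by position in the batch; the point is only that pages are cut by the split function instead of by fillfactor alone.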
Re: Proposal: sslmode=tls-only
Servers that use sslmode=tls-only would not be compatible with clients that do not yet support it, but that is the same for any similar server-side change, for example if the server requires a minimum of TLS 1.3 but the client only supports TLS 1.2.

IIUC, with the default sslmode=prefer a client currently first tries to connect using SSL by sending an SSLRequest, and if the response is not an S then it tries with a plain StartupMessage. If we wanted this new mode to work by default as well, this could be changed to try a TLS handshake after the SSLRequest and StartupMessage. Similarly, for the other existing sslmode values an SSLRequest could be sent and, if that fails, a TLS handshake could be initiated. But sslmode=prefer is "only provided as the default for backwards compatibility, and is not recommended in secure deployments". I'd expect that most secure deployments set an explicit sslmode=verify-full, so having them set an explicit sslmode=tls-only would not be an issue once the client and server support it.

There is already precedent for having clients handle something new [1]: "The frontend should also be prepared to handle an ErrorMessage response to SSLRequest from the server. This would only occur if the server predates the addition of SSL support to PostgreSQL. (Such servers are now very ancient, and likely do not exist in the wild anymore.) In this case the connection must be closed, but the frontend might choose to open a fresh connection and proceed without requesting SSL."

[1] https://www.postgresql.org/docs/14/protocol-flow.html#id-1.10.5.7.11

On Fri, 24 Dec 2021 at 19:16, Andrew Dunstan wrote: > > On 12/24/21 09:08, Keith Burdis wrote: > > From 53.2.9. SSL Session Encryption: > > > > > > When SSL encryption can be performed, the server is expected to > > send only the single S byte and then wait for the frontend to > > initiate an SSL handshake. 
If additional bytes are available to > > read at this point, it likely means that a man-in-the-middle is > > attempting to perform a buffer-stuffing attack (CVE-2021-23222). > > Frontends should be coded either to read exactly one byte from the > > socket before turning the socket over to their SSL library, or to > > treat it as a protocol violation if they find they have read > > additional bytes. > > > > > > An initial SSLRequest can also be used in a connection that is > > being opened to send a CancelRequest message. > > > > > > While the protocol itself does not provide a way for the server to > > force SSL encryption, the administrator can configure the server > > to reject unencrypted sessions as a byproduct of authentication > > checking. > > > > > > Has consideration been given to having something like > > ssl-mode=tls-only where the SSLRequest message is skipped and the TLS > > handshake starts immediately with the protocol continuing after that? > > > > This has several advantages: > > 1) One less round-trip when establishing the connection, speeding this > > up. TLS 1.3 removed a round-trip and this was significant so it could > > help. > > 2) No possibility of downgrading to non-TLS. (Not sure this is an > > issue though.) > > 3) Connections work through normal TLS proxies and SNI can be used for > > routing. > > > > This final advantage is the main reason I'd like to see this > > implemented. Postgres is increasingly being run in multi-tenant > > Kubernetes clusters where load-balancers and node ports are not > > available, cost or don't scale and it is currently difficult to > > connect to databases running there. If it was possible to use normal > > ingress TLS proxies, running Postgres in Kubernetes would be much > > easier and there are other use cases where just using TLS would be > > beneficial as well. > > > > Questions about TLS SNI support have been asked for several years now > > and this would be a big usability improvement. 
SNI support was > > recently added to the client-side and this seems like the next logical > > step. > > > > Thoughts? > > > > > > Isn't that going to break every existing client? How is a client > supposed to know which protocol to follow? > > > cheers > > > andrew > > > -- > Andrew Dunstan > EDB: https://www.enterprisedb.com > >
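The fallback scheme described earlier in the thread could be modelled like this (a hypothetical sketch, not libpq; the mode names and attempt ordering follow the email's proposal, not any agreed design):

```python
def connection_attempts(sslmode: str):
    """Ordered connection strategies a tls-only-aware client might try."""
    if sslmode == "tls-only":
        return ["direct-tls"]                       # no downgrade permitted
    if sslmode == "prefer":
        # current behaviour (SSLRequest, then plain startup), with a
        # direct TLS handshake appended as a last resort
        return ["sslrequest", "plain", "direct-tls"]
    if sslmode in ("require", "verify-ca", "verify-full"):
        return ["sslrequest", "direct-tls"]
    return ["plain"]                                # sslmode=disable
```

Under this scheme only tls-only clients ever skip the SSLRequest on the first attempt, so existing servers keep working with every mode except the new one.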
Re: minor gripe about lax reloptions parsing for views
On 2021-Dec-21, Mark Dilger wrote: > Rebased patch attached: These tests are boringly repetitive. Can't we have something like a nested loop, with AMs on one and reloptions on the other, where each reloption is tried on each AM and an exception block to report the failure or success for each case? Maybe have the list of AMs queried from pg_am with hardcoded additions if needed? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)
Re: Proposal: sslmode=tls-only
Keith Burdis writes: > Has consideration been given to having something like ssl-mode=tls-only > where the SSLRequest message is skipped and the TLS handshake starts > immediately with the protocol continuing after that? https://www.postgresql.org/message-id/flat/fcc3ebeb7f05775b63f3207ed52a54ea5d17fb42.camel%40vmware.com regards, tom lane
Re: How to retain lesser paths at add_path()?
Hi, While researching the viability of changing Citus' planner to take more advantage of the postgres planner, I ran into the same limitations in add_path as described on this thread. For Citus we are investigating the possibility of introducing Path nodes that describe the operation of transferring tuples over the network. Due to the pruning rules in add_path we currently lose many paths that have great opportunities for future optimization. Having looked at the patches provided, I don't think that adding a new List to the Path structure where we retain 'lesser' paths is the right approach. First of all, this would require extension developers to interpret all these paths, where many would dominate others on cost, sorting, parameterization, etc. Instead I like the approach suggested by both Robert Haas and Tom Lane. > have some kind of figure of merit or other special marking for paths > that will have some possible special advantage in later planning > steps. This is well in line with how the current logic works in add_path, where cost, sorting, parameterization, row count and parallel safety are dimensions on which paths are compared. IMHO extensions should be able to add dimensions of their own interest to this list of currently used dimensions. The thoughts Tom Lane expressed earlier struck a chord with me. > [ thinks a bit... ] Maybe that could be improved if we can express > the result as a bitmask, defined in such a way that OR'ing (or maybe > AND'ing? haven't worked it out) the results from different comparisons > does the right thing. Attached you will find 3 patches that implement a way for extensions to introduce 'a figure of merit' by means of path comparisons. - The first patch refactors the decision logic out of the for-loop into its own function to make it easier to reason about what is being compared. This refactor also changes the direction of some if-statements to provide clear early decision points.
- The second patch rewrites the original if/else/switch tree into 5 logical OR operations on a bitmask expressing the comparison between paths, together with early escapes once we know the paths are different. To keep the original logic there is a slight deviation from simply OR-ing 5 comparisons: after comparing cost, sorting and parameterization, we only continue OR-ing rows and parallelism if the paths are leaning either to path1 (new_path) or path2 (old_path). If the first 3 comparisons result in the paths being equal, the original code prefers parallel paths over paths with fewer rows.
- The third and last patch builds on the logical OR operations introduced in the middle patch. After the first three dimensions postgres compares paths on, we allow extensions to compare paths in the same manner, and their result gets merged into the compounded comparison.
To come to the three patches above I decomposed the original behaviour into 3 possible decisions add_path can take per iteration of the loop: it either REJECTS the new path, DOMINATES the old path, or CONTINUES with the next path in the list. The decision for any of the 3 actions is based on 6 different input parameters:
- cost (standard fuzz factor)
- sort order / pathkeys
- parameterization / bitmap subset of required outer relids
- row count
- parallel safety
- cost (smaller fuzz factor)
To verify that the correct decisions are made in the refactor of the second patch, I modelled both implementations in a scripting language and passed in all possible comparisons for the six dimensions above. With the change of behaviour after the first 3 dimensions, I came to an exact one-to-one mapping of the decisions made before and after patch 2. Throughout this thread an emphasis on performance has been expressed by many participants, and I want to acknowledge their stance. Due to the nature of the current planner, the code in add_path gets invoked on an exponential scale as the number of joins increases and extra paths are retained.
Especially since my proposed patch is called inside the loop where paths are compared, the hook gets invoked even more often than with the patches proposed earlier on this thread. To reason about the performance impact of a patch in this area, I think we need to reach a mutual understanding and agreement on how to evaluate the performance. Given that many if not most installations will run without any hook installed in this area, we should aim for minimal to no measurable overhead when no hook is installed. This is also why I made sure, via modelling the decision logic in a scripting language, that the behaviour of which paths are retained is not changed by these patches when no hook is installed. When an extension needs to add a dimension to the comparison of paths, the impact of this patch is twofold:
- A dynamic function gets invoked to compare every path to every other path. Both the dynamic function call and the logic to compare the paths will introduce extra time spent in add_pat
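The bitmask comparison described above can be sketched as a toy model (constant and function names invented for illustration, not the patch's actual identifiers): each dimension reports which path it favors, the per-dimension results are OR'ed together, and a path can be pruned only when the other path dominates it on every dimension:

```python
PATH1_BETTER = 0x1
PATH2_BETTER = 0x2

def compare_dimension(a, b):
    """Report which path a single dimension favors (lower is better)."""
    if a < b:
        return PATH1_BETTER
    if b < a:
        return PATH2_BETTER
    return 0  # equal on this dimension

def compare_paths(dims1, dims2):
    """OR per-dimension results; once both bits are set, each path wins
    on some dimension, so neither dominates and both must be kept."""
    mask = 0
    for a, b in zip(dims1, dims2):
        mask |= compare_dimension(a, b)
        if mask == (PATH1_BETTER | PATH2_BETTER):
            break  # early escape: paths are incomparable
    return mask
```

A result of PATH1_BETTER alone means the old path can be dominated, PATH2_BETTER alone means the new path can be rejected, 0 means a tie-break is needed, and both bits set means CONTINUE with both paths retained.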
generic plans and "initial" pruning
Executing generic plans involving partitions is known to become slower as partition count grows, due to a number of bottlenecks, with AcquireExecutorLocks() showing at the top in profiles. A previous attempt at solving that problem was made by David Rowley [1], who proposed delaying the locking of *all* partitions appearing under an Append/MergeAppend until "initial" pruning is done during the executor initialization phase. A problem with that approach, which he described in [2], is that leaving partitions unlocked can lead to race conditions where the Plan node belonging to a partition can be invalidated when a concurrent session successfully alters the partition between AcquireExecutorLocks() saying the plan is okay to execute and the plan actually being executed. However, using an idea that Robert suggested to me off-list a little while back, it seems possible to determine the set of partitions that we can safely skip locking. The idea is to look at the "initial" or "pre-execution" pruning instructions contained in a given Append or MergeAppend node when AcquireExecutorLocks() is collecting the relations to lock, and to consider relations from only those sub-nodes that survive performing those instructions. I've attempted implementing that idea in the attached patch. Note that "initial" pruning steps are now performed twice when executing generic plans: once in AcquireExecutorLocks() to find the partitions to be locked, and a second time in ExecInit[Merge]Append() to determine the set of partition sub-nodes to be initialized for execution, though I wasn't able to come up with a good idea to avoid this duplication.
Using the following benchmark setup:

pgbench testdb -i --partitions=$nparts > /dev/null 2>&1
pgbench -n testdb -S -T 30 -Mprepared

And plan_cache_mode = force_generic_plan, I get the following numbers:

HEAD:
32   tps = 20561.776403 (without initial connection time)
64   tps = 12553.131423 (without initial connection time)
128  tps = 13330.365696 (without initial connection time)
256  tps = 8605.723120 (without initial connection time)
512  tps = 4435.951139 (without initial connection time)
1024 tps = 2346.902973 (without initial connection time)
2048 tps = 1334.680971 (without initial connection time)

Patched:
32   tps = 27554.156077 (without initial connection time)
64   tps = 27531.161310 (without initial connection time)
128  tps = 27138.305677 (without initial connection time)
256  tps = 25825.467724 (without initial connection time)
512  tps = 19864.386305 (without initial connection time)
1024 tps = 18742.668944 (without initial connection time)
2048 tps = 16312.412704 (without initial connection time)

-- Amit Langote EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CAKJS1f_kfRQ3ZpjQyHC7=pk9vrhxihbqfz+hc0jcwwnrkkf...@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f99JNe%2Bsw5E3qWmS%2BHeLMFaAhehKO67J1Ym3pXv0XBsxw%40mail.gmail.com

v1-0001-Teach-AcquireExecutorLocks-to-acquire-fewer-locks.patch Description: Binary data
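The core idea above can be modelled very simply (a simplified sketch, not the actual executor code; names are invented for illustration): AcquireExecutorLocks runs the Append node's initial pruning steps first and locks only the surviving child relations, alongside the plan's ordinary non-partition relations:

```python
def relations_to_lock(plain_rels, append_children, is_pruned):
    """Return the relation ids to lock: all plain relations plus only
    those Append/MergeAppend children that survive initial pruning."""
    survivors = [child for child in append_children if not is_pruned(child)]
    return sorted(set(plain_rels) | set(survivors))

# e.g. a plan over one plain table (rel 100) and four partitions,
# where initial pruning eliminates partitions 2 and 3
print(relations_to_lock([100], [1, 2, 3, 4], lambda rel: rel in {2, 3}))
```

With thousands of partitions and a pruning clause that keeps only a handful, this turns thousands of lock acquisitions per execution into a few, which matches the tps improvement shown in the benchmark.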
Re: row filtering for logical replication
On Fri, Dec 24, 2021 at 11:04 AM Peter Smith wrote: > > The current PG docs text for CREATE PUBLICATION (in the v54-0001 > patch) has a part that says > > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false; avoid using columns without not-null > + constraints in the WHERE clause. > > I felt that the caution to "avoid using" nullable columns is too > strongly worded. AFAIK nullable columns will work perfectly fine so > long as you take due care of them in the WHERE clause. In fact, it > might be very useful sometimes to filter on nullable columns. > > Here is a small test example: > > // publisher > test_pub=# create table t1 (id int primary key, msg text null); > test_pub=# create publication p1 for table t1 where (msg != 'three'); > // subscriber > test_sub=# create table t1 (id int primary key, msg text null); > test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost > dbname=test_pub application_name=sub1' PUBLICATION p1; > > // insert some data > test_pub=# insert into t1 values (1, 'one'), (2, 'two'), (3, 'three'), > (4, null), (5, 'five'); > test_pub=# select * from t1; > id | msg > +--- > 1 | one > 2 | two > 3 | three > 4 | > 5 | five > (5 rows) > > // data at sub > test_sub=# select * from t1; > id | msg > +-- > 1 | one > 2 | two > 5 | five > (3 rows) > > Notice the row 4 with the NULL is also not replicated. But, perhaps we > were expecting it to be replicated (because NULL is not 'three'). To > do this, simply rewrite the WHERE clause to properly account for > nulls. 
> > // truncate both sides > test_pub=# truncate table t1; > test_sub=# truncate table t1; > > // alter the WHERE clause > test_pub=# alter publication p1 set table t1 where (msg is null or msg > != 'three'); > > // insert data at pub > test_pub=# insert into t1 values (1, 'one'), (2, 'two'), (3, 'three'), > (4, null), (5, 'five'); > INSERT 0 5 > test_pub=# select * from t1; > id | msg > +--- > 1 | one > 2 | two > 3 | three > 4 | > 5 | five > (5 rows) > > // data at sub (not it includes the row 4) > test_sub=# select * from t1; > id | msg > +-- > 1 | one > 2 | two > 4 | > 5 | five > (4 rows) > > ~~ > > So, IMO the PG docs wording for this part should be relaxed a bit. > > e.g. > BEFORE: > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false; avoid using columns without not-null > + constraints in the WHERE clause. > AFTER: > + A nullable column in the WHERE clause could cause the > + expression to evaluate to false. To avoid unexpected results, any possible > + null values should be accounted for. > Your suggested wording sounds reasonable to me. Euler, others, any thoughts? -- With Regards, Amit Kapila.
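The behaviour under discussion follows from SQL's three-valued logic, which can be modelled in a few lines (an illustrative model, not the replication code): `NULL != 'three'` evaluates to NULL, and a row filter publishes a row only when the expression is TRUE, so NULL rows are silently skipped unless the clause handles them with IS NULL:

```python
NULL = None

def sql_neq(a, b):
    """SQL '!=': any comparison involving NULL yields NULL."""
    if a is NULL or b is NULL:
        return NULL
    return a != b

def replicate(msg):
    # WHERE (msg != 'three'): replicate only when the expression is TRUE
    return sql_neq(msg, "three") is True

def replicate_fixed(msg):
    # WHERE (msg IS NULL OR msg != 'three')
    return msg is NULL or sql_neq(msg, "three") is True
```

This matches the example in the thread: with the original clause the `(4, null)` row is not replicated, while the rewritten clause replicates it.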