Re: Add “FOR UPDATE NOWAIT” lock details to the log.

2024-12-19 Thread Yuki Seino

I will design a new GUC while ensuring consistency with 'log_lock_waits'.

Regarding the patch, when I applied it to HEAD, it failed to compile with
the following errors. Could you update the patch to address this?

proc.c:1538:20: error: use of undeclared identifier 'buf'
 1538 | initStringInfo(&buf);
  | ^
proc.c:1539:20: error: use of undeclared identifier 'lock_waiters_sbuf'
 1539 | initStringInfo(&lock_waiters_sbuf);
  | ^
proc.c:1540:20: error: use of undeclared identifier 'lock_holders_sbuf'
 1540 | initStringInfo(&lock_holders_sbuf);
  | ^



Conflicted with another commit [1]. I'll rebase it later.


[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3c0fd64fec8ed6fa3987c33f076fcffbc3f268c3
Rebased, added the new GUC log_lock_failure, and made minor fixes.
Currently only NOWAIT errors are supported.

I would like to revisit the possibility of extending this GUC to include
client cancellations and lock timeouts at another opportunity.
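
For reference, a minimal usage sketch of the proposed GUC (the table name
and session layout are illustrative; the GUC comes from the patch, and the
client-side error is existing NOWAIT behavior):

```sql
-- postgresql.conf (or per session via SET, given the privilege):
--   log_lock_failure = on

-- Session 1 takes a row lock:
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;

-- Session 2 fails immediately instead of waiting:
SELECT * FROM accounts WHERE id = 1 FOR UPDATE NOWAIT;
-- ERROR:  could not obtain lock on row in relation "accounts"
-- With log_lock_failure = on, the server log would additionally report
-- the failure and the holding/waiting PIDs, per the patch's
-- "process %d could not obtain %s on %s" message.
```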

--
Regards,
Yuki Seino
NTT DATA GROUP CORPORATION
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fbdd6ce574..559b07e7ff 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7730,6 +7730,25 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-log-lock-failure" xreflabel="log_lock_failure">
+      <term><varname>log_lock_failure</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>log_lock_failure</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls whether a log message is produced when a lock acquisition
+        fails.  This is useful for analyzing the causes of lock failures.
+        Currently, only lock failures due to <literal>NOWAIT</literal> are
+        supported.  The default is <literal>off</literal>.  Only superusers
+        and users with the appropriate <literal>SET</literal> privilege can
+        change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
+
     <varlistentry id="guc-log-recovery-conflict-waits" xreflabel="log_recovery_conflict_waits">
      <term><varname>log_recovery_conflict_waits</varname> (<type>boolean</type>)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 720ef99ee8..0d8e680ff2 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -61,6 +61,7 @@ int			IdleInTransactionSessionTimeout = 0;
 int			TransactionTimeout = 0;
 int			IdleSessionTimeout = 0;
 bool		log_lock_waits = false;
+bool		log_lock_failure = false;
 
 /* Pointer to this process's PGPROC struct, if any */
 PGPROC*MyProc = NULL;
@@ -1210,10 +1211,38 @@ JoinWaitQueue(LOCALLOCK *locallock, LockMethod lockMethodTable, bool dontWait)
 
 	/*
 	 * At this point we know that we'd really need to sleep. If we've been
-	 * commanded not to do that, bail out.
+	 * commanded not to do that, bail out. Output lock information only
+	 * if log_lock_failure is on.
 	 */
 	if (dontWait)
+	{
+		if (log_lock_failure)
+		{
+			StringInfoData buf,
+						lock_waiters_sbuf,
+						lock_holders_sbuf;
+			const char *modename;
+			int			lockHoldersNum = 0;
+
+			initStringInfo(&buf);
+			initStringInfo(&lock_waiters_sbuf);
+			initStringInfo(&lock_holders_sbuf);
+
+			DescribeLockTag(&buf, &locallock->tag.lock);
+			modename = GetLockmodeName(locallock->tag.lock.locktag_lockmethodid,
+									   lockmode);
+
+			/* Collect lock holders and waiters */
+			CollectLockHoldersAndWaiters(proclock, lock, &lock_holders_sbuf,
+										 &lock_waiters_sbuf, &lockHoldersNum);
+
+			ereport(LOG,
+					(errmsg("process %d could not obtain %s on %s",
+							MyProcPid, modename, buf.data),
+					 errdetail_log_plural("Process holding the lock: %s. Wait: %s.",
+										  "Processes holding the lock: %s. Wait: %s.",
+										  lockHoldersNum, lock_holders_sbuf.data,
+										  lock_waiters_sbuf.data)));
+		}
 		return PROC_WAIT_STATUS_ERROR;
+	}
 
/*
 * Insert self into queue, at the position determined above.
@@ -1509,10 +1538,6 @@ ProcSleep(LOCALLOCK *locallock)
longsecs;
int usecs;
longmsecs;
-   dlist_iter  proc_iter;
- 

Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-19 Thread Masahiko Sawada
On Wed, Dec 18, 2024 at 6:26 PM Michael Paquier  wrote:
>
> On Thu, Dec 19, 2024 at 07:50:57AM +0530, Amit Kapila wrote:
> > The difference between fix_memory_leak_v2 and fix_memory_leak_v3 is
> > that the earlier one resets the pubctx to NULL along with freeing the
> > context memory. Resetting a file-level global variable is a good idea,
> > similar to what we do for RelationSyncCache, so I prefer v2 over v3,
> > but I am fine if you would like to proceed with v3.
>
> FWIW, I am not OK with v3.  I've raised this exact point a couple of
> days ago upthread:
> https://www.postgresql.org/message-id/z1t5pxsneyws4...@paquier.xyz
>
> v2 does not have these weaknesses by design.

I agree that v2 is better than v3 in terms of that.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New "single" COPY format

2024-12-19 Thread Joel Jacobson
On Thu, Dec 19, 2024, at 07:48, jian he wrote:
> I have reviewed v21-0001 again.
>
> v21-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
> is a good refactor.
>
>
> overall looks good to me.

OK, I could submit it as a separate patch.

Would we also want the reorganization of existing copy option validations?
That is, v21-0003-Reorganize-option-validations.patch minus 
v21-0002-Add-COPY-format-list.patch.
Nice that you managed to arrange them in the same order as in the 
documentation. Made the code easier to follow.

/Joel




Re: Remaining dependency on setlocale()

2024-12-19 Thread Peter Eisentraut

On 17.12.24 19:10, Jeff Davis wrote:
> On Tue, 2024-12-17 at 13:14 +0100, Peter Eisentraut wrote:
>> I think we will need to keep the global LC_CTYPE setting set to
>> something useful, for example so that system error messages come out in
>> the right encoding.
> 
> Do we need to rely on the global LC_CTYPE setting? We already use
> bind_textdomain_codeset().

I don't think that would cover messages from the C library (strerror,
dlerror, etc.).



>> But I'm concerned about the Perl_setlocale() dance in plperl.c.
>> Perl apparently does a setlocale(LC_ALL, "") during startup, and that
>> code is a workaround to reset everything back afterwards.  We need to
>> be careful not to break that.
>> 
>> (Perl has fixed that in 5.19, but the fix requires that you set another
>> environment variable before launching Perl, which you can't do in a
>> threaded system, so we'd probably need another fix eventually.  See
>> .)
> 
> I don't fully understand that issue, but I would think the direction we
> are going (keeping the global LC_CTYPE more consistent and relying on
> it less) would make the problem better.

Yes, I think it's the right direction, but we need to figure this issue
out eventually.






Re: Parametrization minimum password lenght

2024-12-19 Thread Emanuele Musella
Hi Nathan,

thank you.

It seems to me you are working on an older version of the patch.

Attached is the latest one for your reference.

Thanks for your support

Best regards

Emanuele Musella



On Wed, Dec 18, 2024 at 9:56 PM Nathan Bossart <
nathandboss...@gmail.com> wrote:

> Here is what I have staged for commit.
>
> --
> nathan
>


min_password_length_v9.patch
Description: Binary data


RE: COPY performance on Windows

2024-12-19 Thread Ryohei Takahashi (Fujitsu)
Hi 


Thank you for your advice and testing.

> I think, it could be checked, if table has text fields instead of 
> numeric - we could exclude numeric conversion
> and have the same input-output operations (really more IO-operation, but 
> we need to compare)

I changed the column from int to text.
The performance became worse in each version, but the relative
difference in duration did not change.


> By the way, do you use prebuild Postgres versions for this test or
> build it by yourself with the same options? I am going to use built 
> myself.

In the mail sent at 2024-12-16 12:09:03, I used modules that I built myself.
In the other mail, I used the modules that the community provides at the following:
https://www.postgresql.org/download/windows/


> Could you confirm, that you receive you results on all execution orders 
> (17.0 first and 17.0 last)?

In my environment, the results do not depend on the order.
(The performance of the order 16.4, 16.6, 17.0 is the same as that of
the order 17.0, 16.6, 16.4.)


Regards,
Ryohei Takahashi


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-12-19 Thread Richard Guo
On Thu, Dec 19, 2024 at 6:15 PM Richard Guo  wrote:
> I think we need to check whether rs_tbmiterator is NULL before calling
> tbm_end_iterate on it, like below.
>
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -572,9 +572,11 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
> if (scan)
> {
> /*
> -* End iteration on iterators saved in scan descriptor.
> +* End iteration on iterators saved in scan descriptor, if they
> +* haven't already been cleaned up.
>  */
> -   tbm_end_iterate(&scan->st.rs_tbmiterator);
> +   if (!tbm_exhausted(&scan->st.rs_tbmiterator))
> +   tbm_end_iterate(&scan->st.rs_tbmiterator);
>
> /* rescan to release any page pin */
> table_rescan(node->ss.ss_currentScanDesc, NULL);

This change may also be needed in ExecEndBitmapHeapScan().

Thanks
Richard




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-12-19 Thread Melanie Plageman
On Thu, Dec 19, 2024 at 10:12 AM Richard Guo  wrote:
>
> On Thu, Dec 19, 2024 at 6:15 PM Richard Guo  wrote:
> > I think we need to check whether rs_tbmiterator is NULL before calling
> > tbm_end_iterate on it, like below.
> >
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -572,9 +572,11 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
> > if (scan)
> > {
> > /*
> > -* End iteration on iterators saved in scan descriptor.
> > +* End iteration on iterators saved in scan descriptor, if they
> > +* haven't already been cleaned up.
> >  */
> > -   tbm_end_iterate(&scan->st.rs_tbmiterator);
> > +   if (!tbm_exhausted(&scan->st.rs_tbmiterator))
> > +   tbm_end_iterate(&scan->st.rs_tbmiterator);
> >
> > /* rescan to release any page pin */
> > table_rescan(node->ss.ss_currentScanDesc, NULL);
>
> This change may also be needed in ExecEndBitmapHeapScan().

Thanks, Richard! I'm working on the fix and adding the test case you found.

- Melanie




Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

2024-12-19 Thread Nathan Bossart
On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:
> I've been double-checking the code to refresh myself with the problem,
> and I don't see a reason to not apply something like the attached set
> down to v13 for all these remaining branches (minus an edit of the
> commit message).

LGTM

-- 
nathan




Re: Parametrization minimum password lenght

2024-12-19 Thread Nathan Bossart
On Thu, Dec 19, 2024 at 07:25:30AM +, Bertrand Drouvot wrote:
> +  if (pwdlen < min_password_length)
>        ereport(ERROR,
>                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                 errmsg("password is too short")));
> 
> Now that the minimum password length is not "hardcoded" anymore, I wonder
> if it wouldn't be better to provide more details here (pwdlen and
> min_password_length).
> 
> Suggestion in on_top_of_0001.txt attached.

Yeah, good point.

> +   /* Define custom GUC variables. */
> +   DefineCustomIntVariable("passwordcheck.min_password_length",
> +                           "Minimum allowed password length.",
> +                           NULL,
> +                           &min_password_length,
> +                           8,
> +                           0, INT_MAX,
> 
> Since password must contain both letters and nonletters, 0 seems too low. I
> wonder if 2 is not a better value (done in on_top_of_0001.txt attached).
> 
> Also, it seems to me that INT_MAX is too large (as mentioned in [1]), but
> that's probably a nit.

I don't think we need to restrict the allowed values here.  It's true that
the lower bound will be effectively 2 due to the other checks, but that's
not the responsibility of this one to enforce.  Those other checks will
likely become configurable at some point, too.

> -  errmsg("password is too short")));
> +  errmsg("password is too short: %d (< %d)", pwdlen, min_password_length)));

I think we should explicitly point to the parameter and note its current
value.

-- 
nathan




Re: Parametrization minimum password lenght

2024-12-19 Thread Nathan Bossart
On Thu, Dec 19, 2024 at 11:22:02AM +0100, Emanuele Musella wrote:
> It seems to me you are working on an older version of patch.
> 
> In attached the latest one for your reference.

I used the v9 patch as the basis for v10.  There are a few small edits,
such as removing the upper bound for the parameter and expanding the
documentation, but the feature itself remains intact.

-- 
nathan




Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

2024-12-19 Thread Andres Freund
On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
> On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:
> > I've been double-checking the code to refresh myself with the problem,
> > and I don't see a reason to not apply something like the attached set
> > down to v13 for all these remaining branches (minus an edit of the
> > commit message).
> 
> LGTM

Dito.




Re: connection establishment versus parallel workers

2024-12-19 Thread Nathan Bossart
Sorry for the delay, and thanks again for digging into this.

On Fri, Dec 13, 2024 at 03:56:00PM +1300, Thomas Munro wrote:
> 0001 patch is unchanged, 0002 patch sketches out a response to the
> observation a couple of paragraphs above.

Both of these patches seem to improve matters quite a bit.  I haven't yet
thought too deeply about it all, but upon a skim, your patches seem
entirely reasonable to me.

However, while this makes the test numbers for >= v16 look more like those
for v15, we're also seeing a big jump from v13 to v14.  This bisects pretty
cleanly to commit d872510.  I haven't figured out _why_ this commit is
impacting this particular test, but I figured I'd at least update the
thread with what we know so far.

-- 
nathan




Re: Bug in nbtree SAOP scans with non-required arrays, truncated high key

2024-12-19 Thread Peter Geoghegan
On Wed, Dec 18, 2024 at 3:20 PM Peter Geoghegan  wrote:
> Attached fix addresses the issue by consistently resetting the scan's
> so->scanBehind flag (which might still be set to true from the
> previous page's high key) at the start of _bt_advance_array_keys.

I pushed this fix just now.

I should point out (for the benefit of Tom, or whoever writes the next
set of release notes) that I think that this bug is very unlikely to
occur in practice. Getting wrong answers to queries could only happen
given an index with more than 3 columns, and only for scans with just
the right combination of scan key types. It could also only happen
when the scan advances its array keys using high keys with more than a
single truncated attribute.

Note also that assert-enabled builds would always have seen assertion
failures whenever something like my test case ran. Presumably we'd
have seen a report about such an assertion failure if the underlying
problem was at all common.

--
Peter Geoghegan




Re: Add CASEFOLD() function.

2024-12-19 Thread Peter Eisentraut

On 16.12.24 18:49, Jeff Davis wrote:
> One question I have is whether we want this function to normalize the
> output.
> 
> I believe most use cases would want the output normalized, because
> normalization differences (e.g. "a" U+0061 followed by "combining
> acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
> in case.

Can you explain this in further detail?  I don't quite follow why this
would be required.

> Of course, a user could wrap it with the normalize() function, but
> that's verbose and easy to forget. I'm also not sure that it can be
> made as fast as a combined function that does both.
> 
> And a follow-up question: if it does normalize, the second parameter
> would be the requested normal form. But to accept the keyword forms
> (NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD') we'd
> also need to add CASEFOLD to gram.y (like NORMALIZE). Is that a
> reasonable thing to do?

That's maybe one reason to keep it separate.

Another might be that it's not entirely clear how this should work in
encodings other than UTF-8.  For example, the normalized string might
not be representable in the encoding.






Re: Skip collecting decoded changes of already-aborted transactions

2024-12-19 Thread Amit Kapila
On Thu, Dec 19, 2024 at 7:14 AM Masahiko Sawada  wrote:
>
> On Mon, Dec 9, 2024 at 9:09 PM Amit Kapila  wrote:
> >
> > On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada  
> > wrote:
> > >
> > > I've attached a new version patch that incorporates all comments I got so 
> > > far.
> > >
> >
> > Review comments:
>
> Thank you for reviewing the patch!
>
> > ===
> > 1.
> > + * The given transaction is marked as streamed if appropriate and the 
> > caller
> > + * requested it by passing 'mark_txn_streaming' as true.
> > + *
> >   * 'txn_prepared' indicates that we have decoded the transaction at prepare
> >   * time.
> >   */
> >  static void
> > -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > bool txn_prepared)
> > +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > bool txn_prepared,
> > + bool mark_txn_streaming)
> >  {
> > ...
> >   }
> > + else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
> > (txn->nentries_mem != 0)))
> > + {
> > + /*
> > + * Mark the transaction as streamed, if appropriate.
> >
> > The comments related to the above changes don't clarify in which cases
> > the 'mark_txn_streaming' should be set. Before this patch, it was
> > clear from the comments and code about the cases where we would decide
> > to mark it as streamed.
>
> I think we can rename it to txn_streaming for consistency with
> txn_prepared. I've changed the comment for that.
>

@@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
ReorderBufferTXN *txn,
ReorderBufferChange *specinsert)
 {
  /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
+ ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);

@@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
ReorderBufferTXN *txn)
  * full cleanup will happen as part of the COMMIT PREPAREDs, so now
  * just truncate txn by removing changes and tuplecids.
  */
- ReorderBufferTruncateTXN(rb, txn, true);
+ ReorderBufferTruncateTXN(rb, txn, true, true);

In both the above places, the patch unconditionally passes the
'txn_streaming' even for prepared transactions when it wouldn't be a
streaming xact. Inside the function, the patch handled that by first
checking whether the transaction is prepared (txn_prepared). So, the
logic will work but the function signature and the way its callers are
using make it difficult to use and extend in the future.

I think for the first case, we should get the streaming parameter in
ReorderBufferResetTXN(), and for the second case
ReorderBufferStreamCommit(), we should pass it as false because by
that time transaction is already streamed and prepared. We are
invoking it for cleanup. Even when we call ReorderBufferTruncateTXN()
from ReorderBufferCheckAndTruncateAbortedTXN(), it will be better to
write a comment at the caller about why we are passing this parameter
as false.

>
> >
> > 4.
> > + /*
> > + * Remember if the transaction is already aborted so we can detect when
> > + * the transaction is concurrently aborted during the replay.
> > + */
> > + already_aborted = rbtxn_is_aborted(txn);
> > +
> >   ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
> >   txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
> >
> > @@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
> > TransactionId xid,
> >   * when rollback prepared is decoded and sent, the downstream should be
> >   * able to rollback such a xact. See comments atop DecodePrepare.
> >   *
> > - * Note, for the concurrent_abort + streaming case a stream_prepare was
> > + * Note, for the concurrent abort + streaming case a stream_prepare was
> >   * already sent within the ReorderBufferReplay call above.
> >   */
> > - if (txn->concurrent_abort && !rbtxn_is_streamed(txn))
> > + if (!already_aborted && rbtxn_is_aborted(txn) && !rbtxn_is_streamed(txn))
> >   rb->prepare(rb, txn, txn->final_lsn);
> >
> > It is not clear from the comments how the 'already_aborted' is
> > handled. I think after this patch we would have already truncated all
> > its changes. If so, why do we need to try to replay the changes of
> > such a xact?
>
> I used ReorderBufferReplay() for convenience; it sends begin_prepare()
> and prepare() appropriately, handles streaming-prepared transactions,
> and updates statistics etc. But as you pointed out, it would not be
> necessary to set up a historical snapshot etc. I agree that we don't
> need to try replaying such aborted transactions but I'd like to
> confirm we don't really need to execute invalidation messages even in
> aborted transactions.
>

We need to execute invalidations if we have loaded any cache entries,
for example in the case of streaming. See comments in the function
ReorderBufferAbort(). However, I find both the current changes and the
previous patch a bit difficult to follow. How about if we instead
invent a flag like RBTXN_SENT_PREPARE or something like that and then
use that flag to de

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-12-19 Thread Richard Guo
On Sat, Oct 19, 2024 at 5:48 AM Melanie Plageman  wrote:
> I plan to commit 0002 and 0003 next week. I'm interested if you think
> 0001 is correct.
> I may also commit 0004-0006 as I feel they are ready too.

I noticed an oversight on master, which I think was introduced by the
0005 patch.  In ExecReScanBitmapHeapScan:

-  if (scan->st.bitmap.rs_shared_iterator)
-  {
-  tbm_end_shared_iterate(scan->st.bitmap.rs_shared_iterator);
-  scan->st.bitmap.rs_shared_iterator = NULL;
-  }
-
-  if (scan->st.bitmap.rs_iterator)
-  {
-  tbm_end_private_iterate(scan->st.bitmap.rs_iterator);
-  scan->st.bitmap.rs_iterator = NULL;
-  }
+  tbm_end_iterate(&scan->st.rs_tbmiterator);

I don't think it's safe to call tbm_end_iterate directly on
scan->st.rs_tbmiterator, as it may have already been cleaned up by
this point.

Here is a query that triggers a SIGSEGV fault on master.

create table t (a int);
insert into t values (1), (2);
create index on t (a);

set enable_seqscan to off;
set enable_indexscan to off;

select * from t t1 left join t t2 on t1.a in (select a from t t3 where
t2.a > 1);


I think we need to check whether rs_tbmiterator is NULL before calling
tbm_end_iterate on it, like below.

--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -572,9 +572,11 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
if (scan)
{
/*
-* End iteration on iterators saved in scan descriptor.
+* End iteration on iterators saved in scan descriptor, if they
+* haven't already been cleaned up.
 */
-   tbm_end_iterate(&scan->st.rs_tbmiterator);
+   if (!tbm_exhausted(&scan->st.rs_tbmiterator))
+   tbm_end_iterate(&scan->st.rs_tbmiterator);

/* rescan to release any page pin */
table_rescan(node->ss.ss_currentScanDesc, NULL);

Thanks
Richard




Re: Fix bank selection logic in SLRU

2024-12-19 Thread Yura Sokolov

19.12.2024 13:10, Andrey Borodin wrote:
>> On 19 Dec 2024, at 15:01, Yura Sokolov  wrote:
>> 
>> - `&` takes 0.69ns
>> - `mult-rec` takes 2.94ns
>> - `%` takes 3.24ns.
> 
> Thanks, Yura, for benchmarks and off-list conversation.
> I’ve reproduced similar numbers on my Apple M2.
> I agree that additional 3-4ns are negligible in case of SLRU access.
> 
> +   bits16  nbanks;
> 
> Perhaps, it’s not bits anymore. Also, is 64K banks ought enough for everybody?
> 
> Best regards, Andrey Borodin.


There is an artificial limit currently:

#define SLRU_MAX_ALLOWED_BUFFERS ((1024 * 1024 * 1024) / BLCKSZ)
#define SLRU_BANK_BITSHIFT  4
#define SLRU_BANK_SIZE  (1 << SLRU_BANK_BITSHIFT)

So, there are no more than 8192 banks at the moment.

But I believe some customers will want more than 1GB of SLRU in the
future. (Some already do, actually.)



Yura




Re: Send duration output to separate log files

2024-12-19 Thread Greg Sabino Mullane
Thanks for the feedback on this, everyone. I think, given the lack of
committer enthusiasm, I may have to go the route of redesigning the logging
system completely. At any rate, I cannot circle back to this for a while.

> 2. possibility to support rsyslog by setting different or some syslog
> related redirection by setting different facility.

Yes, that would be nice. Perhaps I'll push just the message_type bit, which
seems to be a common precursor to many of these ideas.

Cheers,
Greg


Re: Removing the pgstat_flush_io() call from the walwriter

2024-12-19 Thread Nazir Bilal Yavuz
Hi,

On Thu, 19 Dec 2024 at 08:50, Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Wed, Dec 18, 2024 at 11:55:11AM -0500, Andres Freund wrote:
> > Yea, i think it's fine to just call it unnecessarily. Particularly because 
> > we
> > do want to actually report IO stats for walwriter eventually.
>
> Yeah, better to spend time and energy on making it "necessary" instead: I'll
> try to have a look at adding IO stats for walwriter.

I have been working on showing all WAL stats in the pg_stat_io [1]
view but currently it is blocked because the size of the WAL read IO
may vary and it is not possible to show this on the pg_stat_io view.
I proposed a fix [2] for that (counting IOs as bytes instead of
blocks in pg_stat_io), which you already reviewed :).

[1] https://commitfest.postgresql.org/51/4950/
[2] https://commitfest.postgresql.org/51/5256/

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Removing the pgstat_flush_io() call from the walwriter

2024-12-19 Thread Bertrand Drouvot
Hi,

On Thu, Dec 19, 2024 at 05:50:43PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> On Thu, 19 Dec 2024 at 08:50, Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Wed, Dec 18, 2024 at 11:55:11AM -0500, Andres Freund wrote:
> > > Yea, i think it's fine to just call it unnecessarily. Particularly 
> > > because we
> > > do want to actually report IO stats for walwriter eventually.
> >
> > Yeah, better to spend time and energy on making it "necessary" instead: I'll
> > try to have a look at adding IO stats for walwriter.
> 
> I have been working on showing all WAL stats in the pg_stat_io [1]
> view

Oh nice, I did not realize that you were working on it, thanks for the
notification ;-) We'll have a look at it!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: JIT compilation per plan node

2024-12-19 Thread Matheus Alcantara
Hi,

It has been quite a while since the last email, but I was taking a look at
this thread and I think we can take another, similar approach for this
proposal.

Em ter., 14 de mai. de 2024 às 01:10, David Rowley
 escreveu:
> Currently, during execution, ExecCreateExprSetupSteps() traverses the
> Node tree of the Expr to figure out the max varattno of for each slot.
> That's done so all of the tuple deforming happens at once rather than
> incrementally. Figuring out the max varattno is a price we have to pay
> for every execution of the query. I think we'd be better off doing
> that in the planner.
>
> To do this, I thought that setrefs.c could do this processing in
> fix_join_expr / fix_upper_expr and wrap up the expression in a new
> Node type that stores the max varattno for each special var type.
>
> This idea is related to this discussion because another thing that
> could be stored in the very same struct is the "num_exec" value.   I
> feel the number of executions of an ExprState is a better gauge of how
> useful JIT will be than the cost of the plan node. Now, looking at
> set_join_references(), the execution estimates are not exactly
> perfect. For example;
>
The fix_join_expr and fix_upper_expr functions have since evolved so that
both now receive a "num_exec" parameter, which can be derived from
plan->plan_rows via the NUM_EXEC_TLIST macro. After a plan is created in
create_plan_recurse we have both plan_rows and total_cost filled in, so it
seems to me that all the information needed to decide whether a plan node
should be compiled is already present in the Plan struct?

If that's the case, I was thinking: what if we add a new jit field to the
Plan struct and fill in this value after a plan is created in
create_plan_recurse? This is similar in spirit to the first patch:

static void
plan_consider_jit(Plan *plan)
{
	plan->jit = false;

	if (jit_enabled)
	{
		Cost		total_cost;

		total_cost = plan->total_cost * plan->plan_rows;

		if (total_cost > jit_above_cost)
			plan->jit = true;
	}
}

Then, when jit_compile_expr is called, we can check whether JIT is enabled
for the ExprState->parent->plan node.

Does this make any sense? I'm not sure, but I've run some benchmarks.
I used the same test scenario as in [1], which is:

CREATE TABLE listp(a int, b int) PARTITION BY LIST(a);
SELECT 'CREATE TABLE listp'|| x || ' PARTITION OF listp FOR VALUES IN
('||x||');' FROM generate_Series(1,1000) x; \gexec
INSERT INTO listp SELECT 1,x FROM generate_series(1,1000) x;

EXPLAIN (VERBOSE, ANALYZE) SELECT COUNT(*) FROM listp WHERE b < 0;

Results:

master jit=off:
Planning Time: 59.457
Execution Time: 457.000

master jit=on:
Planning Time: 59.529 ms
JIT:
  Functions: 9008
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 99.267 ms (Deform 37.434 ms), Inlining 0.000 ms,
Optimization 309.079 ms, Emission 517.495 ms, Total 925.841 ms
Execution Time: 674.756 ms

patch jit=off
Planning Time: 60.906 ms
Execution Time: 453.978 ms

patch jit=on:
Planning Time: 67.625 ms
JIT:
  Functions: 17
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 0.502 ms (Deform 0.073 ms), Inlining 0.000 ms,
Optimization 0.998 ms, Emission 3.915 ms, Total 5.415 ms
Execution Time: 328.239 ms

Note that I used the jit_above_cost default value for both tests.

I executed the same EXPLAIN query 100 times on master and with the patch,
and all results were similar to the above.

jit=off
Master mean: 438.898
Patch mean: 436.036

jit=on
Master mean: 665.730
Patch mean: 347.758

I'm a bit unsure whether it's a good idea to add the jit field to the Plan
node, and I'm not certain this approach makes sense. Attached is a simple
patch that plays with the idea.

WYT?

[1] 
https://www.postgresql.org/message-id/cagpvpcsbn_-t3jvpmmhhsqns2nogo1tsbbyzng1cjggacqj...@mail.gmail.com

--
Matheus Alcantara
From 2577544cc8b93d9a08a872353e3d260c5999ccef Mon Sep 17 00:00:00 2001
From: Matheus Alcantara 
Date: Wed, 18 Dec 2024 16:54:01 -0300
Subject: [PATCH v1] JIT compilation per plan node

Introduce a new jit field in the Plan struct to make it possible to
decide whether a plan node should be compiled or not.

Previously, the cost of the entire query was used to decide whether it
should be compiled or not. The problem with that approach is that the
plan tree can contain expressions that are executed infrequently and so
are not worth compiling.

This commit changes JIT consideration from the plan level to per plan
node. The plan node's total cost and the number of times the node will
be executed are used to decide whether the plan node should be compiled
or not.
---
 src/backend/jit/jit.c   |  8 
 src/backend/optimizer/plan/createplan.c | 27 +
 src/backend/optimizer/plan/planner.c|  3 +--
 src/include/nodes/plannodes.h   |  2 ++
 4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/

Re: Fix bank selection logic in SLRU

2024-12-19 Thread Yura Sokolov

19.12.2024 15:10, Andrey M. Borodin wrote:
>> On 19 Dec 2024, at 15:37, Yura Sokolov  wrote:
>> 
>> So, there's no more than 8192 banks at the moment.
> 
> OK, but still current type indicates bitwise usage, while struct member is used
> as a number.


Ok, I agree. Here's a version with the type changed from bits16 to uint16.

From 4a16d64fe6eba042bfca7ff0b0d73ec8b749e6e0 Mon Sep 17 00:00:00 2001
From: Yura Sokolov 
Date: Tue, 10 Dec 2024 15:38:02 +0300
Subject: [PATCH v2] Fix SLRU bank selection.

Due to code evolution, nbanks is no longer a power of two, but a
binary & was still being used to obtain bankno.
---
 src/backend/access/transam/slru.c | 6 +++---
 src/include/access/slru.h | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index e7f73bf4275..afedb5c039f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -343,7 +343,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 	ctl->shared = shared;
 	ctl->sync_handler = sync_handler;
 	ctl->long_segment_names = long_segment_names;
-	ctl->bank_mask = (nslots / SLRU_BANK_SIZE) - 1;
+	ctl->nbanks = nbanks;
 	strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
 }
 
@@ -606,7 +606,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int64 pageno, TransactionId xid)
 {
 	SlruShared	shared = ctl->shared;
 	LWLock	   *banklock = SimpleLruGetBankLock(ctl, pageno);
-	int			bankno = pageno & ctl->bank_mask;
+	int			bankno = pageno % ctl->nbanks;
 	int			bankstart = bankno * SLRU_BANK_SIZE;
 	int			bankend = bankstart + SLRU_BANK_SIZE;
 
@@ -1177,7 +1177,7 @@ SlruSelectLRUPage(SlruCtl ctl, int64 pageno)
 		int			bestinvalidslot = 0;	/* keep compiler quiet */
 		int			best_invalid_delta = -1;
 		int64		best_invalid_page_number = 0;	/* keep compiler quiet */
-		int			bankno = pageno & ctl->bank_mask;
+		int			bankno = pageno % ctl->nbanks;
 		int			bankstart = bankno * SLRU_BANK_SIZE;
 		int			bankend = bankstart + SLRU_BANK_SIZE;
 
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 97e612cd100..648e202ff54 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -129,9 +129,9 @@ typedef struct SlruCtlData
 	SlruShared	shared;
 
 	/*
-	 * Bitmask to determine bank number from page number.
+	 * Number of banks to determine bank number from page number.
 	 */
-	bits16		bank_mask;
+	uint16		nbanks;
 
 	/*
 	 * If true, use long segment file names.  Otherwise, use short file names.
@@ -179,7 +179,7 @@ SimpleLruGetBankLock(SlruCtl ctl, int64 pageno)
 {
 	int			bankno;
 
-	bankno = pageno & ctl->bank_mask;
+	bankno = pageno % ctl->nbanks;
 	return &(ctl->shared->bank_locks[bankno].lock);
 }
 
-- 
2.43.0
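The failure mode being fixed is easy to see in isolation: when nbanks is not a power of two, masking with nbanks - 1 can only ever produce bank numbers whose set bits are a subset of the mask, so some banks are never selected, while the modulo reaches all of them. A standalone illustration (not PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: bitwise AND, correct only when nbanks is a power of two. */
static int
bankno_mask(int64_t pageno, uint16_t nbanks)
{
	return (int) (pageno & (nbanks - 1));
}

/* Fixed scheme: plain modulo, correct for any nbanks > 0. */
static int
bankno_mod(int64_t pageno, uint16_t nbanks)
{
	return (int) (pageno % nbanks);
}
```

With nbanks = 6 the "mask" is 5 (binary 101), so the AND can only yield 0, 1, 4, or 5; banks 2 and 3 are never used, which matches the report that in the worst case only a subset of the banks carries all the traffic.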



Re: Parametrization minimum password lenght

2024-12-19 Thread Nathan Bossart
On Thu, Dec 19, 2024 at 09:36:17AM -0600, Nathan Bossart wrote:
> On Thu, Dec 19, 2024 at 07:25:30AM +, Bertrand Drouvot wrote:
>> - errmsg("password is too short")));
>> + errmsg("password is too short: %d (< %d)", pwdlen, min_password_length)));
> 
> I think we should explicitly point to the parameter and note its current
> value.

Like so.

-- 
nathan
>From ce67f3ff5fca47b443195f5bd1932429294bdbfb Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 18 Dec 2024 14:49:20 -0600
Subject: [PATCH v11 1/1] Add passwordcheck.min_password_length.

This new parameter can be used to change the minimum allowed
password length (in bytes).  Note that it has no effect if a user
supplies a pre-encrypted password.

Author: Emanuele Musella, Maurizio Boriani
Reviewed-by: Tomas Vondra, Bertrand Drouvot
Discussion: 
https://postgr.es/m/CA%2BugDNyYtHOtWCqVD3YkSVYDWD_1fO8Jm_ahsDGA5dXhbDPwrQ%40mail.gmail.com
---
 .../passwordcheck/expected/passwordcheck.out  |  4 ++
 .../expected/passwordcheck_1.out  |  4 ++
 contrib/passwordcheck/passwordcheck.c | 26 ++---
 contrib/passwordcheck/sql/passwordcheck.sql   |  4 ++
 doc/src/sgml/passwordcheck.sgml   | 37 +++
 5 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/contrib/passwordcheck/expected/passwordcheck.out 
b/contrib/passwordcheck/expected/passwordcheck.out
index dfb2ccfe00..83472c76d2 100644
--- a/contrib/passwordcheck/expected/passwordcheck.out
+++ b/contrib/passwordcheck/expected/passwordcheck.out
@@ -6,6 +6,10 @@ ALTER USER regress_passwordcheck_user1 PASSWORD 
'a_nice_long_password';
 -- error: too short
 ALTER USER regress_passwordcheck_user1 PASSWORD 'tooshrt';
 ERROR:  password is too short
+DETAIL:  password must be at least "passwordcheck.min_password_length" (8) bytes long
+-- ok
+SET passwordcheck.min_password_length = 6;
+ALTER USER regress_passwordcheck_user1 PASSWORD 'v_shrt';
 -- error: contains user name
 ALTER USER regress_passwordcheck_user1 PASSWORD 
'xyzregress_passwordcheck_user1';
 ERROR:  password must not contain user name
diff --git a/contrib/passwordcheck/expected/passwordcheck_1.out 
b/contrib/passwordcheck/expected/passwordcheck_1.out
index 9519d60a49..fb12ec45cc 100644
--- a/contrib/passwordcheck/expected/passwordcheck_1.out
+++ b/contrib/passwordcheck/expected/passwordcheck_1.out
@@ -6,6 +6,10 @@ ALTER USER regress_passwordcheck_user1 PASSWORD 
'a_nice_long_password';
 -- error: too short
 ALTER USER regress_passwordcheck_user1 PASSWORD 'tooshrt';
 ERROR:  password is too short
+DETAIL:  password must be at least "passwordcheck.min_password_length" (8) bytes long
+-- ok
+SET passwordcheck.min_password_length = 6;
+ALTER USER regress_passwordcheck_user1 PASSWORD 'v_shrt';
 -- error: contains user name
 ALTER USER regress_passwordcheck_user1 PASSWORD 
'xyzregress_passwordcheck_user1';
 ERROR:  password must not contain user name
diff --git a/contrib/passwordcheck/passwordcheck.c 
b/contrib/passwordcheck/passwordcheck.c
index 0785618f2a..d79ecb00b6 100644
--- a/contrib/passwordcheck/passwordcheck.c
+++ b/contrib/passwordcheck/passwordcheck.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include 
+#include 
 
 #ifdef USE_CRACKLIB
 #include 
@@ -26,11 +27,11 @@
 
 PG_MODULE_MAGIC;
 
-/* Saved hook value in case of unload */
+/* Original hook */
 static check_password_hook_type prev_check_password_hook = NULL;
 
-/* passwords shorter than this will be rejected */
-#define MIN_PWD_LENGTH 8
+/* GUC variables */
+static int min_password_length = 8;
 
 /*
  * check_password
@@ -93,10 +94,12 @@ check_password(const char *username,
 #endif
 
/* enforce minimum length */
-   if (pwdlen < MIN_PWD_LENGTH)
+   if (pwdlen < min_password_length)
ereport(ERROR,

(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-errmsg("password is too short")));
+errmsg("password is too short"),
+errdetail("password must be at least 
\"passwordcheck.min_password_length\" (%d) bytes long",
+  
min_password_length)));
 
/* check if the password contains the username */
if (strstr(password, username))
@@ -142,6 +145,19 @@ check_password(const char *username,
 void
 _PG_init(void)
 {
+   /* Define custom GUC variables. */
+   DefineCustomIntVariable("passwordcheck.min_password_length",
+   "Minimum allowed password length.",
+   NULL,
+   &min_password_length,
+   8,
+  

Re: Add CASEFOLD() function.

2024-12-19 Thread Jeff Davis
On Thu, 2024-12-19 at 17:18 +0100, Peter Eisentraut wrote:
> Can you explain this in further detail?  I don't quite follow why
> this 
> would be required.

I am unsure now.

My initial reasoning was based on the idea that users would want to use
CASEFOLD(t) in a unique expression index as an improvement over
LOWER(t). And if you do that, you'd be surprised if some equivalent
strings ended up in the index. I don't think that's a huge problem,
because in other contexts we leave it up to the user to keep things
normalized consistently, and a CHECK(t IS NFC NORMALIZED) is a good way
to do that.

But there's a problem: full case folding doesn't preserve the normal
form, so even if the input is NFC normalized, the output might not be.
If we solve this problem, then we can just say that CASEFOLD()
preserves the normal form, consistently with how the spec defines
LOWER()/UPPER(), and I think that would be the best outcome.

I'm not sure if that problem is solvable, though, because what if the
input string is in both NFC and NFD, how do we know which normal form
to preserve?

We could tell users to use an expression index on
NORMALIZE(CASEFOLD(t)) instead, but that feels like inefficient
boilerplate.

> 
> Another might be that's not entirely clear how this should work in 
> encodings other than UTF-8.  For example, the normalized string might
> not be representable in the encoding.

That's a good point.

Regards,
Jeff Davis





Re: Doc: clarify the log message level of the VERBOSE option

2024-12-19 Thread Fujii Masao



On 2024/12/05 16:48, Masahiro Ikeda wrote:

On 2024-12-05 16:23, Yugo NAGATA wrote:


-  Prints a progress report as each table is clustered.
+  Prints a progress report as each table is clustered,
+  at INFO level.


I feel like the comma might not be necessary here, as in the fix for
the ANALYZE document, but it might be better to wait for a review
from a native English speaker.


Thanks. OK, I'll wait.


I'm not a native English speaker, but barring any objection,
I'd like to commit the attached patch. In this patch,
I've just removed a comma and updated the commit log message.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From a05585f5362f22a3272c9308868f215995a281d1 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Fri, 20 Dec 2024 02:02:28 +0900
Subject: [PATCH v3] doc: Clarify log level for VERBOSE messages in maintenance
 commands.

VERBOSE messages from ANALYZE, CLUSTER, REINDEX, and VACUUM are logged
at the INFO level, but this detail was missing from the documentation.
This commit updates the docs to mention the log level for these messages.

Author: Masahiro Ikeda
Reviewed-by: Yugo Nagata
Discussion: https://postgr.es/m/b4a4b7916982dccd9607c8efb3ce5...@oss.nttdata.com
---
 doc/src/sgml/ref/analyze.sgml | 2 +-
 doc/src/sgml/ref/cluster.sgml | 3 ++-
 doc/src/sgml/ref/reindex.sgml | 3 ++-
 doc/src/sgml/ref/vacuum.sgml  | 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml
index a0db56ae74..ec81f00fec 100644
--- a/doc/src/sgml/ref/analyze.sgml
+++ b/doc/src/sgml/ref/analyze.sgml
@@ -65,7 +65,7 @@ ANALYZE [ ( option [, ...] ) ] [ VERBOSE
 
  
-  Enables display of progress messages.
+  Enables display of progress messages at INFO level.
  
 

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index c5760244e6..8811f169ea 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -106,7 +106,8 @@ CLUSTER [ ( option [, ...] ) ] [ VERBOSE
 
  
-  Prints a progress report as each table is clustered.
+  Prints a progress report as each table is clustered
+  at INFO level.
  
 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index dcf70d14bc..5b3c769800 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -194,7 +194,8 @@ REINDEX [ ( option [, ...] ) ] { DA
 VERBOSE
 
  
-  Prints a progress report as each index is reindexed.
+  Prints a progress report as each index is reindexed
+  at INFO level.
  
 

diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9110938fab..971b1237d4 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -129,7 +129,8 @@ VACUUM [ ( option [, ...] ) ] [ VERBOSE
 
  
-  Prints a detailed vacuum activity report for each table.
+  Prints a detailed vacuum activity report for each table
+  at INFO level.
  
 

-- 
2.47.0



Re: Fix bank selection logic in SLRU

2024-12-19 Thread Yura Sokolov

10.12.2024 17:07, Dilip Kumar wrote:
On Tue, 10 Dec 2024 at 6:32 PM, Andrey M. Borodin wrote:




 > On 10 Dec 2024, at 15:39, Yura Sokolov <y.soko...@postgrespro.ru> wrote:
 >
 > It is not critical bug, since it doesn't hurt correctness just
performance. In worst case only one bank will be used.

Ugh... yeah. IMO the problem is that we do not have protection that
rejects values that are not a power of 2.
If other values are given, the system operates as if there are
2^(popcount(n)-1) banks. So if we just round the value down to the
nearest power of 2, we will help incorrectly configured systems use a
proper amount of memory and keep the performance of properly configured
systems.


+1



IMO doing modulo is not necessary. And hash function is pure waste
of CPU cycles.


I agree


I did some measurements of "divide-modulo" vs "modulo using multiplication by
reciprocal" vs "simple binary and" using a simple C program [0].
(Note: the loop is made dependent on the previous iteration's result, so no
parallel computation happens.)

Results on Ryzen 7 5825U:

$ time ./div 1 15 3 # binary and
real0m0,943s
$ time ./div 1 15 1 # multiply by reciprocal
real0m3,123s
$ time ./div 1 15 0 # just plain `%`
real0m4,540s

It means:
- `&` takes 0.69ns
- `mult-rec` takes 2.94ns
- `%` takes 3.24ns.

I believe that, compared to the further memory accesses, it can be counted
as negligible.

(Certainly, it could be worse on some older processors. But I still doubt
it will be measurably worse on top of all other things SLRU does.)

[0] https://gist.github.com/funny-falcon/173923b4fea7ffdf9e02595a0f99aa74


Regards,
Yura Sokolov aka funny-falcon
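The linked gist is not reproduced here, but a latency benchmark of this kind is roughly of the following shape (a reconstruction under stated assumptions, not the exact program): each iteration feeds on the previous result, so the CPU cannot overlap successive divisions, and the measured time approximates per-operation latency rather than throughput.

```c
#include <assert.h>
#include <stdint.h>

/* Serially-dependent loop using '%': each step needs the previous x. */
static uint64_t
chase_mod(uint64_t x, uint64_t nbanks, int iters)
{
	for (int i = 0; i < iters; i++)
		x = x * 6364136223846793005ULL + (x % nbanks) + 1;
	return x;
}

/* Same loop using '&': valid only when nbanks is a power of two. */
static uint64_t
chase_and(uint64_t x, uint64_t nbanks, int iters)
{
	for (int i = 0; i < iters; i++)
		x = x * 6364136223846793005ULL + (x & (nbanks - 1)) + 1;
	return x;
}
```

Timing each function over many iterations and dividing by the iteration count gives the per-operation latencies quoted above; for a power-of-two bank count the two loops compute identical values, which makes the comparison apples-to-apples.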




Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-19 Thread Ranier Vilela
On Wed, Dec 18, 2024 at 8:18 PM, Melanie Plageman <melanieplage...@gmail.com> wrote:

> On Wed, Dec 18, 2024 at 3:13 PM Melanie Plageman
>  wrote:
> >
> > option 1:
> > most straightforward fix
> >
> > diff --git a/src/backend/access/heap/heapam_handler.c
> > b/src/backend/access/heap/heapam_handler.c
> > index d0e5922eed7..fa03bedd4b8 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2577,6 +2577,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
> >
> > if (scan->rs_flags & SO_ALLOW_PAGEMODE)
> > {
> > +   int start, end;
> > +
> > +   if (hscan->rs_ntuples == 0)
> > +   return false;
> > +
> > /*
> >  * In pageatatime mode, heap_prepare_pagescan() already did
> visibility
> >  * checks, so just look at the info it left in rs_vistuples[].
> > @@ -2586,8 +2591,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
> >  * in increasing order, but it's not clear that there would be
> enough
> >  * gain to justify the restriction.
> >  */
> > -   int start = 0,
> > -   end = hscan->rs_ntuples - 1;
> > +   start = 0;
> > +   end = hscan->rs_ntuples - 1;
> >
> > while (start <= end)
> > {
> >
> > option 2:
> > change the binary search code a bit more but avoid the extra branch
> > introduced by option 1.
> >
> > diff --git a/src/backend/access/heap/heapam_handler.c
> > b/src/backend/access/heap/heapam_handler.c
> > index d0e5922eed7..c8e25bdd197 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -2586,18 +2586,17 @@ SampleHeapTupleVisible(TableScanDesc scan,
> > Buffer buffer,
> >  * in increasing order, but it's not clear that there would be
> enough
> >  * gain to justify the restriction.
> >  */
> > -   int start = 0,
> > -   end = hscan->rs_ntuples - 1;
> > +   uint32 start = 0, end = hscan->rs_ntuples;
> >
> > -   while (start <= end)
> > +   while (start < end)
> > {
> > -   int mid = (start + end) / 2;
> > +   int mid = (start + end) / 2;
> > OffsetNumber curoffset = hscan->rs_vistuples[mid];
> >
> > if (tupoffset == curoffset)
> > return true;
> > else if (tupoffset < curoffset)
> > -   end = mid - 1;
> > +   end = mid;
> > else
> > start = mid + 1;
> > }
>
> I pushed the straightforward option for now so that it's fixed.
>
How about:

diff --git a/src/backend/access/heap/heapam_handler.c
b/src/backend/access/heap/heapam_handler.c
index 2da4e4da13..1620f0c3db 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2574,7 +2574,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
buffer,

  if (scan->rs_flags & SO_ALLOW_PAGEMODE)
  {
- uint32 start,
+ int64 start,
  end;

  if (hscan->rs_ntuples == 0)
@@ -2594,7 +2594,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
buffer,

  while (start <= end)
  {
- uint32 mid = (start + end) / 2;
+ int64 mid = (start + end) >> 1;
  OffsetNumber curoffset = hscan->rs_vistuples[mid];

  if (tupoffset == curoffset)


best regards
Ranier Vilela

>
> - Melanie
>
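For context, the hazard these variants guard against is the combination of an unsigned index type with the closed-interval search: end = mid - 1 wraps around to a huge value when mid is 0, so start <= end stays true and the next probe reads out of bounds. With signed indexes the classic shape is safe, as in this standalone sketch (illustrative code, not the PostgreSQL function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Closed-interval binary search with signed indexes: "end" may
 * legitimately become -1, and "start <= end" then terminates the loop.
 * With uint32 indexes in the same shape, "end = mid - 1" at mid == 0
 * would wrap to UINT32_MAX and the loop would keep running.
 */
static bool
offset_is_visible(const uint16_t *vistuples, int ntuples, uint16_t tupoffset)
{
	int			start = 0,
				end = ntuples - 1;

	while (start <= end)
	{
		int			mid = (start + end) / 2;

		if (tupoffset == vistuples[mid])
			return true;
		else if (tupoffset < vistuples[mid])
			end = mid - 1;		/* may become -1: fine for a signed int */
		else
			start = mid + 1;
	}
	return false;
}
```

Note that ntuples == 0 is also harmless here: end starts at -1 and the loop body never runs.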


Re: [PROPOSAL] : Disallow use of empty column name in (column_name '') in ALTER or CREATE of foreign table.

2024-12-19 Thread Robert Haas
On Thu, Dec 19, 2024 at 7:08 AM Nishant Sharma
 wrote:
> Thanks Robert for your response and the review comments!
>
> I have addressed both your suggestions, by changing the error
> message and merging both the patches to one.
>
> PFA, patch set v4.

Cool. I don't want to commit this right before my vacation but I'll
look into it in the new year if nobody beats me to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Separate GUC for replication origins

2024-12-19 Thread Peter Eisentraut

On 10.12.24 19:41, Euler Taveira wrote:
I'm attaching a patch that adds max_replication_origins. It basically
replaces all of the points that refer to max_replication_slots on the
subscriber. It uses the same default value as max_replication_slots (10).
I did nothing to keep backward compatibility; I mean, having a special
value (like -1) that means the same value as max_replication_slots. Is it
worth it? (If not, it should be mentioned in the commit message.)


I think some backward compatibility tweak like that would be useful. 
Otherwise, the net effect of this is that most people will have to do 
one more thing than before to keep their systems working.  Also, all 
deployment and automation scripts would have to be updated for this.






Re: Change GUC hashtable to use simplehash?

2024-12-19 Thread John Naylor
I wrote:

> The offending code is not even my preferred way to handle the last
> word of the string (see f4ad0021af), so if the current way is still
> not valgrind-clean, I wonder if we should give up and add an
> exception, since we know any garbage bits are masked off.

That would actually be a maintenance headache because the function is
inlined, but here's a better idea: We already have a fallback path for
when the string is not suitably aligned, or in 32-bit builds. We could
just use that under Valgrind:

 static inline size_t
 fasthash_accum_cstring(fasthash_state *hs, const char *str)
 {
-#if SIZEOF_VOID_P >= 8
+#if SIZEOF_VOID_P >= 8 && !defined(USE_VALGRIND)

Any objections?

-- 
John Naylor
Amazon Web Services




Re: Fix bank selection logic in SLRU

2024-12-19 Thread Andrey Borodin



> On 19 Dec 2024, at 15:01, Yura Sokolov  wrote:
> 
> - `&` takes 0.69ns
> - `mult-rec` takes 2.94ns
> - `%` takes 3.24ns.

Thanks, Yura, for benchmarks and off-list conversation.
I’ve reproduced similar numbers on my Apple M2.
I agree that additional 3-4ns are negligible in case of SLRU access.



+   bits16  nbanks;

Perhaps it’s not bits anymore. Also, ought 64K banks to be enough for everybody?


Best regards, Andrey Borodin.



Re: Vacuum statistics

2024-12-19 Thread Alena Rybakina

Hi!

On 02.12.2024 17:46, Ilia Evdokimov wrote:
In my opinion, the patches are semantically correct. However, not all 
dead code has been removed - I'm referring to 
pgstat_update_snapshot(). Also, the tests need to be fixed.




I fixed it [0]. Thank you!

[0] 
https://www.postgresql.org/message-id/86f76aa5-1ab5-4e2e-9b15-405051852a2a%40postgrespro.ru


--
Regards,
Alena Rybakina
Postgres Professional





Re: Fix bank selection logic in SLRU

2024-12-19 Thread Andrey M. Borodin



> On 19 Dec 2024, at 15:37, Yura Sokolov  wrote:
> 
> So, there's no more than 8192 banks at the moment.

OK, but the current type still indicates bitwise usage, while the struct member
is used as a number.


Best regards, Andrey Borodin.



speedup COPY TO for partitioned table.

2024-12-19 Thread jian he
hi.

COPY (select_query) is generally slower than
table_beginscan .. table_scan_getnextslot .. table_endscan,
especially for partitioned tables.
So in the function DoCopyTo, this patch tries to use
table_beginscan .. table_scan_getnextslot .. table_endscan
for COPY TO when the source table is a partitioned table.

setup-
CREATE TABLE t3 (a INT, b int ) PARTITION BY RANGE (a);
create table t3_1 partition of t3 for values from (1) to (11);
create table t3_2 partition of t3 for values from (11) to (15);
insert into t3 select g from generate_series(1, 3) g;
insert into t3 select g from generate_series(11, 11) g;

so now you can do:
copy t3 to stdout;

in the master, you will get:
ERROR:  cannot copy from partitioned table "t3"
HINT:  Try the COPY (SELECT ...) TO variant.


attached copy_par_regress_test.sql is a simple benchmark sql file,
a partitioned table with 10 partitions, 2 levels of indirection.
The simple benchmark shows around 7.7% improvement in my local environment.

local environment:
PostgreSQL 18devel_debug_build_382092a0cd on x86_64-linux, compiled by
gcc-14.1.0, 64-bit
From ba2307cca8bd1d53e0febddaf11c932dae5d31e0 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 19 Dec 2024 19:48:12 +0800
Subject: [PATCH v1 1/1] speedup COPY TO for partitioned table.

COPY (select_query) is generally slower than
table_beginscan .. table_scan_getnextslot .. table_endscan,
especially for partitioned tables.

So use table_beginscan .. table_scan_getnextslot .. table_endscan
for COPY TO when the source table is a partitioned table.

environment:
PostgreSQL 18devel_debug_build_382092a0cd on x86_64-linux, compiled by gcc-14.1.0, 64-bit
shows around 7.7% improvement.
---
 src/backend/commands/copyto.c | 69 +++
 1 file changed, 61 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index 161a0f8b0a..42dba2d4a8 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -19,6 +19,8 @@
 #include 
 
 #include "access/tableam.h"
+#include "access/table.h"
+#include "catalog/pg_inherits.h"
 #include "commands/copy.h"
 #include "commands/progress.h"
 #include "executor/execdesc.h"
@@ -79,6 +81,7 @@ typedef struct CopyToStateData
 	List	   *attnumlist;		/* integer list of attnums to copy */
 	char	   *filename;		/* filename, or NULL for STDOUT */
 	bool		is_program;		/* is 'filename' a program to popen? */
+	bool		is_partitioned;	/* is the COPY source relation a partitioned table? */
 	copy_data_dest_cb data_dest_cb; /* function for writing data */
 
 	CopyFormatOptions opts;
@@ -396,13 +399,7 @@ BeginCopyTo(ParseState *pstate,
 	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 	 errmsg("cannot copy from sequence \"%s\"",
 			RelationGetRelationName(rel;
-		else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-			ereport(ERROR,
-	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-	 errmsg("cannot copy from partitioned table \"%s\"",
-			RelationGetRelationName(rel)),
-	 errhint("Try the COPY (SELECT ...) TO variant.")));
-		else
+		else if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
 			ereport(ERROR,
 	(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 	 errmsg("cannot copy from non-table relation \"%s\"",
@@ -426,6 +423,7 @@ BeginCopyTo(ParseState *pstate,
 	/* Extract options from the statement node tree */
 	ProcessCopyOptions(pstate, &cstate->opts, false /* is_from */ , options);
 
+	cstate->is_partitioned = false;
 	/* Process the source/target relation or query */
 	if (rel)
 	{
@@ -433,6 +431,8 @@ BeginCopyTo(ParseState *pstate,
 
 		cstate->rel = rel;
 
+		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			cstate->is_partitioned = true;
 		tupDesc = RelationGetDescr(cstate->rel);
 	}
 	else
@@ -847,7 +847,60 @@ DoCopyTo(CopyToState cstate)
 		}
 	}
 
-	if (cstate->rel)
+	/*
+	 * if COPY TO source table is a partitioned table, then open each
+	 * partition and process each individual partition.
+	 */
+	if (cstate->rel && cstate->is_partitioned)
+	{
+		List	   *children = NIL;
+		List	   *scan_oids = NIL;
+
+		processed = 0;
+		children = find_all_inheritors(RelationGetRelid(cstate->rel),
+		 AccessShareLock,
+		 NULL);
+
+		foreach_oid(childreloid, children)
+		{
+			char		relkind = get_rel_relkind(childreloid);
+
+			if (RELKIND_HAS_PARTITIONS(relkind))
+continue;
+
+			scan_oids = lappend_oid(scan_oids, childreloid);
+		}
+
+		foreach_oid(scan_oid, scan_oids)
+		{
+			TupleTableSlot *slot;
+			TableScanDesc scandesc;
+			Relation		scan_rel;
+
+			scan_rel = table_open(scan_oid, AccessShareLock);
+			scandesc = table_beginscan(scan_rel, GetActiveSnapshot(), 0, NULL);
+			slot = table_slot_create(scan_rel, NULL);
+
+			while (table_scan_getnextslot(scandesc, ForwardScanDirection, slot))
+			{
+CHECK_FOR_INTERRUPTS();
+
+/* Deconstruct the tuple ... */
+slot_getallattrs(slot);
+
+/* Format and send the data */
+CopyOneRowTo(cstate, slot);
+
+pgstat_progress_update_pa

Re: [PROPOSAL] : Disallow use of empty column name in (column_name '') in ALTER or CREATE of foreign table.

2024-12-19 Thread Nishant Sharma
On Tue, Dec 17, 2024 at 10:12 PM Robert Haas  wrote:

> On Wed, Oct 9, 2024 at 7:12 AM Nishant Sharma
>  wrote:
> > I have included them in v3. v3 does not allow empty schema_name &
> > table_name options along with column_name.
>
> I was looking at these patches today and trying to understand whether
> the proposed error message is consistent with what we have done
> elsewhere.
>
> The proposed error message was "cannot use empty value for option
> \"%s\"". I looked for error messages that contained the phrase "empty"
> or "zero". I did not find a case exactly like this one. The closest
> analogues I found were things like this:
>
> invalid cursor name: must not be empty
> output cannot be empty string
> DSM segment name cannot be empty
> row path filter must not be empty string
>
> These messages aren't quite as consistent as one might wish, so it's a
> little hard to judge here. Nonetheless, I propose that the error
> message we use here should end with either "cannot" or "must not"
> followed by either "be empty" or "be empty string". I think my
> preferred variant would be "value for option \"%s\" must not be empty
> string". But there's certainly oodles of room for bikeshedding.
>
> Apart from that, I see very little to complain about here. Once we
> agree on the error message text I think something can be committed.
> For everyone's convenience, I suggest merging the two patches into
> one. I highly approve of separating patches by topic, but there's
> usually no point in separating them by directory.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com


Thanks Robert for your response and the review comments!

I have addressed both your suggestions, by changing the error
message and merging both the patches to one.

PFA, patch set v4.

Regards,
Nishant.


v4-0001-Disallow-empty-Foreign-Table-column_name-schema_n.patch
Description: Binary data


Re: log_min_messages per backend type

2024-12-19 Thread Euler Taveira
On Thu, Dec 19, 2024, at 1:13 AM, Kirill Reshke wrote:
> Hi! Looks like a sane proposal. Correct me if I'm wrong, but with your
> patch it is not possible to configure something like "everybody ERROR,
> but autovac DEBUG5"? I mean,
> set log_min_messages to 'ERROR:default, DEBUG5:autovacuum' (not
> specifying all possible cases?)
> 
No. Good point. I forgot to mention that ALL as backend type could be added.


--
Euler Taveira
EDB   https://www.enterprisedb.com/
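A value like 'ERROR:default, DEBUG5:autovacuum' would presumably be parsed into (level, backend-type) pairs. A minimal sketch of splitting one entry, under the assumption that the syntax is level:backend_type (the function name and exact syntax are hypothetical; the real patch would hook into PostgreSQL's GUC machinery and its list of backend types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Split one "level:backend_type" entry into its two parts.  Hypothetical
 * helper; a real implementation would also validate both halves against
 * the known log levels and backend types.
 */
static bool
split_log_entry(const char *entry, char *level, size_t levelsz,
				char *btype, size_t btypesz)
{
	const char *colon = strchr(entry, ':');

	if (colon == NULL || colon == entry || colon[1] == '\0')
		return false;			/* both halves must be non-empty */
	snprintf(level, levelsz, "%.*s", (int) (colon - entry), entry);
	snprintf(btype, btypesz, "%s", colon + 1);
	return true;
}
```

An entry without a colon (plain "error") would then be the place to apply the proposed ALL/default backend type.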


Re: New "single" COPY format

2024-12-19 Thread Andrew Dunstan



On 2024-12-16 Mo 10:09 AM, Joel Jacobson wrote:

Hi hackers,

After further consideration, I'm withdrawing the patch.
Some fundamental questions remain unresolved:

- Should round-trip fidelity be a strict goal? By "round-trip fidelity",
   I mean that data exported and then re-imported should yield exactly
   the original values, including the distinction between NULL and empty 
strings.
- If round-trip fidelity is a requirement, how do we distinguish NULL from empty
   strings without delimiters or escapes?
- Is automatic newline detection (as in "csv" and "text") more valuable than
   the ability to embed \r (CR) characters?
- Would it be better to extend the existing COPY options rather than introducing
   a new format?
- Or should we consider a JSONL format instead, one that avoids the NULL/empty
   string problem entirely?

No clear solution or consensus has emerged. For now, I'll step back from the
proposal. If someone wants to revisit this later, I'd be happy to contribute.

Thanks again for all the feedback and consideration.



We seem to have got seriously into the weeds here. I'd be sorry to see 
this dropped. After all, it's not something new, and while we have a 
sort of workaround for "one json doc per line", it's far from obvious 
and, except in a few blog posts, undocumented.


I think we're trying to be far too general here but in the absence of 
more general use cases. The ones I recall having encountered in the wild 
are:


  . one json datum per line

  . one json document per file

  . a sequence of json documents per file

The last one is hard to deal with, and I think I've only seen it once or 
twice, so I suggest leaving it aside for now.


Notice these are all JSON. I could imagine XML might have similar 
requirements, but I encounter it extremely rarely.


Regarding NULL, an empty string is not a valid JSON literal, so there 
should be no confusion there. It is valid for XML, though.


Given all that I think restricting ourselves to just the JSON cases, and 
possibly just to JSONL, would be perfectly reasonable.


Regarding CR, it's not a valid character in a JSON string item, although 
it is valid in JSON whitespace. I would not treat it as magical unless 
it immediately precedes an NL. That gives rise to a very slight 
ambiguity, but I think it's one we could live with.


As for what the format is called, I don't like the "LIST" proposal much, 
even for the general case. Seems too close to an array.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-19 Thread Ranier Vilela
Hi.

Sorry for not responding quickly.
I have been without communication until now.

On Wed, Dec 18, 2024 at 5:13 PM, Melanie Plageman <melanieplage...@gmail.com> wrote:

> On Wed, Dec 18, 2024 at 1:23 PM Ranier Vilela  wrote:
> >
> > Hi.
> >
> > On Wed, Dec 18, 2024 at 2:01 PM, Melanie Plageman <melanieplage...@gmail.com> wrote:
> >>
> >> For now, I've committed the version of the patch I proposed above (v3).
> >
> > What happens if *rs_ntuples* is equal to zero in the function
> > *SampleHeapTupleVisible*?
> > end = hscan->rs_ntuples - 1;
>
> Ah yes, it seems like just changing the top if statement to
> if (scan->rs_flags & SO_ALLOW_PAGEMODE &&
> hscan->rs_ntuples > 0)
>
> Thanks for identifying this.
>
> > Would be good to fix the binary search too, now that unsigned types are
> used.
>
> You just mean the one in SampleHeapTupleVisible(), right?
>
> > Patch attached.
>
> I'm not sure the attached patch is quite right because if rs_ntuples
> is 0, it should still enter the first if statement and then return
> false. In fact, with your patch, I think we would incorrectly not
> return a value when rs_ntuples is 0 from SampleHeapTupleVisible().
>
I'm wondering whether, if *rs_ntuples* is zero, the second check,
*HeapTupleSatisfiesVisibility*, would be run.


> How about one of these options:
>
> option 1:
> most straightforward fix
>
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index d0e5922eed7..fa03bedd4b8 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2577,6 +2577,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
>
> if (scan->rs_flags & SO_ALLOW_PAGEMODE)
> {
> +   int start, end;
> +
> +   if (hscan->rs_ntuples == 0)
> +   return false;
> +
> /*
>  * In pageatatime mode, heap_prepare_pagescan() already did
> visibility
>  * checks, so just look at the info it left in rs_vistuples[].
> @@ -2586,8 +2591,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
>  * in increasing order, but it's not clear that there would be
> enough
>  * gain to justify the restriction.
>  */
> -   int start = 0,
> -   end = hscan->rs_ntuples - 1;
> +   start = 0;
> +   end = hscan->rs_ntuples - 1;
>
> while (start <= end)
> {
>
> option 2:
> change the binary search code a bit more but avoid the extra branch
> introduced by option 1.
>
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index d0e5922eed7..c8e25bdd197 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2586,18 +2586,17 @@ SampleHeapTupleVisible(TableScanDesc scan,
> Buffer buffer,
>  * in increasing order, but it's not clear that there would be
> enough
>  * gain to justify the restriction.
>  */
> -   int start = 0,
> -   end = hscan->rs_ntuples - 1;
> +   uint32 start = 0, end = hscan->rs_ntuples;
>
> -   while (start <= end)
> +   while (start < end)
> {
> -   int mid = (start + end) / 2;
> +   int mid = (start + end) / 2;
> OffsetNumber curoffset = hscan->rs_vistuples[mid];
>
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> -   end = mid - 1;
> +   end = mid;
> else
> start = mid + 1;
> }
>

I'm looking at the commit and the replies in the thread.

best regards,
Ranier Vilela


Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-19 Thread Ranier Vilela
On Wed, Dec 18, 2024 at 11:50 PM, Richard Guo 
wrote:

> On Thu, Dec 19, 2024 at 8:18 AM Melanie Plageman
>  wrote:
> > I pushed the straightforward option for now so that it's fixed.
>
> I think this binary search code now has a risk of underflow.  If 'mid'
> is calculated as zero, the second 'if' branch will cause 'end' to
> underflow.
>
> Maybe we need to do something like below.
>
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2600,7 +2600,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer
> buffer,
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> +   {
> +   if (mid == 0)
> +   return false;
> end = mid - 1;
> +   }
> else
> start = mid + 1;
> }
>
> Alternatively, we can revert 'start' and 'end' to signed int as they
> were before.
>
How would it be *signed*?
Wouldn't overflow happen in that case?
rs_ntuples can now be as large as

UINT_MAX   = 4294967295

best regards,
Ranier Vilela


Re: [PATCH] Add support for displaying database service in psql prompt

2024-12-19 Thread Michael Banck
Hi,

On Wed, Dec 18, 2024 at 03:17:36PM +0900, Michael Paquier wrote:
> On Tue, Dec 17, 2024 at 09:42:36AM +0100, Michael Banck wrote:
> > Done.
> > 
> > V3 attached.
> 
> Done.

Thanks!


Michael




Re: pure parsers and reentrant scanners

2024-12-19 Thread Peter Eisentraut

On 17.12.24 01:46, Andreas Karlsson wrote:

On 12/16/24 8:39 AM, Peter Eisentraut wrote:

I'll leave it at this for now and wait for some reviews.


I really like this work since it makes the code cleaner to read on top 
of paving the way for threading.


Reviewed the patches and found a couple of issues.

- Shouldn't yyext in syncrep_scanner_init() be allocated on the heap? Or 
at least on the stack but by the caller?


I think it's correct the way it is.  It's only a temporary space for the 
scanner, so we can allocate it in the innermost scope.


- I think you have flipped the parameters of replication_yyerror(), see 
attached fixup patch.


Good catch.  There was also a similar issue with syncrep_yyerror().


- Some white space issues fixed in an attached fixup patch.


committed

- Also fixed the static remaining variables in the replication parser in 
an attached patch.


Thanks, I'll take a look at that.

- There seems to be a lot left to do to make the plpgsql scanner 
actually re-entrant, so I do not think it would make sense to commit the 
patch which sets the re-entrant option before that is done.


Ok, we can hold that one back until the full stack including the parser 
is done.






Re: [PROPOSAL] : Disallow use of empty column name in (column_name '') in ALTER or CREATE of foreign table.

2024-12-19 Thread Nishant Sharma
Hi,


Summary of this thread:

Tom had the following concern:
Regarding the location of the check in
ATExecAlterColumnGenericOptions, and the spell-check
comparison issue. I believe this was addressed by v2 and
its response.

Ashutosh had the following suggestion:
The check should be delegated to an FDW validator. I
believe this was addressed in v2.

Michael had the following concern:
We should also care about empty values in schema_name
and table_name. I believe I have addressed this in the v3
patch.

Robert had the following concern:
Regarding the error message and a single patch for this. I
believe I have addressed this in the v4 patch.

Tom, Ashutosh, Michael, Robert: please let me know whether I was
able to address your concerns or if you feel differently.

Assuming Tom, Ashutosh, Michael, and Robert feel that
I have addressed the concerns mentioned above, does anybody
have any additional feedback or concerns about this patch? If not,
I would request that we move to the commit phase.
Thank you!


Regards,
Nishant Sharma (EDB).



Re: pure parsers and reentrant scanners

2024-12-19 Thread Peter Eisentraut

On 18.12.24 18:43, Tom Lane wrote:

Peter Eisentraut  writes:

I started committing the cube and seg pieces.  There were a couple of
complaints from the buildfarm, like
segscan.c:348:15: error: redefinition of typedef 'yyscan_t' is a C11
feature [-Werror,-Wtypedef-redefinition]
typedef void* yyscan_t;
...
(Also, we should probably figure out a way to get these warnings before
things hit the buildfarm.)


Interestingly, while sifaka shows that, its sibling indri doesn't.
Same compiler, same CFLAGS.  I think the relevant difference must
be that sifaka is using a much older Bison version (the Apple-supplied
2.3, versus MacPorts' up-to-the-minute version).  I think that sort of
thing is exactly why we have the buildfarm.  It would not be
reasonable to expect CI to cover that many cases.  Trying to do so
would just make CI slow enough that we'd start looking for a new test
phase to put in front of it.


The situation is that most current compilers default to some newer C 
standard version, and so they won't complain about use of C11 features. 
But the affected buildfarm members for whatever reason run with 
CC='clang -std=gnu99', and so they correctly reject C11 features.  We 
could do something similar in the Cirrus configuration.  I'll start a 
separate thread about that.






Re: Proposal to add a new URL data type.

2024-12-19 Thread Victor Yegorov
On Wed, Dec 11, 2024 at 7:04 PM, Alexander Borisov wrote:

> > I've created a commitfest entry for the patch:
> > https://commitfest.postgresql.org/51/5432/
> > I was not able to find you; please register a community account and set
> > yourself as an author for the patch.
>
> Done.
>

I've marked this patch as Rejected, per discussion.

Still, I find this functionality nice to have; I'd be happy if you could
create an extension on GitHub (or a similar platform).

-- 
Victor Yegorov


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-12-19 Thread Melanie Plageman
On Thu, Dec 19, 2024 at 10:23 AM Melanie Plageman
 wrote:
>
> On Thu, Dec 19, 2024 at 10:12 AM Richard Guo  wrote:
> >
> > On Thu, Dec 19, 2024 at 6:15 PM Richard Guo  wrote:
> > > I think we need to check whether rs_tbmiterator is NULL before calling
> > > tbm_end_iterate on it, like below.
> > >
> > > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > > @@ -572,9 +572,11 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
> > > if (scan)
> > > {
> > > /*
> > > -* End iteration on iterators saved in scan descriptor.
> > > +* End iteration on iterators saved in scan descriptor, if they
> > > +* haven't already been cleaned up.
> > >  */
> > > -   tbm_end_iterate(&scan->st.rs_tbmiterator);
> > > +   if (!tbm_exhausted(&scan->st.rs_tbmiterator))
> > > +   tbm_end_iterate(&scan->st.rs_tbmiterator);
> > >
> > > /* rescan to release any page pin */
> > > table_rescan(node->ss.ss_currentScanDesc, NULL);
> >
> > This change may also be needed in ExecEndBitmapHeapScan().
>
> Thanks, Richard! I'm working on the fix and adding the test case you found.

Okay, pushed. Thanks so much, Richard. I also noticed another
oversight and fixed it.

- Melanie




Re: sslinfo extension - add notbefore and notafter timestamps

2024-12-19 Thread Cary Huang
 > > The recent bump in minimum required versions of OpenSSL and LibreSSL made me
 > > remember to revisit this patch which was previously reverted due to library
 > > incompatibility (with *both* OpenSSL and LibreSSL on different APIs).
 > > 
 > > The attached removes the timestamp conversion workaround which is no longer
 > > needed.
 > 
 > The patch was marked as ready for committer and is currently failing
 > in the CI.  I've moved it to next CF waiting on author.  Could you
 > provide a rebase?

Since the minimum OpenSSL version is now 1.1.1, the v13 patch would fail the CI
because it uses the old APIs to obtain the notBefore and notAfter timestamps:

- X509_get_notBefore
- X509_get_notAfter

which were deprecated in OpenSSL 1.1.0.
Instead, it should use:

- X509_get0_notBefore
- X509_get0_notAfter

which are available in version 1.1.1 and beyond. These APIs now return a "const 
ASN1_TIME*"
instead of "ASN1_TIME*".

The changes below should fix the CI failure when applied to the v13 patch:

---   a/contrib/sslinfo/sslinfo.c
+++ b/contrib/sslinfo/sslinfo.c

-static Datum ASN1_TIME_to_timestamptz(ASN1_TIME *time);
+static Datum ASN1_TIME_to_timestamptz(const ASN1_TIME *time);

-ASN1_TIME_to_timestamptz(ASN1_TIME *ASN1_cert_ts)
+ASN1_TIME_to_timestamptz(const ASN1_TIME *ASN1_cert_ts)

-   return ASN1_TIME_to_timestamptz(X509_get_notBefore(cert));
+  return ASN1_TIME_to_timestamptz(X509_get0_notBefore(cert));

-   return ASN1_TIME_to_timestamptz(X509_get_notAfter(cert));
+  return ASN1_TIME_to_timestamptz(X509_get0_notAfter(cert));


---   a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c

-static TimestampTz ASN1_TIME_to_timestamptz(ASN1_TIME *time);
+static TimestampTz ASN1_TIME_to_timestamptz(const ASN1_TIME *time);

-ASN1_TIME_to_timestamptz(ASN1_TIME *ASN1_cert_ts)
+ASN1_TIME_to_timestamptz(const ASN1_TIME *ASN1_cert_ts)

-*ptr = ASN1_TIME_to_timestamptz(X509_get_notBefore(port->peer));
+*ptr = ASN1_TIME_to_timestamptz(X509_get0_notBefore(port->peer));

-  *ptr = ASN1_TIME_to_timestamptz(X509_get_notAfter(port->peer));
+ *ptr = ASN1_TIME_to_timestamptz(X509_get0_notAfter(port->peer));

Could you rebase with the above changes?


Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca









Re: Fix crash when non-creator being an iteration on shared radix tree

2024-12-19 Thread Masahiko Sawada
On Wed, Dec 18, 2024 at 10:32 PM John Naylor  wrote:
>
> On Thu, Dec 19, 2024 at 1:00 AM Masahiko Sawada  wrote:
> >
> > On Tue, Dec 17, 2024 at 11:12 PM John Naylor  
> > wrote:
> > > +1 in general, but I wonder if instead the iter_context should be
> > > created within RT_BEGIN_ITERATE -- I imagine that would have less
> > > duplication and would be as safe, but I haven't tried it. Is there
> > > some reason not  to do that?
> >
> > I agree that it has less duplication. There is no strong reason I
> > didn't do that. I just didn't want to check 'if (!tree->iter_context)'
> > in RT_BEGIN_ITERATE for simplicity. I've changed the patch
> > accordingly.
>
> I see what you mean. For v17, a bit of duplication is probably worth
> it for simplicity, so I'd say v1 is fine there.
>
> However, I think on master we should reconsider some aspects of memory
> management more broadly:
>
> 1. The creator allocates the root of the tree in a new child context,
> but an attaching process allocates it in its current context, and we
> pfree it when the caller wants to detach. It seems like we could
> always allocate this small struct in CurrentMemoryContext for
> consistency.
>
> 2. The iter_context is separate because the creator's new context
> could be a bump context which doesn't support pfree. But above we
> assume we can pfree in the caller's context. Also, IIUC we only
> allocate small iter objects, and it'd be unusual to need more than one
> at a time per backend, so it's a bit strange to have an entire context
> for that. Since we use a standard pattern of "begin; while(iter);
> end;", it seems unlikely that someone will cause a leak because of a
> coding mistake in iteration.
>
> If these tiny admin structs were always, not sometimes, in the callers
> current context, I think it would be easier to reason about because
> then the creator's passed context would be used only for local memory,
> specifically only for leaves and the inner node child contexts.
> Thoughts?

Fair points. Given that we need only one iterator at a time per
backend, it would be simpler if the caller passed the pointer to an
iterator that is a stack variable to RT_BEGIN_ITERATE(). For example,
TidStoreBeginIterate() would look like:

if (TidStoreIsShared(ts))
    shared_ts_begin_iterate(ts->tree.shared, &iter->tree_iter.shared);
else
    local_ts_begin_iterate(ts->tree.local, &iter->tree_iter.local);

>
> Further,
>
> 3. I was never a fan of trying to second-guess the creator's new
> context and instead use slab for fixed-sized leaf allocations. If the
> creator passes a bump context, we say "no, no, no, use slab -- it's
> good for ya". Let's assume the caller knows what they're doing.

That's a valid argument, but how can a user use a slab context for
leaf allocations? If the caller passes an allocset context to
RT_CREATE(), it still makes sense to use a slab context for leaf
allocation in terms of avoiding possible space wastage.

> 4. For local memory, an allocated "control object" serves no real
> purpose and wastes a few cycles on every access. I'm not sure it
> matters that much as a future micro-optimization, but I mention it
> here because if we did start allocating outer structs in the callers
> context, embedding would also remove the need to pfree it.

Using an allocated "control object" can simplify the code for local
and shared trees. We cannot embed the control object into
RT_RADIX_TREE in the shared case. I agree with embedding the control
data if we can implement that cleanly.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: JIT compilation per plan node

2024-12-19 Thread David Rowley
Thanks for taking a look and thinking about this problem.

On Fri, 20 Dec 2024 at 03:49, Matheus Alcantara
 wrote:
> static void
> plan_consider_jit(Plan *plan)
> {
> plan->jit = false;
>
> if (jit_enabled)
> {
> Cost    total_cost;
>
> total_cost = plan->total_cost * plan->plan_rows;
>
> if (total_cost > jit_above_cost)
> plan->jit = true;
> }
> }

Unfortunately, it's not that easy. "plan_rows" isn't the estimated
number of times the node will be rescanned, it's the number of rows we
expect it to return. Multiplying plan->total_cost by that yields a
nonsensical value.

I made an attempt in [1] to get the nloop estimate down to where it's
needed so we could multiply the total_cost by that. IIRC, a few people
weren't happy about the churn that caused. I didn't come up with an
alternative.

Generally, I think that going with the number of expected evaluations
of the expression is a more flexible option. The total_cost *
est_nloops of a plan node really isn't a great indication of how much
effort will be spent evaluating the expression. The join filter on a
non-parameterised nested loop is likely the upper extreme end, and
maybe something like a join filter on a parameterized nested loop or a
Bitmap Heap Scan recheck condition is maybe the lower extreme where
the expr eval cost just isn't that well correlated to the total node
cost.

I'll restate we don't need perfect right away, but maybe if we come up
with infrastructure that can be incrementally improved, that would be
the best direction to move towards. My best guess at how that might
work was in the email you replied to. However, I admit to not having
spent much time relooking at it or even thinking about it in the past
few years.

David

[1] 
https://postgr.es/m/CAApHDvoq5VhV%3D2euyjgBN2bC8Bds9Dtr0bG7R%3DreeefJWKJRXA%40mail.gmail.com




Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread Tom Lane
David Rowley  writes:
> One minor detail... I think the only thing I'd like to see is the
> moving of the enable_hashagg checks to increment the disabled_nodes
> count in create_setop_path() instead of where it's being called.  I
> understand there's only 1 caller of that function that passes
> SETOP_HASHED, but it does seem nicer to put that logic where it
> belongs. With how you have it now, if we were ever to grow any more
> places that built SETOP_HASHED SetOpPaths, they'd also need to adjust
> disabled_nodes manually and that seems easy to forget.  Also, looking
> around for references to "disabled_nodes", it looks like all other
> places where we fiddle with the value of disabled_nodes are in
> costsize.c.

Looks like costsize.c and pathnode.c to me, but either way I take your
point.  I'd not realized that Robert set it up that way, but now I see
he did.  I agree that moving that bit of logic into
create_setop_path() seems better.  I'll make it so and push.

regards, tom lane




Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2024-12-19 Thread Michael Paquier
On Thu, Dec 19, 2024 at 02:28:21PM +0800, jian he wrote:
> Since materialized views have physical storage,
> 
> we can make COPY on materialized views use COPY table_name instead of
> COPY (query).
> 
> Some simple tests show around a 3.7% or 4.3% speedup.

This restriction comes from 3bf3ab8c5636 as such relations may not be
scannable when they have no data, no?  Perhaps this restriction could
be lifted, but I'd suggest to dig more into the lists, there should be
arguments and ideas explaining what could be done in this case
(spoiler: I did not look at that).
--
Michael




Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

2024-12-19 Thread Michael Paquier
On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
>> LGTM
> 
> Dito.

Thanks for double-checking.  Done this one.
--
Michael




Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread Tom Lane
David Rowley  writes:
> On Thu, 19 Dec 2024 at 15:44, Tom Lane  wrote:
>> The common_result_slot_type() function I wrote here perhaps
>> should be made generally available, but I didn't do that yet.

> I think it would be good to make this generic as it can be at least
> used in nodeRecursiveunion.c and nodeAppend.c.

OK, done, and I added an 0006 patch that uses that infrastructure
in the obvious places.

I also addressed your remarks about comments.  Otherwise I'm feeling
like this is about ready to push.

regards, tom lane

From 3e5c08bdadb64d87945b7328d3ba7e28b7590883 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Thu, 19 Dec 2024 12:25:31 -0500
Subject: [PATCH v5 1/6] Convert SetOp to read its inputs as outerPlan and
 innerPlan.

The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from.  Then the SetOp node
pries them apart again based on the flag.  This is bizarre.  The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two.  But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input.  On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either.  Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.

This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.

For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups.  The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before.  The cross-comparisons between left and
right groups probably add to that, but I'm not sure how to quantify how
many more comparisons we might need.

For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.

In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty.  This could have been done before in the hashed mode, but
not in sorted mode.  Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.

Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.

Discussion: https://postgr.es/m/1850138.1731549...@sss.pgh.pa.us
---
 src/backend/executor/execUtils.c|  43 ++
 src/backend/executor/nodeSetOp.c| 537 ++--
 src/backend/optimizer/README|   2 +-
 src/backend/optimizer/plan/createplan.c |  81 ++--
 src/backend/optimizer/prep/prepunion.c  | 156 ---
 src/backend/optimizer/util/pathnode.c   |  50 ++-
 src/include/executor/executor.h |   3 +
 src/include/nodes/execnodes.h   |  31 +-
 src/include/nodes/pathnodes.h   |   9 +-
 src/include/nodes/plannodes.h   |  19 +-
 src/include/optimizer/pathnode.h|   7 +-
 src/test/regress/expected/subselect.out |  20 +-
 src/test/regress/expected/union.out | 254 +--
 src/test/regress/sql/union.sql  |   6 +
 src/tools/pgindent/typedefs.list|   2 +
 15 files changed, 696 insertions(+), 524 deletions(-)

diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 740e8fb148..df0223129c 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -526,6 +526,49 @@ ExecGetResultSlotOps(PlanState *planstate, bool *isfixed)
 	return planstate->ps_ResultTupleSlot->tts_ops;
 }
 
+/*
+ * ExecGetCommonSlotOps - identify common result slot type, if any
+ *
+ * If all the given PlanState nodes return the same fixed tuple slot type,
+ * return the slot ops struct for that slot type.  Else, return NULL.
+ */
+const TupleTableSlotOps *
+ExecGetCommonSlotOps(PlanState **planstates, int nplans)
+{
+	const TupleTableSlotOps *result;
+	bool		isfixed;
+
+	if (nplans <= 0)
+		return NULL;
+	result = ExecGe

Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread David Rowley
On Fri, 20 Dec 2024 at 08:38, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Thu, 19 Dec 2024 at 15:44, Tom Lane  wrote:
> >> The common_result_slot_type() function I wrote here perhaps
> >> should be made generally available, but I didn't do that yet.
>
> > I think it would be good to make this generic as it can be at least
> > used in nodeRecursiveunion.c and nodeAppend.c.
>
> OK, done, and I added an 0006 patch that uses that infrastructure
> in the obvious places.

That looks good. Thanks for adjusting the other node types too.

> I also addressed your remarks about comments.  Otherwise I'm feeling
> like this is about ready to push.

I think so too.

One minor detail... I think the only thing I'd like to see is the
moving of the enable_hashagg checks to increment the disabled_nodes
count in create_setop_path() instead of where it's being called.  I
understand there's only 1 caller of that function that passes
SETOP_HASHED, but it does seem nicer to put that logic where it
belongs. With how you have it now, if we were ever to grow any more
places that built SETOP_HASHED SetOpPaths, they'd also need to adjust
disabled_nodes manually and that seems easy to forget.  Also, looking
around for references to "disabled_nodes", it looks like all other
places where we fiddle with the value of disabled_nodes are in
costsize.c. I understand we do check enable_hashagg in other places,
but those all seem to be so we avoid generating some Path rather than
to determine the disabled_node value.

David




Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread Tom Lane
Pushed ... and now I have one more beef about the way things are
in this area.  I don't think we should leave the compatibility
function BuildTupleHashTable() in place in HEAD.  Making it a
wrapper around a new function BuildTupleHashTableExt() was a fine
solution for preserving ABI in released branches, but that doesn't
mean we should clutter the code with unused ABI hacks forevermore.
Attached is a patch to take it out and then rename
BuildTupleHashTableExt() back to BuildTupleHashTable().

Since BuildTupleHashTableExt has already grown more arguments
in HEAD than it had in v17, renaming it doesn't increase the number of
places that will have to be touched in any extensions that were using
this infrastructure.  Removal of the compatibility wrapper could force
some code updates, but really we want those places to update anyway.

I also made an effort at fixing the woefully out of date
header comment for it.

regards, tom lane

From 322de6ae7d91c28e97f31ba1aed878181bcdc5aa Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Thu, 19 Dec 2024 15:25:57 -0500
Subject: [PATCH v1] Get rid of old version of BuildTupleHashTable().

It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version.  There are no remaining callers in core, so just
get rid of it.  Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().

While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext].  It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.
---
 src/backend/executor/execGrouping.c   | 93 +--
 src/backend/executor/nodeAgg.c| 28 +++
 src/backend/executor/nodeRecursiveunion.c | 30 
 src/backend/executor/nodeSetOp.c  | 30 
 src/backend/executor/nodeSubplan.c| 58 +++---
 src/include/executor/executor.h   | 22 ++
 6 files changed, 115 insertions(+), 146 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 7491e53c03..0aa9e92ad9 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -135,36 +135,43 @@ execTuplesHashPrepare(int numCols,
 /*
  * Construct an empty TupleHashTable
  *
- *  inputOps: slot ops for input hash values, or NULL if unknown or not fixed
- *	numCols, keyColIdx: identify the tuple fields to use as lookup key
- *	eqfunctions: equality comparison functions to use
- *	hashfunctions: datatype-specific hashing functions to use
+ *	parent: PlanState node that will own this hash table
+ *	inputDesc: tuple descriptor for input tuples
+ *	inputOps: slot ops for input tuples, or NULL if unknown or not fixed
+ *	numCols: number of columns to be compared (length of next 4 arrays)
+ *	keyColIdx: indexes of tuple columns to compare
+ *	eqfuncoids: OIDs of equality comparison functions to use
+ *	hashfunctions: FmgrInfos of datatype-specific hashing functions to use
+ *	collations: collations to use in comparisons
  *	nbuckets: initial estimate of hashtable size
  *	additionalsize: size of data stored in ->additional
  *	metacxt: memory context for long-lived allocation, but not per-entry data
  *	tablecxt: memory context in which to store table entries
  *	tempcxt: short-lived context for evaluation hash and comparison functions
+ *	use_variable_hash_iv: if true, adjust hash IV per-parallel-worker
  *
- * The function arrays may be made with execTuplesHashPrepare().  Note they
+ * The hashfunctions array may be made with execTuplesHashPrepare().  Note they
  * are not cross-type functions, but expect to see the table datatype(s)
  * on both sides.
  *
- * Note that keyColIdx, eqfunctions, and hashfunctions must be allocated in
- * storage that will live as long as the hashtable does.
+ * Note that the keyColIdx, hashfunctions, and collations arrays must be
+ * allocated in storage that will live as long as the hashtable does.
  */
 TupleHashTable
-BuildTupleHashTableExt(PlanState *parent,
-	   TupleDesc inputDesc,
-	   const TupleTableSlotOps *inputOps,
-	   int numCols, AttrNumber *keyColIdx,
-	   const Oid *eqfuncoids,
-	   FmgrInfo *hashfunctions,
-	   Oid *collations,
-	   long nbuckets, Size additionalsize,
-	   MemoryContext metacxt,
-	   MemoryContext tablecxt,
-	   MemoryContext tempcxt,
-	   bool use_variable_hash_iv)
+BuildTupleHashTable(PlanState *parent,
+	TupleDesc inputDesc,
+	const TupleTableSlotOps *inputOps,
+	int numCols,
+	AttrNumber *keyColIdx,
+	const Oid *eqfuncoids,
+	FmgrInfo *hashfunctions,
+	Oid *collations,
+	long nbuckets,
+	Size additionalsize,
+	MemoryContext metacxt,
+	MemoryContext tablecxt,
+	MemoryContext tempcxt,
+	bool use_variable_hash_iv)
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryD

Re: AIO v2.0

2024-12-19 Thread Andres Freund
Hi,

Sorry for losing track of your message for so long; I saw it just now
because I was working on posting a new version.


On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:
>  On Fri, Sep 6, 2024 at 9:38 PM Andres Freund  wrote:
> Thank you for your admirable persistence on this. Please do not take this as
> criticism, just a set of questions regarding the patchset v2.1 that
> I finally got a little time to play with:
>
> 0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
> pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?

Yea, it shouldn't ever happen, but it's worth adding a check.


> Otherwise, in theory, we'll never know that the SQ is full; perhaps at least such
> a check should be made with Assert()? (I understand right now that we
> allow just up to io_uring_queue_init(io_max_concurrency), but what happens
> if:
> a. previous io_uring_submit() failed for some reason and we do not have
> free space for SQ?

We'd have PANICed at that failure :)


> b. (hypothetical) someday someone will try to make PG multithreaded and the
> code starts using just one big queue, still without checking for
> io_uring_get_sqe()?

That'd not make sense - you'd still want to use separate rings, to avoid
contention.


> 1. In [0] you wrote that there's this high amount of FDs consumed for
> io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
> many customers who are using extremely high max_connections (4k-5k, but
> there outliers with 10k in the wild too) - so they won't even start - and I
> have one doubt on the user-friendliness impact of this. I'm quite certain
> it's going to be the same as with pgbouncer where one is forced to tweak
> OS(systemd/pam/limits.conf/etc), but in PG we are better because PG tries
> to preallocate and then close() a lot of FDs, so that's safer in runtime.
> IMVHO even if we just consume e.g. > 30% of FDs just for io_uring, then
> max_files_per_process loses its spirit a little bit and PG is going to
> start losing efficiency too due to frequent open()/close() calls as the fd
> cache is too small. Tomas also complained about it some time ago in [1])

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher.  The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.

On several of my machines the hard limit is 1073741816.



> So maybe it would be good to introduce a couple of sanity checks too (even
> after setting a higher limit):
> - issue FATAL in case of using io_method = io_uring && max_connections being
> close to getrlimit(RLIMIT_NOFILE)
> - issue a warning in case of using io_method = io_uring && we wouldn't have
> even 1k real FDs free for handling relation FDs (detect something bad like:
> getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

Probably still worth adding something like this, even if we were to do what I
am suggesting above.


> 2. In pgaio_uring_postmaster_child_init_local() there is
> "io_uring_queue_init(32,...)" - why 32? :) And also there's a separate
> io_uring_queue_init(io_max_concurrency) which seems to be derived from
> AioChooseMaxConcurrency(), which can go up to 64?

Yea, that's probably not right.


> 3. I find having two GUCs with nearly the same name
> (effective_io_concurrency, io_max_concurrency) confusing. It is clear from
> the IO_URING perspective what io_max_concurrency is all about, but I bet
> having effective_io_concurrency in the mix as well is going to be a little
> confusing for users (well, it is to me). Maybe that README.md could
> elaborate a little bit on the relation between those two? Or do you plan to
> remove io_max_concurrency and bind it to effective_io_concurrency in the
> future?

io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.
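As a configuration-level sketch of that relationship (values are illustrative, not recommendations):

```ini
# io_max_concurrency: per-backend hard cap; sizes shared memory, so it can
# only be set at server start.
io_max_concurrency = 64

# effective_io_concurrency: how much one scan (or other prefetching code)
# may issue; changeable per session or per tablespace, up to the cap above.
effective_io_concurrency = 16
```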



> To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while
> the earlier mentioned AioChooseMaxConcurrency() goes up to just 64

Yea, that should probably be disambiguated.


> 4. While we are at this, shouldn't the patch rename debug_io_direct to
> simply io_direct so that GUCs are consistent in terms of naming?

I used to have a patch like that in the series and it was a pain to
rebase...

I also suspect this isn't quite enough to make debug_io_direct
production ready, even if just considering io_direct=data. Without streaming
read use 

Re: AIO v2.0

2024-12-19 Thread Tom Lane
Andres Freund  writes:
> My current thoughts around this are that we should generally, independent of
> io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources.  If ENFILE
weren't a thing, it'd be all right, but that is a thing.  We have no
business trying to consume resources the DBA didn't tell us we could.

regards, tom lane




Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-19 Thread Melanie Plageman
On Wed, Dec 18, 2024 at 9:50 PM Richard Guo  wrote:
>
> On Thu, Dec 19, 2024 at 8:18 AM Melanie Plageman
>  wrote:
> > I pushed the straightforward option for now so that it's fixed.
>
> I think this binary search code now has a risk of underflow.  If 'mid'
> is calculated as zero, the second 'if' branch will cause 'end' to
> underflow.

Thanks, Richard!

> Maybe we need to do something like below.
>
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2600,7 +2600,11 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer 
> buffer,
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> +   {
> +   if (mid == 0)
> +   return false;
> end = mid - 1;
> +   }
> else
> start = mid + 1;
> }
>
> Alternatively, we can revert 'start' and 'end' to signed int as they
> were before.

What about this instead:

diff --git a/src/backend/access/heap/heapam_handler.c
b/src/backend/access/heap/heapam_handler.c
index 2da4e4da13e..fb90fd869c2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2574,11 +2574,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,

if (scan->rs_flags & SO_ALLOW_PAGEMODE)
{
-   uint32  start,
-   end;
-
-   if (hscan->rs_ntuples == 0)
-   return false;
+   uint32  start = 0,
+   end = hscan->rs_ntuples;

/*
 * In pageatatime mode, heap_prepare_pagescan() already did visibility
@@ -2589,10 +2586,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 * in increasing order, but it's not clear that there would be enough
 * gain to justify the restriction.
 */
-   start = 0;
-   end = hscan->rs_ntuples - 1;

-   while (start <= end)
+   while (start < end)
{
uint32  mid = (start + end) / 2;
OffsetNumber curoffset = hscan->rs_vistuples[mid];
@@ -2600,7 +2595,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
if (tupoffset == curoffset)
return true;
else if (tupoffset < curoffset)
-   end = mid - 1;
+   end = mid;
else
start = mid + 1;
}

Or to make it easier to read, here:

uint32       start = 0,
             end = hscan->rs_ntuples;

while (start < end)
{
    uint32       mid = (start + end) / 2;
    OffsetNumber curoffset = hscan->rs_vistuples[mid];

    if (tupoffset == curoffset)
        return true;
    else if (tupoffset < curoffset)
        end = mid;
    else
        start = mid + 1;
}

I think this gets rid of any subtraction and is still the same.

- Melanie




Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread David Rowley
On Fri, 20 Dec 2024 at 11:23, Tom Lane  wrote:
> Pushed ... and now I have one more beef about the way things are
> in this area.  I don't think we should leave the compatibility
> function BuildTupleHashTable() in place in HEAD.  Making it a
> wrapper around a new function BuildTupleHashTableExt() was a fine
> solution for preserving ABI in released branches, but that doesn't
> mean we should clutter the code with unused ABI hacks forevermore.
> Attached is a patch to take it out and then rename
> BuildTupleHashTableExt() back to BuildTupleHashTable().

No complaints here. Thanks for cleaning that up.

I couldn't help but also notice the nbuckets parameter is using the
"long" datatype. The code in BuildTupleHashTable seems to think it's
fine to pass the long as the uint32 parameter to tuplehash_create().
I just thought if we're in the area of adjusting this API function's
signature then it might be worth fixing the "long" issue at the same
time, or at least in the same release.

I'm also quite keen to see less use of long as it's not a very
consistently sized datatype on all platforms which can lead to
platform dependent bugs.

David




Allow ILIKE forward matching to use btree index

2024-12-19 Thread Yugo Nagata
Hi,

Currently, btree indexes cannot be used for the ILIKE (~~*) operator if the
pattern has case-varying characters, although a LIKE (~~) expression can be
converted to indexable clauses by the planner support function (if the
collation is "C" or the operator class 'text_pattern_ops' is used).

For example, "t ~~ '123foo%'" is converted to "(t >= '123foo' AND t < '123fop')"
and an index scan can be used for this condition. On the other hand,
"t ~~* '123foo%'" cannot be converted, and a sequential scan is used.

Even in this case, we can use a bitmap index scan for the condition
"(t >= '123f' AND t < '123g') OR (t >= '123F' AND t < '123G')" followed by
a recheck with the original condition "t ~~* '123foo%'", and this could be
faster than a seqscan.

However, even if the support function returned OR clauses, the current
planner implementation cannot build bitmap scan paths using them.

The attached patches allow ILIKE (~~*) forward matching to use a btree index.

Patch 0001 enables get_index_clause_from_support to receive OR clauses
from support functions and use them to build bitmap index scans later. OR
clauses returned by support functions are collected in the new argument
'orclauses' of match_restriction_clauses_to_index(), and they are passed to
generate_bitmap_or_paths() later to build bitmap scan paths.

Patch 0002 modifies the support function to return OR clauses, as in the
example above, when the ILIKE pattern has case-varying characters in forward
matching. The OR clause contains two index conditions, for the upper- and
lower-case letter of the first case-varying character in the pattern
respectively. Although the subsequent characters are not considered in the
index scans, it could still be faster than a sequential scan.

Here is an example. 

1. Create a table with random text records

=# CREATE TABLE tbl (t text);
=# INSERT INTO tbl SELECT CASE WHEN i%2=1 THEN upper(x) ELSE x END
     FROM (SELECT i, md5(i::text) x FROM generate_series(1,500) i);

2. Create an index
=# CREATE INDEX ON tbl (t text_pattern_ops);

3. Before applying patch: Seq Scan was selected. It takes about 4 sec.

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..58712.97 rows=500 width=33) (actual time=101.096..4110.979 rows=64 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   Buffers: shared hit=10778 read=31260
   ->  Parallel Seq Scan on tbl  (cost=0.00..57662.97 rows=125 width=33) (actual time=440.232..4101.023 rows=13 loops=5)
         Filter: (t ~~* '12ab%'::text)
         Rows Removed by Filter: 87
         Buffers: shared hit=10778 read=31260
 Planning Time: 0.415 ms
 Execution Time: 4111.051 ms
(10 rows)

4. After applying patch: Bitmap Index Scan was selected. It takes only 10 ms.

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE t ILIKE '12ab%';
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbl  (cost=9.38..13.40 rows=500 width=33) (actual time=3.720..9.660 rows=64 loops=1)
   Recheck Cond: (((t ~>=~ '12A'::text) AND (t ~<~ '12B'::text) AND (t ~~* '12ab%'::text)) OR ((t ~>=~ '12a'::text) AND (t ~<~ '12b'::text) AND (t ~~* '12ab%'::text)))
   Filter: (t ~~* '12ab%'::text)
   Rows Removed by Filter: 1205
   Heap Blocks: exact=1254
   Buffers: shared hit=1271
   ->  BitmapOr  (cost=9.38..9.38 rows=1 width=0) (actual time=2.370..2.372 rows=0 loops=1)
         Buffers: shared hit=17
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=1.192..1.193 rows=604 loops=1)
               Index Cond: ((t ~>=~ '12A'::text) AND (t ~<~ '12B'::text))
               Buffers: shared hit=8
         ->  Bitmap Index Scan on tbl_t_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=1.174..1.174 rows=675 loops=1)
               Index Cond: ((t ~>=~ '12a'::text) AND (t ~<~ '12b'::text))
               Buffers: shared hit=9
 Planning Time: 1.144 ms
 Execution Time: 9.785 ms
(16 rows)

What do you think about it?

I think another application of an OR-clause-returning support function might
be allowing an index to be used for a NOT LIKE (!~~) expression, because, for
example, "t !~~ '123foo%'" could be converted to
"(t < '123foo' OR t >= '123fooz')". (The second condition is a bit loose, but
this would be safe and not a problem since tuples are filtered by the original
condition after the index scan.) However, it would not be very useful unless
the distribution is so skewed that the NOT LIKE expression's selectivity is

Re: Skip collecting decoded changes of already-aborted transactions

2024-12-19 Thread Masahiko Sawada
On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila  wrote:
>
> On Thu, Dec 19, 2024 at 7:14 AM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 9, 2024 at 9:09 PM Amit Kapila  wrote:
> > >
> > > On Tue, Nov 26, 2024 at 3:03 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > I've attached a new version patch that incorporates all comments I got 
> > > > so far.
> > > >
> > >
> > > Review comments:
> >
> > Thank you for reviewing the patch!
> >
> > > ===
> > > 1.
> > > + * The given transaction is marked as streamed if appropriate and the 
> > > caller
> > > + * requested it by passing 'mark_txn_streaming' as true.
> > > + *
> > >   * 'txn_prepared' indicates that we have decoded the transaction at 
> > > prepare
> > >   * time.
> > >   */
> > >  static void
> > > -ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > bool txn_prepared)
> > > +ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> > > bool txn_prepared,
> > > + bool mark_txn_streaming)
> > >  {
> > > ...
> > >   }
> > > + else if (mark_txn_streaming && (rbtxn_is_toptxn(txn) ||
> > > (txn->nentries_mem != 0)))
> > > + {
> > > + /*
> > > + * Mark the transaction as streamed, if appropriate.
> > >
> > > The comments related to the above changes don't clarify in which cases
> > > the 'mark_txn_streaming' should be set. Before this patch, it was
> > > clear from the comments and code about the cases where we would decide
> > > to mark it as streamed.
> >
> > I think we can rename it to txn_streaming for consistency with
> > txn_prepared. I've changed the comment for that.
> >
>
> @@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn,
> ReorderBufferChange *specinsert)
>  {
>   /* Discard the changes that we just streamed */
> - ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
> + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
>
> @@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   * full cleanup will happen as part of the COMMIT PREPAREDs, so now
>   * just truncate txn by removing changes and tuplecids.
>   */
> - ReorderBufferTruncateTXN(rb, txn, true);
> + ReorderBufferTruncateTXN(rb, txn, true, true);
>
> In both the above places, the patch unconditionally passes the
> 'txn_streaming' even for prepared transactions when it wouldn't be a
> streaming xact. Inside the function, the patch handled that by first
> checking whether the transaction is prepared (txn_prepared). So, the
> logic will work, but the function signature and the way its callers
> use it make it difficult to use and extend in the future.
>

Valid concern.

> I think for the first case, we should get the streaming parameter in
> ReorderBufferResetTXN(),

I think we cannot pass 'rbtxn_is_streamed(txn)' to
ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
is called to handle the concurrent abort of the streaming transaction
but the transaction might not have been marked as streamed at that
time. Since ReorderBufferTruncateTXN() is responsible for both
discarding changes and marking the transaction as streamed, we need to
unconditionally pass txn_streaming = true in this case.

> and for the second case
> ReorderBufferStreamCommit(), we should pass it as false because by
> that time transaction is already streamed and prepared.  We are
> invoking it for cleanup.

Agreed.

> Even when we call ReorderBufferTruncateTXN()
> from ReorderBufferCheckAndTruncateAbortedTXN(), it will be better to
> write a comment at the caller about why we are passing this parameter
> as false.

Agreed.

On second thoughts, I think the confusion related to txn_streaming
came from the fact that ReorderBufferTruncateTXN() does both
discarding changes and marking the transaction as streamed. If we make
the function do just discarding changes, we don't need to introduce
the txn_streaming function argument. Instead, we need to have a
separate function to mark the transaction as streamed and call it
before ReorderBufferTruncateTXN() where appropriate. And
ReorderBufferCheckAndTruncateAbortedTXN() just calls
ReorderBufferTruncateTXN().

>
> >
> > >
> > > 4.
> > > + /*
> > > + * Remember if the transaction is already aborted so we can detect when
> > > + * the transaction is concurrently aborted during the replay.
> > > + */
> > > + already_aborted = rbtxn_is_aborted(txn);
> > > +
> > >   ReorderBufferReplay(txn, rb, xid, txn->final_lsn, txn->end_lsn,
> > >   txn->xact_time.prepare_time, txn->origin_id, txn->origin_lsn);
> > >
> > > @@ -2832,10 +2918,10 @@ ReorderBufferPrepare(ReorderBuffer *rb,
> > > TransactionId xid,
> > >   * when rollback prepared is decoded and sent, the downstream should be
> > >   * able to rollback such a xact. See comments atop DecodePrepare.
> > >   *
> > > - * Note, for the concurrent_abort + streaming case a stream_prepare was
> > > + * Note, for the concurrent abort + streaming case a stream_prepare was
> > >   * already sent wi

Re: Parallel heap vacuum

2024-12-19 Thread Masahiko Sawada
On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra  wrote:
>
> On 12/13/24 00:04, Tomas Vondra wrote:
> > ...
> >
> > The main difference is here:
> >
> >
> > master / no parallel workers:
> >
> >   pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> >
> > 1 parallel worker:
> >
> >   pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> >
> >
> > Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> > essentially just those with deleted tuples, which is ~1/20 of pages.
> > That's close to the 15x speedup.
> >
> > This effect is clearest without indexes, but it does affect even runs
> > with indexes - having to scan the indexes makes it much less pronounced,
> > though. However, these indexes are pretty massive (about the same size
> > as the table) - multiple times larger than the table. Chances are it'd
> > be clearer on realistic data sets.
> >
> > So the question is - is this correct? And if yes, why doesn't the
> > regular (serial) vacuum do that?
> >
> > There's some more strange things, though. For example, how come the avg
> > read rate is 0.000 MB/s?
> >
> >   avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> >
> > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> > pages are in shared buffers (thanks to the DELETE earlier in that session).
> >
>
> OK, after looking into this a bit more I think the reason is rather
> simple - SKIP_PAGES_THRESHOLD.
>
> With serial runs, we end up scanning all pages, because even with an
> update every 5000 tuples, that's still only ~25 pages apart, well within
> the 32-page window. So we end up skipping no pages, and scan and vacuum
> everything.
>
> But parallel runs have this skipping logic disabled, or rather the logic
> that switches to sequential scans if the gap is less than 32 pages.
>
>
> IMHO this raises two questions:
>
> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> sequential scans if the pages are close enough? Maybe there is a reason
> for this difference? Workers can reduce the difference between random
> and sequential I/O, similarly to prefetching. But that just means the
> workers should use a lower threshold, e.g.
>
> SKIP_PAGES_THRESHOLD / nworkers
>
> or something like that? I don't see this discussed in this thread.

Each parallel heap scan worker allocates a chunk of blocks, 8192 blocks at
maximum, so we would need to apply the SKIP_PAGES_THRESHOLD optimization
within the chunk. I agree that we need to evaluate the differences anyway.
Will do the benchmark test and share the results.

>
> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> storage. If I can get an order of magnitude improvement (or more than
> that) by disabling the threshold, and just doing random I/O, maybe
> there's time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Conflict detection for update_deleted in logical replication

2024-12-19 Thread Nisha Moond
Here is further performance test analysis with the v16 patch-set.

In the test scenarios already shared on -hackers [1], where pgbench was run
only on the publisher node in a pub-sub setup, no performance degradation
was observed on either node.

In contrast, when pgbench was run only on the subscriber side with
detect_update_deleted=on [2], TPS performance was reduced due to dead
tuple accumulation. This performance drop depended on
wal_receiver_status_interval: larger intervals resulted in more dead tuple
accumulation on the subscriber node. However, after the improvement in
patch v16-0002, which dynamically tunes the status request, the default TPS
reduction was limited to only 1%.

We performed more benchmarks with the v16 patches where pgbench was run on
both the publisher and subscriber, focusing on TPS performance. To
summarize the key observations:

 - No performance impact on the publisher, as dead tuple accumulation does
not occur on the publisher.
 - Performance is reduced on the subscriber side (~50% TPS reduction [3])
due to dead tuple retention for conflict detection when
detect_update_deleted=on.
 - The performance reduction happens only on the subscriber side, as the
workload on the publisher is pretty high and the apply workers must wait for
the transactions with earlier timestamps to be applied and flushed before
advancing the non-removable XID to remove dead tuples.
 - To validate this further, we modified the patch to check only each
transaction's commit_time and advance the non-removable XID if the
commit_time is greater than candidate_xid_time. The benchmark results [4]
remained consistent, showing a similar performance reduction. This confirms
that the performance impact on the subscriber side is reasonable behavior
if we want to detect the update_deleted conflict reliably.



We have also tested similar scenarios in physical streaming replication, to
see the effect of enabling the hot_standby_feedback and
recovery_min_apply_delay. The benchmark results[5] showed performance
reduction in these cases as well, though impact was less compared to the
update_deleted scenario because the physical walreceiver does not need to
wait for specified WAL to be applied before sending the hot standby
feedback message. However, as the recovery_min_apply_delay increased, a
similar TPS reduction (~50%) was observed, aligning with the behavior seen
in the update_deleted case.



Based on the above, I think the performance reduction observed with the
update_deleted patch is expected and necessary because the patch's main
goal is to retain dead tuples for reliable conflict detection. Reducing
this retention period would compromise the accuracy of update_deleted
detection.



The detailed benchmark results are as follow:



[1]
https://www.postgresql.org/message-id/CABdArM5SpMyGvQTsX0-d%3Db%2BJAh0VQjuoyf9jFqcrQ3JLws5eOw%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/TYAPR01MB5692B0182356F041DC9DE3B5F53E2%40TYAPR01MB5692.jpnprd01.prod.outlook.com



[3] Test with pgbench run on both publisher and subscriber.

Test setup:
- Tests performed on pgHead + v16 patches
- Created a pub-sub replication system.
- Parameters for both instances were:

   shared_buffers = 30GB
   min_wal_size = 10GB
   max_wal_size = 20GB
   autovacuum = false
   track_commit_timestamp = on (only for subscriber)

-- Note: to avoid update/delete_missing conflicts, tables to be modified on
the publisher side were renamed: pgbench_XXX -> pgbench_pub_XXX.

Test Run:
- Ran pgbench (read-write) on both the publisher and the subscriber with 30
clients for a duration of 120 seconds, collecting data over 5 runs.
- Note that pgbench was run on different tables on pub and sub.
(The scripts used for the test, "case1-2_measure.sh" and "case1-2_setup.sh",
are attached.)

Results:

Run#         pub TPS    sub TPS
1            32209      13704
2            32378      13684
3            32720      13680
4            31483      13681
5            31773      13813
median       32209      13684
regression   7%         -53%



Perf analysis (shows time spent on heap table scans):
```
-   68.22% 0.09%  postgres  postgres [.] ExecModifyTable
...
   - 62.86% index_getnext_slot
      - 61.49% index_fetch_heap
         - 61.40% heapam_index_fetch_tuple
            + 41.43% heap_hot_search_buffer
            + 13.11% ReleaseAndReadBuffer
              6.15% heap_page_prune_opt
      + 1.33% index_getnext_tid
   + 4.35% ExecUpdate
```





[4] Test with the patch modified as mentioned above (attached top-up patch
"0001-wait-for-commit-time").

Test setup:
- The setup is the same as [1], only with the code modified as in the
"0001-wait-for-commit-time" patch.

Test Run:
- Ran pgbench (read-write) on both the publisher and the subs

Re: Skip collecting decoded changes of already-aborted transactions

2024-12-19 Thread Masahiko Sawada
On Thu, Dec 19, 2024 at 9:36 PM Amit Kapila  wrote:
>
> On Fri, Dec 20, 2024 at 12:42 AM Masahiko Sawada  
> wrote:
> >
> > On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila  wrote:
> > >
> > >
> > > @@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn,
> > > ReorderBufferChange *specinsert)
> > >  {
> > >   /* Discard the changes that we just streamed */
> > > - ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
> > > + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
> > >
> > > @@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
> > > ReorderBufferTXN *txn)
> > >   * full cleanup will happen as part of the COMMIT PREPAREDs, so now
> > >   * just truncate txn by removing changes and tuplecids.
> > >   */
> > > - ReorderBufferTruncateTXN(rb, txn, true);
> > > + ReorderBufferTruncateTXN(rb, txn, true, true);
> > >
> > > In both the above places, the patch unconditionally passes the
> > > 'txn_streaming' even for prepared transactions when it wouldn't be a
> > > streaming xact. Inside the function, the patch handled that by first
> > > checking whether the transaction is prepared (txn_prepared). So, the
> > > logic will work, but the function signature and the way its callers
> > > use it make it difficult to use and extend in the future.
> > >
> >
> > Valid concern.
> >
> > > I think for the first case, we should get the streaming parameter in
> > > ReorderBufferResetTXN(),
> >
> > I think we cannot pass 'rbtxn_is_streamed(txn)' to
> > ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
> > is called to handle the concurrent abort of the streaming transaction
> > but the transaction might not have been marked as streamed at that
> > time. Since ReorderBufferTruncateTXN() is responsible for both
> > discarding changes and marking the transaction as streamed, we need to
> > unconditionally pass txn_streaming = true in this case.
> >
>
> Can't we use 'stream_started' variable available at the call site of
> ReorderBufferResetTXN() for our purpose?

Right, we can use it.

>
> >
> > On second thoughts, I think the confusion related to txn_streaming
> > came from the fact that ReorderBufferTruncateTXN() does both
> > discarding changes and marking the transaction as streamed. If we make
> > the function do just discarding changes, we don't need to introduce
> > the txn_streaming function argument. Instead, we need to have a
> > separate function to mark the transaction as streamed and call it
> > before ReorderBufferTruncateTXN() where appropriate. And
> > ReorderBufferCheckAndTruncateAbortedTXN() just calls
> > ReorderBufferTruncateTXN().
> >
>
> That sounds good to me. IIRC, initially, ReorderBufferTruncateTXN()
> was used to truncate changes only for streaming transactions. Later,
> it evolved for prepared xacts and now for xacts where we explicitly
> detect whether they are aborted. So, I think it makes sense to improve
> it by following your suggestion.

I've changed the patch accordingly.

>
> >
> > >
> > > *
> > > + * Since we don't check the transaction status while replaying the
> > > + * transaction, we don't need to reset toast reconstruction data here.
> > > + */
> > > + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
> > > +
> > > + /* All changes should be discarded */
> > > + Assert(txn->size == 0);
> > >
> > > Can we expect the size to be zero without resetting the toast data? In
> > > ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
> > > reduces the change size. So, won't that size still be accounted for in
> > > txn?
> >
> > IIUC the toast reconstruction data is created only while replaying the
> > transaction but the ReorderBufferCheckAndTruncateAbortedTXN() is not
> > called during that. So I think any toast data should not be
> > accumulated at that time.
> >
>
> How about the case where in the first pass, we streamed the
> transaction partially, where it has reconstructed toast data, and
> then, in the second pass, when memory becomes full, the reorder buffer
> contains some partial data, due to which it tries to spill the data
> and finds that the transaction is aborted? I could be wrong here
> because I haven't tried to test this code path, but I see that it is
> theoretically possible.

Yeah, it seems possible. I've changed the patch to reset toast data as well.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Skip-logical-decoding-of-already-aborted-transac.patch
Description: Binary data


Re: Converting SetOp to read its two inputs separately

2024-12-19 Thread Tom Lane
David Rowley  writes:
> On Fri, 20 Dec 2024 at 11:23, Tom Lane  wrote:
>> Attached is a patch to take it out and then rename
>> BuildTupleHashTableExt() back to BuildTupleHashTable().

> No complaints here. Thanks for cleaning that up.

Thanks, will push.

> I couldn't help but also notice the nbuckets parameter is using the
> "long" datatype. The code in BuildTupleHashTable seems to think it's
> fine to pass the long as the uint32 parameter to tuplehash_create().
> I just thought if we're in the area of adjusting this API function's
> signature then it might be worth fixing the "long" issue at the same
> time, or at least in the same release.

I'm in favor of doing that, but it seems like a separate patch,
because we'd have to chase things back a fairly long way.
For instance, the numGroups fields in Agg, RecursiveUnion, and
SetOp are all "long" at the moment, and some of the callers
are getting their arguments via clamp_cardinality_to_long()
which'd need adjustment, etc etc.

> I'm also quite keen to see less use of long as it's not a very
> consistently sized datatype on all platforms which can lead to
> platform dependent bugs.

Yeah.  Switching all these places to int64 likely would be
worthwhile cleanup (but I'm not volunteering).  Also, if
tuplehash_create expects a uint32, that isn't consistent
either.

regards, tom lane




Re: AIO v2.0

2024-12-19 Thread Andres Freund
Hi,

On 2024-12-19 17:34:29 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > My current thoughts around this are that we should generally, independent of
> > io_uring, increase the FD limit ourselves.
>
> I'm seriously down on that, because it amounts to an assumption that
> we own the machine and can appropriate all its resources.  If ENFILE
> weren't a thing, it'd be all right, but that is a thing.  We have no
> business trying to consume resources the DBA didn't tell us we could.

Arguably the configuration *did* tell us, by having a higher hard limit...

I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.

As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.

Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().

But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Greetings,

Andres Freund




Re: [PoC] Let libpq reject unexpected authentication requests

2024-12-19 Thread Jacob Champion
Michael Paquier  wrote:
> Please note that the CF entry has been marked as committed.  We should
> really do something about having a cleaner separation between SASL,
> the mechanisms and the AUTH_REQ_* codes, in the long term, though
> honestly I don't know yet what would be the most elegant and the least
> error-prone approach.  And for anything that touches authentication,
> simpler means better.

I've taken another shot at this over on the OAuth thread [1], for
those who are still interested; see v40-0002. It's more code than my
previous attempt, but I think it does a clearer job of separating the
two concerns.

Thanks,
--Jacob

[1] 
https://postgr.es/m/CAOYmi+=FzVg+C-pQHCwjW0qU-POHmzZaD2z3CdsACj==14h...@mail.gmail.com




Re: [PoC] Let libpq reject unexpected authentication requests

2024-12-19 Thread Michael Paquier
On Thu, Dec 19, 2024 at 05:02:19PM -0800, Jacob Champion wrote:
> I've taken another shot at this over on the OAuth thread [1], for
> those who are still interested; see v40-0002. It's more code than my
> previous attempt, but I think it does a clearer job of separating the
> two concerns.
> 
> [1] 
> https://postgr.es/m/CAOYmi+=FzVg+C-pQHCwjW0qU-POHmzZaD2z3CdsACj==14h...@mail.gmail.com

Ah, thanks for the poke on this one.  I'm wondering if v40-0002 and
v40-0001 should be in reverse order, but that's a discussion to keep
on the other thread, of course.
--
Michael




Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).

2024-12-19 Thread jian he
On Fri, Dec 20, 2024 at 8:02 AM Michael Paquier  wrote:
>
> On Thu, Dec 19, 2024 at 02:28:21PM +0800, jian he wrote:
> > Since materialized views have physical storage,
> >
> > we can make materialized views also using COPY table_name, instead of
> > COPY(query).
> >
> > Some simple tests show around 3.7% or 4.3% speed up.
>
> This restriction comes from 3bf3ab8c5636 as such relations may not be
> scannable when they have no data, no?  Perhaps this restriction could
> be lifted, but I'd suggest to dig more into the lists, there should be
> arguments and ideas explaining what could be done in this case
> (spoiler: I did not look at that).
> --


Thanks for the suggestion.
It was mentioned in links [1] and [2].

[1] 
https://www.postgresql.org/message-id/flat/8967.1353167301%40sss.pgh.pa.us#f5e947cfa9357dba780d238f1c5f6932
[2] https://www.postgresql.org/message-id/20121116162558.90150%40gmx.com


Basically, we want COPY to support both directions:
copy the_materialized_view to stdout;
copy the_materialized_view from stdin;
The first will work fine, but obviously "copy the_materialized_view
from stdin;" will not work, since data cannot be copied directly into a
materialized view.




Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

2024-12-19 Thread Robert Pang
Dear Michael,

Thank you for applying this back-patch. I also appreciate everyone's input
on this issue.

Sincerely,
Robert Pang



On Thu, Dec 19, 2024 at 4:13 PM Michael Paquier  wrote:

> On Thu, Dec 19, 2024 at 11:07:25AM -0500, Andres Freund wrote:
> > On 2024-12-19 09:31:14 -0600, Nathan Bossart wrote:
> >> LGTM
> >
> > Ditto.
>
> Thanks for double-checking.  Done this one.
> --
> Michael
>


Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-19 Thread Michael Paquier
On Thu, Dec 19, 2024 at 09:27:04AM -0800, Masahiko Sawada wrote:
> On Wed, Dec 18, 2024 at 6:26 PM Michael Paquier  wrote:
>> v2 does not have these weaknesses by design.
> 
> I agree that v2 is better than v3 in terms of that.

Okay.  In terms of the backbranches, would you prefer that I handle
this patch myself as I have done the HEAD part?  This would need a
second, closer, review but I could do that at the beginning of next
week.

Or perhaps you'd prefer doing it yourself?  That's up to you, happy to
help as always :D
--
Michael




Re: Logical Replication of sequences

2024-12-19 Thread Peter Smith
Hi Vignesh.

Here are some review comments for the patch v20241211-0005.

==
doc/src/sgml/logical-replication.sgml

Section "29.6.1. Sequence Definition Mismatches"

1.
+   
+
+ If there are differences in sequence definitions between the publisher and
+ subscriber, a WARNING is logged.
+
+   

Maybe this should say *when* this happens.

SUGGESTION
During sequence synchronization, the sequence definitions of the
publisher and the subscriber are compared. A WARNING is logged if any
differences are detected.

~~~

Section "29.6.3. Examples"

2.
Should the Examples section also have an example of ALTER SUBSCRIPTION
... REFRESH PUBLICATION to demonstrate (like in the TAP tests) that if
the sequences are already known, then those are not synchronised?

~~~

Section "29.8. Restrictions"

3.
+ Incremental sequence changes are not replicated.  The data in serial or
+ identity columns backed by sequences will of course be replicated as part
+ of the table, but the sequence itself would still show the start value on
+ the subscriber.  If the subscriber is used as a read-only database, then
+ this should typically not be a problem.  If, however, some kind of
+ switchover or failover to the subscriber database is intended, then the
+ sequences would need to be updated to the latest values, either
by executing
+ 
+ ALTER SUBSCRIPTION ... REFRESH PUBLICATION
SEQUENCES
+ or by copying the current data from the publisher (perhaps using
+ pg_dump) or by determining a sufficiently high value
+ from the tables themselves.

I don't know if you need to mention it, or maybe it is too obvious,
but the suggestion here to use "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION SEQUENCES" assumes you've already arranged for the
PUBLICATION to be publishing sequences before this.

==
doc/src/sgml/ref/alter_subscription.sgml

4.
  
-  Specifies whether to copy pre-existing data in the publications
-  that are being subscribed to when the replication starts.
-  The default is true.
+  Specifies whether to copy pre-existing data for tables and
synchronize
+  sequences in the publications that are being subscribed to
when the replication
+  starts. The default is true.
  

This also talks about "synchronize sequences" when the
replication starts, but it is a bit confusing. IIUC, the .. REFRESH
PUBLICATION only synchronizes *newly added* sequences anyway, so does
it mean even that will not happen if copy_data=false?

I think this option needs more clarification on how it interacts with
sequences. Also, I don't recall seeing any test for sequences and
copy_data in the patch 0004 TAP tests, so maybe something needs to be
added there too.

~~~

5.
+ 
+  See  for recommendations on how
+  to identify sequences and handle out-of-sync sequences.
+ 

/on how to identify sequences and handle out-of-sync sequences./on how
to identify and handle out-of-sync sequences./

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Can rs_cindex be < 0 for bitmap heap scans?

2024-12-19 Thread Richard Guo
On Fri, Dec 20, 2024 at 7:50 AM Melanie Plageman
 wrote:
> On Wed, Dec 18, 2024 at 9:50 PM Richard Guo  wrote:
> > I think this binary search code now has a risk of underflow.  If 'mid'
> > is calculated as zero, the second 'if' branch will cause 'end' to
> > underflow.

> What about this instead:
>
> diff --git a/src/backend/access/heap/heapam_handler.c
> b/src/backend/access/heap/heapam_handler.c
> index 2da4e4da13e..fb90fd869c2 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2574,11 +2574,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer 
> buffer,
>
> if (scan->rs_flags & SO_ALLOW_PAGEMODE)
> {
> -   uint32  start,
> -   end;
> -
> -   if (hscan->rs_ntuples == 0)
> -   return false;
> +   uint32  start = 0,
> +   end = hscan->rs_ntuples;
>
> /*
>  * In pageatatime mode, heap_prepare_pagescan() already did visibility
> @@ -2589,10 +2586,8 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer 
> buffer,
>  * in increasing order, but it's not clear that there would be enough
>  * gain to justify the restriction.
>  */
> -   start = 0;
> -   end = hscan->rs_ntuples - 1;
>
> -   while (start <= end)
> +   while (start < end)
> {
> uint32  mid = (start + end) / 2;
> OffsetNumber curoffset = hscan->rs_vistuples[mid];
> @@ -2600,7 +2595,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer 
> buffer,
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> -   end = mid - 1;
> +   end = mid;
> else
> start = mid + 1;
> }
>
> Or to make it easier to read, here:
>
> uint32start = 0,
>   end = hscan->rs_ntuples;
>
> while (start < end)
> {
> uint32mid = (start + end) / 2;
> OffsetNumber curoffset = hscan->rs_vistuples[mid];
>
> if (tupoffset == curoffset)
> return true;
> else if (tupoffset < curoffset)
> end = mid;
> else
> start = mid + 1;
> }
>
> I think this gets rid of any subtraction and is still the same.

Looks correct to me.

BTW, I kind of doubt that the overflow risk fixed in 28328ec87 is a
real issue in real-world scenarios.  Is it realistically possible to
have more than INT_MAX tuples fit on one heap page?

If it is, then the statement below could also be prone to overflow.

uint32  mid = (start + end) / 2;

Maybe you want to change it to:

uint32  mid = start + (end - start) / 2;

Thanks
Richard




Re: per backend I/O statistics

2024-12-19 Thread Alexander Lakhin

Hello Michael,

19.12.2024 06:21, Michael Paquier wrote:

Fixed that, bumped the two version counters, and done.


Could you, please, look at recent failures produced by grassquit (which
has fsync = on in its config), on an added test case? For instance, [1]:
--- /home/bf/bf-build/grassquit/HEAD/pgsql/src/test/regress/expected/stats.out 
2024-12-19 04:44:08.779311933 +
+++ /home/bf/bf-build/grassquit/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2024-12-19 
16:37:41.351784840 +

@@ -1333,7 +1333,7 @@
   AND :my_io_sum_shared_after_fsyncs= 0);
  ?column?
 --
- t
+ f
 (1 row)

The complete query is:
SELECT current_setting('fsync') = 'off'
  OR (:my_io_sum_shared_after_fsyncs = :my_io_sum_shared_before_fsyncs
  AND :my_io_sum_shared_after_fsyncs= 0);

And the corresponding query in 027_stream_regress_primary.log is:
2024-12-19 16:37:39.907 UTC [4027467][client backend][15/1980:0] LOG:  
statement: SELECT current_setting('fsync') = 'off'
  OR (1 = 1
  AND 1= 0);

(I can reproduce this locally with an asan-enabled build too.)

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-12-19%2016%3A28%3A58

Best regards,
Alexander

Re: Added schema level support for publication.

2024-12-19 Thread Amit Kapila
On Thu, Dec 19, 2024 at 12:02 AM vignesh C  wrote:
>
> On Wed, 18 Dec 2024 at 16:34, Artur Zakirov  wrote:
> >
> > On Tue, 17 Dec 2024 at 10:43, vignesh C  wrote:
> > > > If I understand your suggestion correctly I think this will break the
> > > > "--exclude-schema" option of pg_dump. That change will dump all
> > > > mappings between publications and schemas for publications which are
> > > > dumped.
> > > >
> > > > That solves the issue with special schemas, but restore will fail if
> > > > some schemas were explicitly excluded. pg_dump will include in the
> > > > dump ALTER PUBLICATION  ADD TABLES IN SCHEMA  even for
> > > > those schemas which are not created during restore.
> > >
> > > This is already the case in the existing implementation, so users
> > > should not be surprised by the proposed change.
> >
> > Currently the behavior isn't the same as the proposed change.
> >
> > Sorry, I might have been not clear when I described what might be
> > wrong with this. Here is the example with the proposed patch [1].
> >
> > Create necessary objects to test:
> >
> > create schema nsp;
> > create publication pub for tables in schema nsp;
> >
> > If you run pg_dump excluding the schema "nsp":
> >
> > pg_dump -d postgres -U postgres -f backup --exclude-schema=nsp
> >
> > In the resulting file "backup" you will have:
> >
> > ...
> > ALTER PUBLICATION pub ADD TABLES IN SCHEMA nsp;
> > ...
> >
> > which you won't have on the current master. And I think this is not
> > what users might expect and it can break some of the scenarios because
> > during restore they will have an error:
> >
> > ERROR:  schema "nsp" does not exist
>
> Yes, this is done intentionally in the proposed patch to keep it
> consistent with other scenarios in HEAD.
> For example, consider the following case:
> -- Create schema and user defined function in schema sch2
> create schema sch2;
> CREATE FUNCTION sch2.add1(integer, integer)
> RETURNS integer
> LANGUAGE sql IMMUTABLE STRICT
> AS $_$select $1 + $2;$_$;
>
> -- Create a view which references user defined function of a different schema
> create schema sch1;
> CREATE TABLE sch1.t1 (c1 integer, c2 integer);
> CREATE VIEW sch1.v1 AS SELECT c1 FROM sch1.t1 WHERE (sch2.add1(c1, c2) >= 10);
>
> -- Exclude schema sch2 which has the user defined function while dumping
> ./pg_dump -d postgres -Fc -f dump1 -N sch2
>
> The dump file has the reference to sch2.add1 even though sch2 schema
> was excluded, dump will not have the user defined functions defined in
> schema sch2:
> CREATE VIEW sch1.v1 AS
>  SELECT c1
>FROM sch1.t1
>   WHERE (sch2.add1(c1, c2) >= 10);
>
> Restore using the above dump that was generated will fail with the below 
> error:
> ./pg_restore -d test1 dump1
> pg_restore: error: could not execute query: ERROR:  schema "sch2" does not 
> exist
> LINE 4:   WHERE (sch2.add1(c1, c2) >= 10);
>  ^
> Command was: CREATE VIEW sch1.v1 AS
>  SELECT c1
>FROM sch1.t1
>   WHERE (sch2.add1(c1, c2) >= 10);
>
> The proposed patch is in similar lines.
>

I agree with the proposed patch and this behavior. It is on the lines
of what Tom proposed as one of the ways to address the issue raised in
his initial email. Also, it follows the existing behavior in cases
like view<->function dependency as shown by you in this email, and to
some extent subscription<->publication dependency as shown in email
[1].

Let's wait till the beginning of the next CF to see if there are other
suggestions or any arguments against this proposed change.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BZYanA51c9NzKM31AqJSw-j0-edGz91%2BVh-nsoKdzKfQ%40mail.gmail.com
-- 
With Regards,
Amit Kapila.




Re: Statistics Import and Export

2024-12-19 Thread Corey Huinker
>
> The biggest functional change is the way dependencies are handled for
> matview stats. Materialized views ordinarily end up in
> SECITON_PRE_DATA, but in some cases they can be postponed to
> SECTION_POST_DATA. You solved that by always putting the matview stats
> in SECTION_POST_DATA.
>

Accurate.



> I took a different approach here and, when the matview is postponed,
> also postpone the matview stats. It's slightly more code, but it felt
> closer to the rest of the structure, where postponing is a special case
> (that we might be able to remove in the future).


+1. The fact that this quirk was a knock-on effect of the postponing-quirk,
which could go away, makes this change compelling.


Re: pure parsers and reentrant scanners

2024-12-19 Thread Tom Lane
I noticed that lapwing is bleating about

ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O2 -fPIC -fvisibility=hidden -I. -I. 
-I../../src/include  -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE 
-I/usr/include/libxml2  -I/usr/include/et  -c -o cubescan.o cubescan.c
cubescan.c:1689:5: warning: no previous prototype for 'cube_yyget_column' 
[-Wmissing-prototypes]
cubescan.c:1765:6: warning: no previous prototype for 'cube_yyset_column' 
[-Wmissing-prototypes]

and likewise in segscan.c.  lapwing is using flex 2.5.35, so probably
this is the same bug worked around in parser/scan.l:

/*
 * Work around a bug in flex 2.5.35: it emits a couple of functions that
 * it forgets to emit declarations for.  Since we use -Wmissing-prototypes,
 * this would cause warnings.  Providing our own declarations should be
 * harmless even when the bug gets fixed.
 */
extern int  core_yyget_column(yyscan_t yyscanner);
extern void core_yyset_column(int column_no, yyscan_t yyscanner);

regards, tom lane




Re: Logical Replication of sequences

2024-12-19 Thread Peter Smith
Hi Vignesh,

Here are some review comments for the patch v20241211-0004.

==
GENERAL

1.
There are more than a dozen places where the relation (relkind) is
checked to see if it is a SEQUENCE:

e.g. + get_rel_relkind(subrel->srrelid) != RELKIND_SEQUENCE &&
e.g. + if (get_rel_relkind(subrel->srrelid) != RELKIND_SEQUENCE)
e.g. + if (relkind == RELKIND_SEQUENCE && !get_sequences)
e.g. + if (relkind != RELKIND_SEQUENCE && !get_tables)
e.g. + relkind == RELKIND_SEQUENCE ? "sequence" : "table",
e.g. + if (relkind != RELKIND_SEQUENCE)
e.g. + relkind == RELKIND_SEQUENCE ? "sequence" : "table",
e.g. + if (get_rel_relkind(sub_remove_rels[off].relid) == RELKIND_SEQUENCE)
e.g. + if (get_rel_relkind(relid) != RELKIND_SEQUENCE)
e.g. + relkind != RELKIND_SEQUENCE)
e.g. + Assert(get_rel_relkind(rstate->relid) == RELKIND_SEQUENCE);
e.g. + Assert(relkind == RELKIND_SEQUENCE);
e.g. + if (get_rel_relkind(rstate->relid) == RELKIND_SEQUENCE)
e.g. + Assert(get_rel_relkind(rstate->relid) != RELKIND_SEQUENCE);

I am wondering if the code might be improved slightly by adding one new macro:

#define RELKIND_IS_SEQUENCE(relkind) ((relkind) == RELKIND_SEQUENCE)

==
Commit message

2.
1) CREATE SUBSCRIPTION
- (PG17 command syntax is unchanged)
- The subscriber retrieves sequences associated with publications.
- Publisher sequences are added to pg_subscription_rel with INIT state.
- Initiates the sequencesync worker (see above) to synchronize all
  sequences.

~

Shouldn't this say "Published sequences" instead of "Publisher sequences"?

I guess if the patch currently supports only ALL SEQUENCES then maybe
it amounts to the same thing, but IMO "Published" seems more correct.

~~~

3.
2) ALTER SUBSCRIPTION ... REFRESH PUBLICATION
- (PG17 command syntax is unchanged)
- Dropped publisher sequences are removed from pg_subscription_rel.
- New publisher sequences are added to pg_subscription_rel with INIT state.
- Initiates the sequencesync worker (see above) to synchronize only
  newly added sequences.

3) ALTER SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES
- The patch introduces this new command to refresh all sequences
- Dropped publisher sequences are removed from pg_subscription_rel.
- New publisher sequences are added to pg_subscription_rel
- All sequences in pg_subscription_rel are reset to INIT state.
- Initiates the sequencesync worker (see above) to synchronize all
  sequences.

~

Ditto previous comment -- maybe those should say "Newly published
sequences" instead of "New publisher sequences".

~~~

4.
Should there be some mention of the WARNING logged if sequence
parameter differences are detected?

==
src/backend/catalog/pg_subscription.c

GetSubscriptionRelations:

5.
+ * all_states:
+ * If getting tables, if all_states is true get all tables, otherwise
+ * only get tables that have not reached READY state.
+ * If getting sequences, if all_states is true get all sequences,
+ * otherwise only get sequences that are in INIT state.
+ *
+ * The returned list is palloc'ed in the current memory context.

and

- if (not_ready)
+ if (!all_states)
  ScanKeyInit(&skey[nkeys++],
  Anum_pg_subscription_rel_srsubstate,
  BTEqualStrategyNumber, F_CHARNE,
CharGetDatum(SUBREL_STATE_READY));
~

It was a bit confusing that the code for (!all_states) selects
everything that is not in SUBREL_STATE_READY, but OTOH the function
comment for sequences said "that are in INIT state". Maybe that
function comment for sequences should also have said "that have not
reached READY state (i.e. are still in INIT state)" to better match
the code.

~~~

6.
+ /* Skip sequences if they were not requested */
+ if (relkind == RELKIND_SEQUENCE && !get_sequences)
+ continue;
+
+ /* Skip tables if they were not requested */
+ if (relkind != RELKIND_SEQUENCE && !get_tables)
+ continue;

Somehow, I feel this logic would seem simpler if expressed differently
to make it more explicit that the relation is either a table or a
sequence. e.g. by adding some variables.

SUGGESTION:

bool is_sequence;
bool is_table;

...

/* Relation is either a sequence or a table */
is_sequence = get_rel_relkind(subrel->srrelid) == RELKIND_SEQUENCE;
is_table = !is_sequence;

/* Skip sequences if they were not requested */
if (!get_sequences && is_sequence)
  continue;

/* Skip tables if they were not requested */
if (!get_tables && is_table)
  continue;

==
src/backend/commands/sequence.c

7.
 /*
- * Implement the 2 arg setval procedure.
- * See do_setval for discussion.
+ * Implement the 2 arg set sequence procedure.
+ * See SetSequence for discussion.
  */
 Datum
 setval_oid(PG_FUNCTION_ARGS)

~

Not sure this function comment is right. Shouldn't it still say
"Implement the 2 arg setval procedure."

~~~

8.
 /*
- * Implement the 3 arg setval procedure.
- * See do_setval for discussion.
+ * Implement the 3 arg set sequence procedure.
+ * See SetSequence for discussion.
  */
 Datum
 setval3

Re: [PATCH] Fixed creation of empty .log files during log rotation

2024-12-19 Thread Michael Paquier
On Fri, Dec 13, 2024 at 02:21:45PM +0300, Arseny Kositsin wrote:
> 08.12.2024 01:06, Tomas Vondra wrote:
>> In fact, isn't it wrong to do the check outside? logfile_rotate_dest()
>> is responsible for closing the log files too, and with the check outside
>> we keep the first .log file open forever (lsof shows that).
>>
>> FWIW my fix has the same issue, but IMO that just means the logic needs
>> to be more complex, but still in logfile_rotate_dest().

Yep.  All destination checks should be inside logfile_rotate_dest().
That's way cleaner to isolate which files to close and/or open when
the destinations switch.

> You can close the first file if you remove the second part of the condition
> in if (because it is stderr). I checked it in lsof and the first log file is
> closing:
> 
> @@ -1274,8 +1274,7 @@ logfile_rotate_dest(bool time_based_rotation, int
> size_rotation_for,
>   * is assumed to be always opened even if stderr is disabled in
>   * log_destination.
>   */
> -    if ((Log_destination & target_dest) == 0 &&
> -        target_dest != LOG_DESTINATION_STDERR)
> +    if ((Log_destination & target_dest) == 0)
>   {
>     if (*logFile != NULL)
>   fclose(*logFile);
> 
> But the comment above says that stderr should always remain open, even if
> disabled in log_destination. Is it really necessary to do more complex logic
> in order to close this file? I'm sorry if I misunderstood something.

If you do that the stderr log file would be closed when the stderr
destination is disabled.

> What do you think about this?

The comment you pointed to was added in 5c6e33f07153 (in
15~), but keeping the logfile fd for stderr opened all the time is
something historically much older than that, so that, AFAIK, it can
act as a backup destination even if the syslogger is not started yet
for the postmaster, or one of the other destination files could not be
opened.  See also bff84a547d71, which added a second comment in
write_syslogger_file() telling a few more details about the reason
why.  The bug fixed by this commit is not "that" old, so you cannot do
what you are suggesting.

It's true that some of these files are useless to keep around when
there is an aggressive rotation, but does it really matter?  I am
sure that there are many external tools that browse the contents of
the log directory and perform a cleanup of past files based on their
timestamp and the file name formats (like daily cleanups, etc).  So
there is the argument that not rotating the stderr files could cause
harm, as some files that should actually be kept would be removed by
such tools.
--
Michael




Eliminating SPI / SQL from some RI triggers - take 3

2024-12-19 Thread Amit Langote
Hi,

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

In short, performing the RI checks for inserts and updates of a
referencing table as direct scans of the PK index results in up to 40%
improvement in their performance, especially when they are done in a
bulk manner as shown in the following example:

create unlogged table pk (a int primary key);
create unlogged table fk (a int references pk);
insert into pk select generate_series(1, 1000);
insert into fk select generate_series(1, 1000);

On my machine, the last query took 20 seconds with master, whereas 12
seconds with the patches.  With master, a significant portion of the
time can be seen spent in ExecutorStart() and ExecutorEnd() on the
plan for the RI query, which adds up as it's done for each row in a
bulk load.  Patch avoids that overhead because it calls the index AM
directly.

The patches haven't changed in the basic design since the last update
at [2], though there are few changes:

1. I noticed a few additions to the RI trigger functions the patch
touches, such as those to support temporal foreign keys.  I decided to
leave the SQL for temporal queries in place as the plan for those
doesn't look, on a glance, as simple as a simple index scan.

2. As I mentioned in [3], the new way of doing the PK lookup didn't
have a way to recheck the PK tuple after detecting concurrent updates
of the PK, so would cause an error under READ COMMITTED isolation
level.  The old way of executing an SQL plan would deal with that
using the EvalPlanQual() mechanism in the executor.  In the updated
patch, I've added an equivalent rechecking function that's called in
the same situations as EvalPlanQual() would get called in the old
method.

3. I reordered the patches as Robert suggested at [5]. Mainly because
the patch set includes changes to address a bug where PK lookups could
return incorrect results under the REPEATABLE READ isolation level.
This issue arises because RI lookups on partitioned PK tables
manipulate ActiveSnapshot to pass the snapshot that's used by
find_inheritance_children() to determine the visibility of
detach-pending partitions to these RI lookups.  To address this, the
patch set introduces refactoring of the PartitionDesc interface,
included in patch 0001. This refactoring eliminates the need to
manipulate ActiveSnapshot by explicitly passing the correct snapshot
for detach-pending visibility handling. The main patch (0002+0003),
which focuses on improving performance by avoiding SQL queries for RI
checks, builds upon these refactoring changes to pass the snapshot
directly instead of manipulating the ActiveSnapshot.  Reordering the
patches this way ensures a logical progression of changes, as Robert
suggested, while avoiding any impression that the bug was introduced
by the ri_triggers.c changes.

However, I need to spend some time addressing Robert's feedback on the
basic design, as outlined at [5]. Specifically, the new PK lookup
function could benefit significantly from caching information rather
than recomputing it for each row. This implies that the PlanCreate
function should create a struct to store reusable information across
PlanExecute calls for different rows being checked.

Beyond implementing these changes, I also need to confirm that the new
plan execution preserves all operations performed by the SQL plan for
the same checks, particularly those affecting user-visible behavior.
I've already verified that permission checks are preserved: revoking
access to the PK table during the checks causes them to fail, as
expected. This behavior is maintained because permission checks are
performed during each execution. The planned changes to separate the
"plan" and "execute" steps should continue to uphold this and other
behaviors that might need to be preserved.

--
Thanks, Amit Langote

[1] Simplifying foreign key/RI checks:
https://www.postgresql.org/message-id/flat/CA%2BHiwqG5e8pk8s7%2B7zhr1Nc_PGyhEdM5f%3DpHkMOdK1RYWXfJsg%40mail.gmail.com

[2] Eliminating SPI from RI triggers - take 2
https://www.postgresql.org/message-id/flat/CA%2BHiwqG5e8pk8s7%2B7zhr1Nc_PGyhEdM5f%3DpHkMOdK1RYWXfJsg%40mail.gmail.com

[3] 
https://www.postgresql.org/message-id/CA+TgmoaiTNj4DgQy42OT9JmTTP1NWcMV+ke0i=+a7=vgnzq...@mail.gmail.com

[4] 
https://www.postgresql.org/message-id/CA%2BTgmoa1DCQ0MdojD9o6Ppbfj%3DabXxe4FUkwA4O_6qBHwOMVjw%40mail.gmail.com

[5] 
https://www.postgresql.org/message-id/CA+TgmoaiTNj4DgQy42OT9JmTTP1NWcMV+ke0i=+a7=vgnzq...@mail.gmail.com
From 6a2d606abf3fa17c94fa9facbf82f9fdab8135e6 Mon Sep 17 00:00:00 2001
From: Amit Langote 
Date: Thu, 19 Dec 2024 21:52:32 +0900
Subject: [PATCH v1 3/3] Avoid using an SQL query for some RI checks

For RI triggers that check whether a referenced value exists in the
referenced relation, it is sufficient to scan the foreign key
constraint's unique index directly, instead of issuing an SQL query.
This optimization improves the performance of these checks by nearly
40%, especially 

Re: Statistics Import and Export

2024-12-19 Thread Jeff Davis
On Fri, 2024-12-13 at 00:22 -0500, Corey Huinker wrote:
> Per offline conversation with Jeff, adding a --no-schema to pg_dump
> option both for completeness (we already have --no-data and --no-
> statistics), but users who previously got the effect of --no-schema
> did so by specifying --data-only, which suppresses statistics as
> well.
> 
> 0001-0005 - changes to pg_dump/pg_upgrade

Attached is a version 36j where I consolidated these patches and
cleaned up the documentation. It doesn't make a lot of sense to commit
them separately, because as soon as the pg_dump changes are there, the
pg_upgrade test starts showing a difference until it starts using the
--no-data option.

The biggest functional change is the way dependencies are handled for
matview stats. Materialized views ordinarily end up in
SECTION_PRE_DATA, but in some cases they can be postponed to
SECTION_POST_DATA. You solved that by always putting the matview stats
in SECTION_POST_DATA.

I took a different approach here and, when the matview is postponed,
also postpone the matview stats. It's slightly more code, but it felt
closer to the rest of the structure, where postponing is a special case
(that we might be able to remove in the future).

Regards,
Jeff Davis

From 136ed6586bed7b9cd4415fc882bb30a19fa2063e Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Sat, 16 Mar 2024 17:21:10 -0400
Subject: [PATCH v36j] Dump table/index stats in pg_dump and transfer in
 pg_upgrade.

For each table/matview/index dumped, pg_dump will generate a statement
that calls pg_set_relation_stats(), and a series of statements that
call pg_set_attribute_stats(), one per attribute.

During restore, these statements will recreate the statistics of the
source system in the destination system.

For pg_dump, adds the command-line options --statistics-only (-X),
--no-schema, --no-statistics, and --no-data to enable the various
combinations of schema, statistics, and data.

Table statistics are dumped in the data section. Index and
Materialized View statistics are dumped in the post-data section.

Add options --with-statistics/--no-statistics to pg_upgrade to
enable/disable transferring of statistics to the upgraded cluster. The
default is --with-statistics.

Author: Corey Huinker
Discussion: https://postgr.es/m/CADkLM=fyj-y-dnk1aw09btzydxdxs79xt8ofptq6sspwhaq...@mail.gmail.com
---
 doc/src/sgml/ref/pg_dump.sgml|  77 --
 doc/src/sgml/ref/pg_dumpall.sgml |  38 +++
 doc/src/sgml/ref/pg_restore.sgml |  47 +++-
 doc/src/sgml/ref/pgupgrade.sgml  |  17 ++
 src/bin/pg_dump/pg_backup.h  |  10 +-
 src/bin/pg_dump/pg_backup_archiver.c |   8 +
 src/bin/pg_dump/pg_dump.c| 382 ++-
 src/bin/pg_dump/pg_dump.h|   9 +
 src/bin/pg_dump/pg_dump_sort.c   |  33 ++-
 src/bin/pg_dump/pg_dumpall.c |   5 +
 src/bin/pg_dump/pg_restore.c |  27 +-
 src/bin/pg_dump/t/001_basic.pl   |  18 ++
 src/bin/pg_upgrade/dump.c|   6 +-
 src/bin/pg_upgrade/option.c  |  12 +
 src/bin/pg_upgrade/pg_upgrade.h  |   1 +
 src/tools/pgindent/typedefs.list |   1 +
 16 files changed, 655 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index d66e901f51b..9340aa70a77 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -123,14 +123,9 @@ PostgreSQL documentation
   --data-only
   

-Dump only the data, not the schema (data definitions).
+Dump only the data, not the schema (data definitions) or statistics.
 Table data, large objects, and sequence values are dumped.

-
-   
-This option is similar to, but for historical reasons not identical
-to, specifying --section=data.
-   
   
  
 
@@ -141,13 +136,11 @@ PostgreSQL documentation
   

 Include large objects in the dump.  This is the default behavior
-except when --schema, --table, or
---schema-only is specified.  The -b
-switch is therefore only useful to add large objects to dumps
-where a specific schema or table has been requested.  Note that
-large objects are considered data and therefore will be included when
---data-only is used, but not
-when --schema-only is.
+except when --schema, --table,
+--schema-only, --statistics-only, or
+--no-data is specified.  The -b
+switch is therefore only useful to add large objects to dumps where a
+specific schema or table has been requested.

   
  
@@ -516,10 +509,11 @@ PostgreSQL documentation
   --schema-only
   

-Dump only the object definitions (schema), not data.
+Dump only the object definitions (schema), not data or statistics.


-This option is the inverse of --data-only.
+This option is mutually exclusive with --data-only.
+   

Speed up ICU case conversion by using ucasemap_utf8To*()

2024-12-19 Thread Andreas Karlsson

Hi,

Jeff pointed out to me that the case conversion functions in ICU have
UTF-8-specific versions, which means we can call those directly when the
database encoding is UTF-8 and skip converting to and from UChar.


Since most people today run their databases in UTF-8, I think this
optimization is worth it; when measuring on short to medium length
strings I got a 15-20% speedup. It is still slower than glibc in my
benchmarks, but the gap is smaller now.


SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE 
"sv-SE-x-icu") FROM generate_series(1, 100) i);


master:  ~540 ms
Patched: ~460 ms
glibc:   ~410 ms

I have also attached a cleanup patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths, but it turned
out to be a bit messy due to the different function signatures of
ucasemap_utf8ToUpper() and ucasemap_utf8ToLower() vs. ucasemap_utf8ToTitle().
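For context, the mismatch is the const-ness of the UCaseMap argument (as
declared in ICU's unicode/ucasemap.h; quoted from memory, so worth
double-checking against the ICU version in use):

```c
/* ToLower/ToUpper take a const UCaseMap... */
U_CAPI int32_t U_EXPORT2
ucasemap_utf8ToLower(const UCaseMap *csm,
                     char *dest, int32_t destCapacity,
                     const char *src, int32_t srcLength,
                     UErrorCode *pErrorCode);

/* ...while ToTitle takes a non-const one, since titlecasing may lazily
 * create the case map's word break iterator. */
U_CAPI int32_t U_EXPORT2
ucasemap_utf8ToTitle(UCaseMap *csm,
                     char *dest, int32_t destCapacity,
                     const char *src, int32_t srcLength,
                     UErrorCode *pErrorCode);
```

So a single helper taking one function-pointer type would need either a
cast or two helper variants.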


Andreas
From 5a355ef083cc7de92ae1e5dcc0198866a07919eb Mon Sep 17 00:00:00 2001
From: Andreas Karlsson 
Date: Tue, 17 Dec 2024 22:47:00 +0100
Subject: [PATCH v1 1/2] Use optimized versions of ICU case conversion for
 UTF-8

Instead of converting to and from UChar when doing case conversions we
use the UTF-8 versions of the functions. This can give a significant
speedup, 15-20%, on short to medium length strings.
---
 src/backend/utils/adt/pg_locale_icu.c | 161 ++
 1 file changed, 114 insertions(+), 47 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale_icu.c b/src/backend/utils/adt/pg_locale_icu.c
index f0a77a767e7..eea6f48f6c3 100644
--- a/src/backend/utils/adt/pg_locale_icu.c
+++ b/src/backend/utils/adt/pg_locale_icu.c
@@ -12,6 +12,7 @@
 #include "postgres.h"
 
 #ifdef USE_ICU
+#include "unicode/ucasemap.h"
 #include 
 #include 
 
@@ -100,9 +101,9 @@ static size_t icu_from_uchar(char *dest, size_t destsize,
 			 const UChar *buff_uchar, int32_t len_uchar);
 static void icu_set_collation_attributes(UCollator *collator, const char *loc,
 		 UErrorCode *status);
-static int32_t icu_convert_case(ICU_Convert_Func func, pg_locale_t mylocale,
-UChar **buff_dest, UChar *buff_source,
-int32_t len_source);
+static int32_t icu_convert_case_uchar(ICU_Convert_Func func, pg_locale_t mylocale,
+	  UChar **buff_dest, UChar *buff_source,
+	  int32_t len_source);
 static int32_t u_strToTitle_default_BI(UChar *dest, int32_t destCapacity,
 	   const UChar *src, int32_t srcLength,
 	   const char *locale,
@@ -350,60 +351,126 @@ size_t
 strlower_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
 			 pg_locale_t locale)
 {
-	int32_t		len_uchar;
-	int32_t		len_conv;
-	UChar	   *buff_uchar;
-	UChar	   *buff_conv;
-	size_t		result_len;
-
-	len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
-	len_conv = icu_convert_case(u_strToLower, locale,
-&buff_conv, buff_uchar, len_uchar);
-	result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
-	pfree(buff_uchar);
-	pfree(buff_conv);
-
-	return result_len;
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status = U_ZERO_ERROR;
+		UCaseMap   *casemap;
+		int32_t		needed;
+
+		casemap = ucasemap_open(locale->info.icu.locale, U_FOLD_CASE_DEFAULT, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+	(errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+		status = U_ZERO_ERROR;
+		needed = ucasemap_utf8ToLower(casemap, dest, destsize, src, srclen, &status);
+		ucasemap_close(casemap);
+		if (status != U_BUFFER_OVERFLOW_ERROR && U_FAILURE(status))
+			ereport(ERROR,
+	(errmsg("case conversion failed: %s", u_errorName(status))));
+		return needed;
+	}
+	else
+	{
+		int32_t		len_uchar;
+		int32_t		len_conv;
+		UChar	   *buff_uchar;
+		UChar	   *buff_conv;
+		size_t		result_len;
+
+		len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
+		len_conv = icu_convert_case_uchar(u_strToLower, locale, &buff_conv,
+		  buff_uchar, len_uchar);
+		result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
+		pfree(buff_uchar);
+		pfree(buff_conv);
+
+		return result_len;
+	}
 }
 
 size_t
 strtitle_icu(char *dest, size_t destsize, const char *src, ssize_t srclen,
 			 pg_locale_t locale)
 {
-	int32_t		len_uchar;
-	int32_t		len_conv;
-	UChar	   *buff_uchar;
-	UChar	   *buff_conv;
-	size_t		result_len;
-
-	len_uchar = icu_to_uchar(&buff_uchar, src, srclen);
-	len_conv = icu_convert_case(u_strToTitle_default_BI, locale,
-&buff_conv, buff_uchar, len_uchar);
-	result_len = icu_from_uchar(dest, destsize, buff_conv, len_conv);
-	pfree(buff_uchar);
-	pfree(buff_conv);
-
-	return result_len;
+	if (GetDatabaseEncoding() == PG_UTF8)
+	{
+		UErrorCode	status = U_ZERO_ERROR;
+		UCaseMap   *casemap;
+		int32_t		needed;
+
+		casemap = ucasemap_open(locale->info.icu.locale, U_FOLD_CASE_DEFAULT, &status);
+		if (U_FAILURE(status))
+			ereport(ERROR,
+	(errmsg("casemap lookup failed: %s", u_errorName(status))));
+
+		status = U_ZERO_ERR

Re: Skip collecting decoded changes of already-aborted transactions

2024-12-19 Thread Amit Kapila
On Fri, Dec 20, 2024 at 12:42 AM Masahiko Sawada  wrote:
>
> On Thu, Dec 19, 2024 at 2:56 AM Amit Kapila  wrote:
> >
> >
> > @@ -2067,7 +2143,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn,
> > ReorderBufferChange *specinsert)
> >  {
> >   /* Discard the changes that we just streamed */
> > - ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
> > + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), true);
> >
> > @@ -1924,7 +2000,7 @@ ReorderBufferStreamCommit(ReorderBuffer *rb,
> > ReorderBufferTXN *txn)
> >   * full cleanup will happen as part of the COMMIT PREPAREDs, so now
> >   * just truncate txn by removing changes and tuplecids.
> >   */
> > - ReorderBufferTruncateTXN(rb, txn, true);
> > + ReorderBufferTruncateTXN(rb, txn, true, true);
> >
> > In both the above places, the patch unconditionally passes the
> > 'txn_streaming' even for prepared transactions when it wouldn't be a
> > streaming xact. Inside the function, the patch handled that by first
> > checking whether the transaction is prepared (txn_prepared). So, the
> > logic will work but the function signature and the way its callers are
> > using make it difficult to use and extend in the future.
> >
>
> Valid concern.
>
> > I think for the first case, we should get the streaming parameter in
> > ReorderBufferResetTXN(),
>
> I think we cannot pass 'rbtxn_is_streamed(txn)' to
> ReorderBufferTruncateTXN() in the first case. ReorderBufferResetTXN()
> is called to handle the concurrent abort of the streaming transaction
> but the transaction might not have been marked as streamed at that
> time. Since ReorderBufferTruncateTXN() is responsible for both
> discarding changes and marking the transaction as streamed, we need to
> unconditionally pass txn_streaming = true in this case.
>

Can't we use 'stream_started' variable available at the call site of
ReorderBufferResetTXN() for our purpose?

>
> On second thoughts, I think the confusion related to txn_streaming
> came from the fact that ReorderBufferTruncateTXN() does both
> discarding changes and marking the transaction as streamed. If we make
> the function do just discarding changes, we don't need to introduce
> the txn_streaming function argument. Instead, we need to have a
> separate function to mark the transaction as streamed and call it
> before ReorderBufferTruncateTXN() where appropriate. And
> ReorderBufferCheckAndTruncateAbortedTXN() just calls
> ReorderBufferTruncateTXN().
>

That sounds good to me. IIRC, initially, ReorderBufferTruncateTXN()
was used to truncate changes only for streaming transactions. Later,
it evolved for prepared xacts and now for xacts where we explicitly
detect whether they are aborted. So, I think it makes sense to improve
it by following your suggestion.
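
A rough sketch of what that split could look like (names and flag
handling here are illustrative only, not taken from any posted patch):

```c
/*
 * Sketch: mark-as-streamed becomes its own step, so that
 * ReorderBufferTruncateTXN() only discards changes and no longer
 * needs a txn_streaming argument.
 */
static void
ReorderBufferMaybeMarkTXNStreamed(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	if (!rbtxn_is_streamed(txn))
		txn->txn_flags |= RBTXN_IS_STREAMED;
}

/* streaming callers: mark first, then discard */
ReorderBufferMaybeMarkTXNStreamed(rb, txn);
ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));

/* ReorderBufferCheckAndTruncateAbortedTXN(): just discard */
ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn));
```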

>
> >
> > *
> > + * Since we don't check the transaction status while replaying the
> > + * transaction, we don't need to reset toast reconstruction data here.
> > + */
> > + ReorderBufferTruncateTXN(rb, txn, rbtxn_prepared(txn), false);
> > +
> > + /* All changes should be discarded */
> > + Assert(txn->size == 0);
> >
> > Can we expect the size to be zero without resetting the toast data? In
> > ReorderBufferToastReset(), we call ReorderBufferReturnChange() which
> > reduces the change size. So, won't that size still be accounted for in
> > txn?
>
> IIUC the toast reconstruction data is created only while replaying the
> transaction but the ReorderBufferCheckAndTruncateAbortedTXN() is not
> called during that. So I think any toast data should not be
> accumulated at that time.
>

How about the case where in the first pass, we streamed the
transaction partially, where it has reconstructed toast data, and
then, in the second pass, when memory becomes full, the reorder buffer
contains some partial data, due to which it tries to spill the data
and finds that the transaction is aborted? I could be wrong here
because I haven't tried to test this code path, but I see that it is
theoretically possible.

-- 
With Regards,
Amit Kapila.




Re: Collation & ctype method table, and extension hooks

2024-12-19 Thread Andreas Karlsson



On 12/5/24 1:21 AM, Jeff Davis wrote:

= v9-0003-Move-code-for-collation-version-into-provider-spe.patch

Moves some code from pg_collate.c into provider specific files.


I agree with the general idea, but it seems we are accumulating a lot
of provider-specific functions. Should we define a provider struct with
its own methods?

That would be a good step toward making the provider catalog-driven.
Even if we don't support CREATE LOCALE PROVIDER, having space in the
catalog would be a good place to track the provider version.


Yeah, that was my idea too but I just have not gotten around to it yet.


= v9-0004-Move-ICU-database-encoding-check-into-validation-.patch


This seems to be causing a test failure in 020_createdb.pl.


Thanks, I have attached a fixup commit for this.

Andreas
From ccaaf785a2aa14460d8360709d6f0ea4746f0157 Mon Sep 17 00:00:00 2001
From: Andreas Karlsson 
Date: Fri, 20 Dec 2024 06:47:33 +0100
Subject: [PATCH] fixup! Move ICU database encoding check into validation
 function

---
 src/bin/scripts/t/020_createdb.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 4a0e2c883a1..36d285d4777 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -59,7 +59,7 @@ if ($ENV{with_icu} eq 'yes')
 	$node->command_fails_like(
 		[
 			'createdb', '-T',
-			'template0', '--locale-provider=icu',
+			'template0', '--locale-provider=icu', '--icu-locale=en',
 			'--encoding=SQL_ASCII', 'foobarX'
 		],
 		qr/ERROR:  encoding "SQL_ASCII" is not supported with ICU provider/,
-- 
2.45.2



Re: Re: proposal: schema variables

2024-12-19 Thread jian he
Hi.
This review is based on
v20241219-0002-Storage-for-session-variables-and-SQL-interface.patch
and
v20241219-0001-Enhancing-catalog-for-support-session-variables-and-.patch.

in doc/src/sgml/catalogs.sgml



defaclobjtype char


Type of object this entry is for:
r = relation (table, view),
S = sequence,
f = function,
T = type,
n = schema


Does this need to be updated for session variables?

The psql meta-command \ddp
(listDefaultACLs in src/bin/psql/describe.c)
also needs to change.
<<>>-
+-- show variable
+SELECT public.svar;
+SELECT public.svar.c;
+SELECT (public.svar).*;
+
+-- the variable is shadowed, raise error
+SELECT public.svar.c FROM public.svar;
+
+-- can be fixed by alias
+SELECT public.svar.c FROM public.svar x

The above tests are duplicated in session_variables.sql.
<<>>-
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -49,7 +49,7 @@ typedef struct PlannedStmt

 NodeTagtype;

-CmdTypecommandType;/*
select|insert|update|delete|merge|utility */
+CmdTypecommandType;/*
select|let|insert|update|delete|merge|utility */

Since we don't have a CMD_LET CmdType,
this comment change is not necessary?
<<>>-
The following are simple tests that trigger makeParamSessionVariable error
messages. We don't have related tests; we could add them to session_variables.sql.

create type t1 as (a int, b int);
CREATE VARIABLE v1 text;
CREATE VARIABLE v2 as t1;
select v1.a;
select v2.c;

also
select v2.c;
ERROR:  could not identify column "c" in variable
LINE 1: select v2.c;
   ^
The error message is not good. I think we want:
ERROR:  could not identify column "c" in variable "public.v2"
<<>>-
+/*
+ * Check for duplicates. Note that this does not really prevent
+ * duplicates, it's here just to provide nicer error message in common
+ * case. The real protection is the unique key on the catalog.
+ */
+if (SearchSysCacheExists2(VARIABLENAMENSP,
+  PointerGetDatum(varName),
+  ObjectIdGetDatum(varNamespace)))
+{
+}
I am confused by these comments. I think SearchSysCacheExists2
does actually prevent duplicates.
I noticed that publication_add_relation also has similar comments.
<<>>-
typedef struct LetStmt
{
NodeTagtype;
List   *target;/* target variable */
Node   *query;/* source expression */
intlocation;
} LetStmt;
Here, location should be of type ParseLoc.
<<>>-
In 0001 and 0002, the function SetSessionVariableWithSecurityCheck
is never used.
<<>>-
+/*
+ * transformLetStmt -
+ *  transform an Let Statement
+ */
+static Query *
+transformLetStmt(ParseState *pstate, LetStmt *stmt)
...
+/*
+ * Save target session variable ID.  This value is used multiple times: by
+ * the query rewriter (for getting related defexpr), by planner (for
+ * acquiring an AccessShareLock on target variable), and by the executor
+ * (we need to know the target variable ID).
+ */
+query->resultVariable = varid;

By "defexpr", do you mean "default expression"?
Generally a LetStmt is like "LET variable = (SelectStmt)";
is there nothing related to the DEFAULT expression?
As for "(we need to know the target variable ID)": in ExecuteLetStmt that is
true, but when I commented out the following lines, the regression tests
still succeeded.

extract_query_dependencies_walker
/* record dependency on the target variable of a LET command */
// if (OidIsValid(query->resultVariable))
// record_plan_variable_dependency(context, query->resultVariable);

ScanQueryForLocks
// /* process session variables */
// if (OidIsValid(parsetree->resultVariable))
// {
// if (acquire)
// LockDatabaseObject(VariableRelationId, parsetree->resultVariable,
//0, AccessShareLock);
// else
// UnlockDatabaseObject(VariableRelationId, parsetree->resultVariable,
//  0, AccessShareLock);
// }
<<>>-
In transformLetStmt, we already did the ACL privilege check;
do we need to do it again in ExecuteLetStmt?