Re: a wrong index choose when statistics is out of date
On 5/3/2024 19:56, Andy Fan wrote: I think it is OK for a design review, for the implementaion side, the known issue includes: 1. Support grap such infromation from its parent for partitioned table if the child doesn't have such information. 2. builtin document and testing. Any feedback is welcome. Thanks for your efforts. I was confused when you showed the problem connected to clauses like "Var op Const" and "Var op Param". As far as I know, the estimation logic of such clauses uses MCV and number-distinct statistics. So, being out of MCV values, it becomes totally insensitive to any internal skew in data and any data outside the statistics boundaries. Having studied the example you provided with the patch, I think it is not a correct example: Difference between var_eq_const and var_eq_non_const quite obvious: In the second routine, you don't have information about the const value and can't use MCV for estimation. Also, you can't exclude MCV values from the estimation. And it is just luck that you've got the right answer. I think if you increased the weight of the unknown part, you would get a bad result, too. I would like to ask David why the var_eq_const estimator doesn't have an option for estimation with a histogram. Having that would relieve a problem with skewed data. Detecting the situation with incoming const that is out of the covered area would allow us to fall back to ndistinct estimation or something else. At least, histogram usage can be restricted by the reltuples value and ratio between the total number of MCV values and the total number of distinct values in the table. Just for demo: demonstration of data skew issue: CREATE EXTENSION tablefunc; CREATE TABLE norm_test AS SELECT abs(r::integer) AS val FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r; ANALYZE norm_test; -- First query is estimated with MCV quite precisely: EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100; -- result: planned rows=25669, actual rows=25139 -- Here we have numdistinct estimation, mostly arbitrary: EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200; -- result: planned rows=8604, actual rows=21239 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500; -- result: planned rows=8604, actual rows=6748 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600; -- result: planned rows=8604, actual rows=3501 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700; -- result: planned rows=8604, actual rows=1705 EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000; -- result: planned rows=8604, actual rows=91 -- regards, Andrei Lepikhov Postgres Professional
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada wrote: > odr-violation seems to refer to One Definition Rule (ODR). According > to Wikipedia[1]: > > The One Definition Rule (ODR) is an important rule of the C++ > programming language that prescribes that classes/structs and > non-inline functions cannot have more than one definition in the > entire program and template and types cannot have more than one > definition by translation unit. It is defined in the ISO C++ Standard > (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages > have similar but differently defined rules towards the same objective. > > I don't fully understand this concept yet but are these two different > build failures related? I thought it may have something to do with the prerequisite commit that moved some symbols from bitmapset.c to .h: /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) #define bmw_popcount(w) pg_popcount32(w) #elif BITS_PER_BITMAPWORD == 64 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) #define bmw_popcount(w) pg_popcount64(w) #else #error "invalid BITS_PER_BITMAPWORD" #endif ...but olingo's error seems strange to me, because it is complaining of pg_leftmost_one_pos, which refers to the lookup table in pg_bitutils.c -- I thought all buildfarm members used the bitscan instructions. grassquit is complaining of pg_popcount64, which is a global function, also in pg_bitutils.c. Not sure what to make of this, since we're just pointing symbols at things which should have a single definition...
Re: Stack overflow issue
Hi, Egor! On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin wrote: > > > 6 march 2024 г., at 19:17, Alexander Korotkov wrote: > > > > The revised set of remaining patches is attached. > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > I did minor revision of comments and code blocks order to improve the > > readability. > > > > 0002 Avoid stack overflow in ShowTransactionStateRec() > > I didn't notice any issues, leave this piece as is. > > > > 0003 Avoid recursion in MemoryContext functions > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > > which I think is a bit more intuitive. Also I fixed > > MemoryContextMemConsumed(), which was still trying to use the removed > > argument "print" of MemoryContextStatsInternal() function. > > > > Generally, I think this patchset fixes important stack overflow holes. > > It is quite straightforward, clear and the code has a good shape. I'm > > going to push this if no objections. > > I have tested the scripts from message [1]. After applying these patches and > Tom Lane’s patch from message [2], all of the above mentioned functions no > longer caused the server to crash. I also tried increasing the values in the > presented scripts, which also did not lead to server crashes. Thank you! > Also, I would like to clarify something. Will fixes from message [3] and > others be backported to all other branches, not just the master branch? As > far as I remember, Tom Lane made corrections to all branches. For example [4]. > > Links: > 1. > https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us > 3. > https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b Thank you for your feedback! Initially I didn't intend to backpatch any of these. But on second thought with the references you provided, I think we should backpatch simple check_stack_depth() checks from d57b7cc333 to all supported branches, but apply refactoring of memory contextes and transaction commit/abort just to master. Opinions? -- Regards, Alexander Korotkov
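For readers not familiar with these checks: the simple fixes mentioned above follow the usual backend pattern of calling check_stack_depth() at the top of every recursive function, so that excessive recursion raises a clean ERROR instead of crashing the process. A minimal sketch of the pattern (the node type and walker below are invented for illustration, not the actual patched functions):

#include "postgres.h"
#include "miscadmin.h"		/* check_stack_depth() */
#include "nodes/pg_list.h"

typedef struct DemoNode
{
	List	   *children;	/* list of DemoNode * */
} DemoNode;

static void
walk_demo_tree(DemoNode *node)
{
	ListCell   *lc;

	/* error out (limit set by max_stack_depth) before the stack overflows */
	check_stack_depth();

	foreach(lc, node->children)
		walk_demo_tree((DemoNode *) lfirst(lc));
}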
Re: speed up a logical replica setup
On Thu, 7 Mar 2024 at 10:05, Euler Taveira wrote: > > On Wed, Mar 6, 2024, at 7:02 AM, Hayato Kuroda (Fujitsu) wrote: > > Thanks for updating the patch! > > > Thanks for the feedback. I'm attaching v26 that addresses most of your > comments > and some issues pointed by Vignesh [1]. Few comments: 1) We are disconnecting database again in error case too, it will lead to a double free in this scenario, + PQclear(res); + + disconnect_database(conn, false); + + if (max_repslots < num_dbs) + { + pg_log_error("subscriber requires %d replication slots, but only %d remain", +num_dbs, max_repslots); + pg_log_error_hint("Consider increasing max_replication_slots to at least %d.", + num_dbs); + disconnect_database(conn, true); + } + + if (max_lrworkers < num_dbs) + { + pg_log_error("subscriber requires %d logical replication workers, but only %d remain", +num_dbs, max_lrworkers); + pg_log_error_hint("Consider increasing max_logical_replication_workers to at least %d.", + num_dbs); + disconnect_database(conn, true); + } pg_createsubscriber: error: subscriber requires 5 logical replication workers, but only 4 remain pg_createsubscriber: hint: Consider increasing max_logical_replication_workers to at least 5. free(): double free detected in tcache 2 Aborted 2) We can also check that the primary is not using synchronous_standby_names, else all the transactions in the primary will wait infinitely once the standby server is stopped, this could be added in the documentation: +/* + * Is the primary server ready for logical replication? + */ +static void +check_publisher(struct LogicalRepInfo *dbinfo) +{ + PGconn *conn; + PGresult *res; + + char *wal_level; + int max_repslots; + int cur_repslots; + int max_walsenders; + int cur_walsenders; + + pg_log_info("checking settings on publisher"); + + conn = connect_database(dbinfo[0].pubconninfo, true); + + /* +* If the primary server is in recovery (i.e. cascading replication), +* objects (publication) cannot be created because it is read only. +*/ + if (server_is_in_recovery(conn)) + { + pg_log_error("primary server cannot be in recovery"); + disconnect_database(conn, true); + } 3) This check is present only for publication, we do not have this in case of creating a subscription. We can keep both of them similar, i.e. have the check in both or don't have the check for both publication and subscription: + /* Check if the publication already exists */ + appendPQExpBuffer(str, + "SELECT 1 FROM pg_catalog.pg_publication " + "WHERE pubname = '%s'", + dbinfo->pubname); + res = PQexec(conn, str->data); + if (PQresultStatus(res) != PGRES_TUPLES_OK) + { + pg_log_error("could not obtain publication information: %s", +PQresultErrorMessage(res)); + disconnect_database(conn, true); + } + 4) Few options are missing: + + + pg_createsubscriber + option + + + -d + --database + +dbname + + -D + --pgdata + +datadir + + -P + --publisher-server + +connstr + + + 4a) -n, --dry-run 4b) -p, --subscriber-port 4c) -r, --retain 4d) -s, --socket-directory 4e) -t, --recovery-timeout 4f) -U, --subscriber-username 5) typo connnections should be connections + +The port number on which the target server is listening for connections. +Defaults to running the target server on port 50432 to avoid unintended +client connnections. 6) repliation should be replication + * Create the subscriptions, adjust the initial location for logical + * replication and enable the subscriptions. That's the last step for logical + * repliation setup. 
7) I did not notice these changes in the latest patch: diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index d808aad8b0..08de2bf4e6 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -517,6 +517,7 @@ CreateSeqStmt CreateStatsStmt CreateStmt CreateStmtContext +CreateSubscriberOptions CreateSubscriptionStmt CreateTableAsStmt CreateTableSpaceStmt @@ -1505,6 +1506,7 @@ LogicalRepBeginData LogicalRepCommitData LogicalRepCommitPreparedTxnData LogicalRepCtxStruct +LogicalRepInfo LogicalRepMsgType LogicalRepPartMapEntry LogicalRepPreparedTxnData Regards, Vignes
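Regarding comment 1 above (the double free): one way to restructure that error path so that disconnect_database() runs exactly once is sketched below. This is only an illustration of the shape being suggested, reusing the names from the quoted hunk and assuming the second argument of disconnect_database() means "exit after disconnecting", as the quoted code suggests; it is not the actual patch:

	bool	failed = false;

	PQclear(res);

	if (max_repslots < num_dbs)
	{
		pg_log_error("subscriber requires %d replication slots, but only %d remain",
					 num_dbs, max_repslots);
		pg_log_error_hint("Consider increasing max_replication_slots to at least %d.",
						  num_dbs);
		failed = true;
	}

	if (max_lrworkers < num_dbs)
	{
		pg_log_error("subscriber requires %d logical replication workers, but only %d remain",
					 num_dbs, max_lrworkers);
		pg_log_error_hint("Consider increasing max_logical_replication_workers to at least %d.",
						  num_dbs);
		failed = true;
	}

	/* disconnect exactly once; exit with failure if a limit was too low */
	disconnect_database(conn, failed);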
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: This might be relevant: $ git grep 'link_with: pgport_srv' src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, No other test module uses this directive, and indeed, removing this still builds fine for me. Thoughts?
Re: 035_standby_logical_decoding unbounded hang
> On 20 Feb 2024, at 04:09, Noah Misch wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep hanging in this test. For example [0]. I haven’t done root cause analysis yet, but hangs may be related to this thread. Maybe someone more familiar with similar issues could take a look there? Thanks! Best regards, Andrey Borodin. [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: > > This might be relevant: > > $ git grep 'link_with: pgport_srv' > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > No other test module uses this directive, and indeed, removing this > still builds fine for me. Thoughts? Yeah, it could be the culprit. test_radixtree/meson.build is the only test module that explicitly specifies a link with pgport_srv. I think we can get rid of it, as I've also confirmed that the build is still fine even without it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: "type with xxxx does not exist" when doing ExecMemoize()
On Thu, 7 Mar 2024 at 15:24, Tender Wang wrote: > > Andrei Lepikhov 于2024年3月6日周三 11:37写道: >> I think, it is a bug. Should it be fixed (and back-patched) earlier? > > Agreed. Need David to review it as he knows this area best. This is on my list of things to do. Just not at the top yet. David
Re: Function and Procedure with same signature?
Hi Tom On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: > > "David G. Johnston" writes: > > On Fri, Feb 9, 2024, 12:05 Deepak M wrote: > >> Folks, When tried to create a function with the same signature as > >> procedure it fails. > > > That seems like a good hint you cannot do it. Specifically because they > > get defined in the same internal catalog within which names must be unique. The fact that we currently enforce it this way seems like an implementation deficiency that should be fixed. Bringing this up again, as there seems to be no good reason to have this restriction as there is no place in the call syntax where it would be unclear if a FUNCTION or a PROCEDURE is used. And at least Oracle does allow this (and possibly others) so having this unnecessary restriction in PostgreSQL makes migrations from Oracle applications where this feature is used needlessly complicated. > Worth noting perhaps that this is actually required by the SQL > standard: per spec, functions and procedures are both "routines" > and share the same namespace, Can you point me to a place in the standard where it requires all kinds of ROUTINES to be using the same namespace ? All I could find was that ROUTINES are either FUNCTIONS, PROCEDURES or METHODS and then samples of their usage which made clear that all three are different and usage is disjoint at syntax level. As for DROP ROUTINE we could just raise an error and recommend using more specific DROP FUNCTION or DROP PROCEDURE if there is ambiguity. -- Best Regards Hannu
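For anyone who wants to see the restriction Hannu is describing, a minimal reproduction (object names made up, error text approximate):

CREATE FUNCTION get_item(id int) RETURNS text
    LANGUAGE sql
    AS $$ SELECT 'item #' || id $$;

-- Fails today, because functions and procedures share pg_proc and its
-- unique (schema, name, argument types) constraint:
CREATE PROCEDURE get_item(id int)
    LANGUAGE plpgsql
    AS $$ BEGIN NULL; END $$;
-- ERROR:  function "get_item" already exists with same argument types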
Re: a wrong index choose when statistics is out of date
On Wed, 6 Mar 2024 at 02:09, Andy Fan wrote: > This patch introduces a new attoptions like this: > > ALTER TABLE t ALTER COLUMN col set (force_generic=true); > > Then selfunc.c realizes this and ignore the special Const value, then > average selectivity is chosen. This fall into the weakness of generic > plan, but this patch doesn't introduce any new weakness and we leave the > decision to user which could resolve some problem. Also this logic only > apply to eqsel since the ineqsel have the get_actual_variable_range > mechanism which is helpful for index choose case at least. If you don't want the planner to use the statistics for the column why not just do the following? ALTER TABLE t ALTER COLUMN col SET STATISTICS 0; ANALYZE won't delete any existing statistics, so that might need to be done manually. David
Re: [EXTERNAL] Re: Add non-blocking version of PQcancel
Attached is a new patchset with various changes. I created a dedicated 0002 patch to add tests for the already existing cancellation functions, because that seemed useful for another thread where changes to the cancellation protocol are being proposed[1]. [1]: https://www.postgresql.org/message-id/flat/508d0505-8b7a-4864-a681-e7e5edfe32aa%40iki.fi On Wed, 6 Mar 2024 at 19:22, Alvaro Herrera wrote: > > Docs: one bogus "that that". This was already fixed by my previous doc changes in v32; I guess that email got crossed with this one. > Did we consider having PQcancelConn() instead be called > PQcancelCreate()? Done > "Asynchronously cancel a query on the given connection. This requires > polling the returned PGcancelConn to actually complete the cancellation > of the query." but this is no longer a good description of what this > function does. Fixed > Anyway, maybe there are reasons for this; but in any case we > should set ->cancelRequest in all cases, not only after the first tests > for errors. Done > I think the extra PGconn inside pg_cancel_conn is useless; it would be > simpler to typedef PGcancelConn to PGconn in fe-cancel.c, and remove the > indirection through the extra struct. You're actually dereferencing the > object in two ways in the new code, both by casting the outer object > straight to PGconn (taking advantage that the struct member is first in > the struct), and by using PGcancelConn->conn. This seems pointless. I > mean, if we're going to cast to "PGconn *" in some places anyway, then > we may as well access all members directly. Perhaps, if you want, you > could add asserts that ->cancelRequest is set true in all the > fe-cancel.c functions. Anyway, we'd still have compiler support to tell > you that you're passing the wrong struct to the function. (I didn't > actually try to change the code this way, so I might be wrong.) Turns out you were wrong about the compiler support to tell us we're passing the wrong struct: when both the PGconn and PGcancelConn typedefs refer to the same struct, the compiler allows passing PGconn to PGcancelConn functions and vice versa without complaining. This seems enough reason for me to keep the indirection through the extra struct. So instead of adding the proposed typedef, I chose to add a comment to pg_cancel_conn explaining its purpose, as well as not casting PGcancelConn to PGconn but always accessing the conn field for consistency. > We could move the definition of struct pg_cancel to fe-cancel.c. Nobody > outside that needs to know that definition anyway. Done in 0003 v33-0003-libpq-Move-pg_cancel-to-fe-cancel.c.patch Description: Binary data v33-0002-Add-tests-for-libpq-query-cancellation-APIs.patch Description: Binary data v33-0001-Add-missing-connection-statuses-to-docs.patch Description: Binary data v33-0005-Start-using-new-libpq-cancel-APIs.patch Description: Binary data v33-0004-libpq-Add-encrypted-and-non-blocking-versions-of.patch Description: Binary data
Re: bug report: some issues about pg_15_stable(8fa4a1ac61189efffb8b851ee77e1bc87360c445)
On Tue, 5 Mar 2024 at 13:55, Dean Rasheed wrote: > > > If I only execute merge , I will get the following error: > > merge into tgt a using src1 c on a.a = c.a when matched then update > > set b = c.b when not matched then insert (a,b) values(c.a,c.b); -- excute > > fail > > ERROR: MERGE command cannot affect row a second time > > HIINT: Ensure that not more than one source row matches any one target > > row. > > > > But when I execute the update and merge concurrently, I will get the > > following result set. > > Yes, this should still produce that error, even after a concurrent update. > > My initial reaction is that neither of those blocks of code is > entirely correct, and that they should both be doing both of those > checks. I.e., something like the attached (which probably needs some > additional test cases). > OK, I've pushed and back-patched that fix for this issue, after adding some tests (nice catch, by the way!). That wasn't related to the original issue though, so the problem with UNION ALL still remains to be fixed. The patch from [1] looks promising (for HEAD at least), but it really needs more pairs of eyes on it (bearing in mind that it's just a rough proof-of-concept patch at this stage). [1] https://www.postgresql.org/message-id/CAEZATCVa-mgPuOdgd8%2BYVgOJ4wgJuhT2mJznbj_tmsGAP8hHJw%40mail.gmail.com Regards, Dean
Re: a wrong index choose when statistics is out of date
On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov wrote: > I would like to ask David why the var_eq_const estimator doesn't have an > option for estimation with a histogram. Having that would relieve a > problem with skewed data. Detecting the situation with incoming const > that is out of the covered area would allow us to fall back to ndistinct > estimation or something else. At least, histogram usage can be > restricted by the reltuples value and ratio between the total number of > MCV values and the total number of distinct values in the table. If you can think of a way to calculate it, you should propose a patch. IIRC, we try to make the histogram buckets evenly sized based on the number of occurrences. I've not followed the code in detail, but I'd guess that doing that allows us to just subtract off the MCV frequencies and assume the remainder is evenly split over each histogram bucket, so unless we had an n_distinct per histogram bucket, or at the very least n_distinct_for_histogram_values, then how would the calculation look for what we currently record? David
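For context, the non-MCV branch of var_eq_const() that this discussion keeps circling around boils down to something like the following simplified sketch (the real selfuncs.c code has extra clamps and guards); the histogram is never consulted there, which is the gap Andrei is pointing at:

static double
eq_sel_outside_mcv(double sum_mcv_freqs, double nullfrac,
				   double ndistinct, int num_mcv)
{
	/* fraction of rows not covered by NULLs or the MCV list ... */
	double		selec = 1.0 - sum_mcv_freqs - nullfrac;

	/* ... spread evenly over the distinct values not in the MCV list */
	if (ndistinct > num_mcv)
		selec /= ndistinct - num_mcv;

	return selec;
}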
Re: a wrong index choose when statistics is out of date
David Rowley writes: > On Wed, 6 Mar 2024 at 02:09, Andy Fan wrote: >> This patch introduces a new attoptions like this: >> >> ALTER TABLE t ALTER COLUMN col set (force_generic=true); >> >> Then selfunc.c realizes this and ignore the special Const value, then >> average selectivity is chosen. This fall into the weakness of generic >> plan, but this patch doesn't introduce any new weakness and we leave the >> decision to user which could resolve some problem. Also this logic only >> apply to eqsel since the ineqsel have the get_actual_variable_range >> mechanism which is helpful for index choose case at least. > > If you don't want the planner to use the statistics for the column why > not just do the following? Actually I didn't want the planner to ignore the statistics totally, I want the planner to treat the "Const" which is probably missing from the optimizer statistics as average, which is just like what we did for the generic plan for the query below. prepare s as SELECT * FROM t WHERE a = $1 and b = $2; explain (costs off) execute s(109, 8); QUERY PLAN - Index Scan using t_a_c_idx on t Index Cond: (a = 109) Filter: (b = 8) (3 rows) Custom plan: wrong index, due to a bad estimation for a = 109. set plan_cache_mode to force_generic_plan ; explain (costs off) execute s(109, 8); QUERY PLAN --- Index Scan using t_a_b_idx on t Index Cond: ((a = $1) AND (b = $2)) -- Correct index. (2 rows) Generic plan - we use the average estimation for the missed optimizer statistics part and *if the new value is not so different from existing ones*, we can get a desired result. It is true that the "generic" way is not as exactly accurate as the "custom" way since the latter one can use the data in MCV, but that is the cost we have to pay to make the missed optimizer statistics less important, and the generic plan has the same issue as well. As for this aspect, I think the way you proposed probably has a wider use case. -- Best Regards Andy Fan
Re: a wrong index choose when statistics is out of date
Andrei Lepikhov writes: > On 5/3/2024 19:56, Andy Fan wrote: >> I think it is OK for a design review, for the implementaion side, the >> known issue includes: >> 1. Support grap such infromation from its parent for partitioned table >> if the child doesn't have such information. >> 2. builtin document and testing. >> Any feedback is welcome. > Thanks for your efforts. > I was confused when you showed the problem connected to clauses like > "Var op Const" and "Var op Param". Hmm, then what is the solution in your mind when you said "tricky" in [1]? I am thinking we have some communication gap here. > As far as I know, the estimation logic of such clauses uses MCV and > number-distinct statistics. So, being out of MCV values, it becomes > totally insensitive to any internal skew in data and any data outside > the statistics boundaries. > Having studied the example you provided with the patch, I think it is > not a correct example: > Difference between var_eq_const and var_eq_non_const quite obvious: The response should be the same as what I did in [2]; let's see if we can make the gap between us smaller. > In the second routine, you don't have information about the const value > and can't use MCV for estimation. Also, you can't exclude MCV values > from the estimation. And it is just luck that you've got the right > answer. I think if you increased the weight of the unknown part, you > would get a bad result, too. > I would like to ask David why the var_eq_const estimator doesn't have an > option for estimation with a histogram. Having that would relieve a > problem with skewed data. Detecting the situation with incoming const > that is out of the covered area would allow us to fall back to ndistinct > estimation or something else. At least, histogram usage can be > restricted by the reltuples value and ratio between the total number of > MCV values and the total number of distinct values in the table. I think an example which shows your algorithm is better would be pretty helpful for communication. [1] https://www.postgresql.org/message-id/15381eea-cbc3-4087-9d90-ab752292bd54%40postgrespro.ru [2] https://www.postgresql.org/message-id/87msra9vgo.fsf%40163.com -- Best Regards Andy Fan
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > $ git grep 'link_with: pgport_srv' > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > No other test module uses this directive, and indeed, removing this > > still builds fine for me. Thoughts? > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > sole extension that explicitly specifies a link with pgport_srv. I > think we can get rid of it as I've also confirmed the build still fine > even without it. olingo and grassquit have turned green, so that must have been it.
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Thu, 7 Mar 2024 at 19:09, Michał Kłeczek wrote: > > The following query: > > SELECT * FROM ( > SELECT 2023 AS year, * FROM remote_table_1 > UNION ALL > SELECT 2022 AS year, * FROM remote_table_2 > ) > ORDER BY year DESC; > > yields the following remote query: > > SELECT [columns] FROM remote_table_1 ORDER BY 2023 DESC > > and subsequently fails remote execution. > > > Not really sure where the problem is - the planner or postgres_fdw. > I guess it is postgres_fdw not filtering out ordering keys. Interesting. I've attached a self-contained recreator for the casual passerby. I think the fix should go in appendOrderByClause(). It's at that point we look for the EquivalenceMember for the relation and can easily discover if the em_expr is a Const. I think we can safely just skip doing any ORDER BY stuff and not worry about if the literal format of the const will appear as a reference to an ordinal column position in the ORDER BY clause. Something like the attached patch I think should work. I wonder if we need a test... David diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 8fc66fa11c..2f4ed33173 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -3914,11 +3914,11 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, int nestlevel; const char *delim = " "; StringInfo buf = context->buf; + boolgotone = false; /* Make sure any constants in the exprs are printed portably */ nestlevel = set_transmission_modes(); - appendStringInfoString(buf, " ORDER BY"); foreach(lcell, pathkeys) { PathKey*pathkey = lfirst(lcell); @@ -3949,6 +3949,21 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, if (em == NULL) elog(ERROR, "could not find pathkey item to sort"); + /* +* If the member just has a Const expression then we needn't add it to +* the ORDER BY clause. This can happen in UNION ALL queries where the +* union child targetlist has a Const. Adding these would be wasteful, +* but also, for INT columns, putting an integer literal will be seen +* as an ordinal column position rather than a value to sort by, so we +* must skip these. +*/ + if (IsA(em->em_expr, Const)) + continue; + + if (!gotone) + appendStringInfoString(buf, " ORDER BY"); + gotone = true; + em_expr = em->em_expr; /* demo_of_postgres_fdw_order_by_const_bug.sql Description: Binary data
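For readers without the attachment, a recreator along these lines (loopback server and table names invented here) shows the failing remote ORDER BY:

CREATE EXTENSION postgres_fdw;

CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

CREATE TABLE local_t1 (a int, b text);
CREATE TABLE local_t2 (a int, b text);

CREATE FOREIGN TABLE remote_table_1 (a int, b text)
    SERVER loopback OPTIONS (table_name 'local_t1');
CREATE FOREIGN TABLE remote_table_2 (a int, b text)
    SERVER loopback OPTIONS (table_name 'local_t2');

-- The per-branch constant ends up in the deparsed remote ORDER BY,
-- where the integer literal is read as an ordinal column position and
-- the remote query errors out.
SELECT * FROM (
    SELECT 2023 AS year, * FROM remote_table_1
    UNION ALL
    SELECT 2022 AS year, * FROM remote_table_2
) s
ORDER BY year DESC;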
Re: remaining sql/json patches
On 3/7/24 06:18, Himanshu Upadhyaya wrote: > On Wed, Mar 6, 2024 at 9:04 PM Tomas Vondra > wrote: > >> >> >> I'm pretty sure this is the correct & expected behavior. The second >> query treats the value as string (because that's what should happen for >> values in double quotes). >> >> ok, Then why does the below query provide the correct conversion, even if > we enclose that in double quotes? > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "1234567890", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | 1234567890 > (1 row) > > and for bigger input(string) it will leave as empty as below. > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "12345678901", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | > (1 row) > > seems it is not something to do with data enclosed in double quotes but > somehow related with internal casting it to integer and I think in case of > bigger input it is not able to cast it to integer(as defined under COLUMNS > as id int PATH 'lax $.id') > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > "id" : "12345678901", > "FULL_NAME" : "JOHN DOE"}', > '$' > COLUMNS( > name varchar(20) PATH 'lax $.FULL_NAME', > id int PATH 'lax $.id' > ) >) > ; >name | id > --+ > JOHN DOE | > (1 row) > ) > > if it is not able to represent it to integer because of bigger input, it > should error out with a similar error message instead of leaving it empty. > > Thoughts? > Ah, I see! Yes, that's a bit weird. Put slightly differently: test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "2000000000"}', '$' COLUMNS(id int PATH '$.id')); id 2000000000 (1 row) Time: 0.248 ms test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "3000000000"}', '$' COLUMNS(id int PATH '$.id')); id (1 row) Clearly, when converting the string literal into int value, there's some sort of error handling that realizes 3B overflows, and returns NULL instead. I'm not sure if this is intentional. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: A problem about partitionwise join
On Thu, Feb 22, 2024 at 2:56 PM Ashutosh Bapat wrote: > On Wed, Feb 21, 2024 at 4:55 PM Richard Guo > wrote: > > > > > > On Tue, Aug 2, 2022 at 4:24 AM Jacob Champion > wrote: > >> > >> Once you think you've built up some community support and the patchset > >> is ready for review, you (or any interested party) can resurrect the > >> patch entry by visiting > >> > >> https://commitfest.postgresql.org/38/2266/ > >> > >> and changing the status to "Needs Review", and then changing the > >> status again to "Move to next CF". (Don't forget the second step; > >> hopefully we will have streamlined this in the near future!) > > > > > > This patch was returned due to 'lack of interest'. However, upon > > verification, it appears that the reported issue still exists, and the > > proposed fix in the thread remains valid. Hence, resurrect this patch > > after rebasing it on master. I've also written a detailed commit > > message which hopefully can help people review the changes more > > effectively. > > I did a deeper review of the patch. Here are some comments Approach The equijoin condition between partition keys doesn't appear in the join's restrictilist because of 'best_score' strategy as you explained well in [2]. What if we add an extra score for clauses between partition keys and give preference to equijoin between partition keys? Have you given it a thought? I feel that having an equijoin clause involving partition keys has more usages compared to a clause with any random column. E.g. nextloop may be able to prune partitions from inner relation if the clause contains a partition key. Partition pruning requires equality clauses on partition keys as well. create_append_plan() fetches those from best_path->param_info. If we created and saved the clauses involving partition keys somewhere separately, similar to the clauses involving index keys, it might help this case as well as the partition pruning code. Have you considered this idea? There was a proposal to use ECs for outer joins as well and then use only ECs to decide whether equijoins between partition keys exist. I don't think the proposal has materialized. So we have to continue looking at restrictlist as well. I don't see a point waiting for it, but others might feel differently. I am just trying to find ways to avoid two loops in have_partkey_equi_join(). If the alternatives are worse, I think the current approach is fine. Documentation - The patch does not modify any documentation. The only documentation I could find about partitionwise join is the one for GUC 'enable_partitionwise_join'. It says --- quote "Partitionwise join currently applies only when the join conditions include all the partition keys, which must be of the same data type and have one-to-one matching sets of child partitions.". --- unquote This sentence is general and IMO covers the case this patch considers. But in general I feel that partitionwise join and aggregation deserve separate sections next to "partition pruning" in [1]; It should mention advanced partition matching algorithm as well. Would you be willing to write one and then expand it for the case in the patch? Tests - The patch adds a testcase for single column partitioning. I think we need to do better like 1. Test for partitioning on expression, multilevel partitioning, advanced partition matching. Those all might just work. Having tests helps us to notice any future breakage. 2. Some negative test case e.g. 
equijoin clauses with disjunction, with inequality operator, equality operators with operators from different families etc. 3. The testcase added looks artificial. it outputs data that has same value for two columns which is equal to the primary key of the other table - when would somebody do that?. Is there some real life example where this change will be useful? Code Minor comment for now. It will be better to increment num_equal_pks immediately after setting pk_known_equal[ipk] = true. Otherwise the code gets confusing around line 2269. I will spend more time reviewing the code next week. [1] https://www.postgresql.org/docs/current/ddl-partitioning.html [2] https://www.postgresql.org/message-id/can_9jtxucgdvy9tv6uxq0cdhrw98bztxpkfbf_75qdpi5wb...@mail.gmail.com -- Best Wishes, Ashutosh Bapat
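As a concrete illustration of the kind of negative test suggested in point 2, a rough sketch (table names invented) where a partitionwise join must not be chosen because the clause is an inequality between the partition keys:

SET enable_partitionwise_join = on;

CREATE TABLE prt1 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (100) TO (200);

CREATE TABLE prt2 (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (100);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (100) TO (200);

-- Expect a single join above two Appends, not per-partition joins.
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a < t2.a;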
Re: Properly pathify the union planner
On Thu, 15 Feb 2024 at 17:30, David Rowley wrote: > > On Tue, 6 Feb 2024 at 22:05, Richard Guo wrote: > > I'm thinking that maybe it'd be better to move the work of sorting the > > subquery's paths to the outer query level, specifically within the > > build_setop_child_paths() function, just before we stick SubqueryScanPath > > on top of the subquery's paths. I think this is better because: > > > > 1. This minimizes the impact on subquery planning and reduces the > > footprint within the grouping_planner() function as much as possible. > > > > 2. This can help avoid the aforementioned add_path() issue because the > > two involved paths will be structured as: > > Yes, this is a good idea. I agree with both of your points. > v2 attached. If anyone else or if you want to take another look, let me know soon. Otherwise, I'll assume that's the reviews over and I can take another look again. If nobody speaks up before Monday next week (11th), New Zealand time, I'm going to be looking at this again from the point of view of committing it. Thanks David
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Hi, On Wed, 6 Mar 2024 at 18:23, Cédric Villemain wrote: > > Hi Nazir, > > > thank you for your review. I comment below. > > > On 05/03/2024 12:07, Nazir Bilal Yavuz wrote: > >> 2. The second one does implement smgrprefetch with range and loops by > >> default per segment to still have a check for interrupts. > > It looks good codewise but RELSEG_SIZE is too big to prefetch. Man > > page of posix_fadvise [1] states that: "The amount of data read may be > > decreased by the kernel depending on virtual memory load. (A few > > megabytes will usually be fully satisfied, and more is rarely > > useful.)". It is trying to prefetch 1GB data now. That could explain > > your observation about differences between nr_cache numbers. > > From an "adminsys" point of view I will find beneficial to get a single > syscall per file, respecting the logic and behavior of underlying system > call. I agree. > The behavior is 100% OK, and in fact it might a bad idea to prefetch > block by block as the result is just to put more pressure on a system if > it is already under pressure. > > Though there are use cases and it's nice to be able to do that too at > this per page level. Yes, I do not know which one is more important, cache more blocks but create more pressure or create less pressure but cache less blocks. Also, pg_prewarm is designed to be run at startup so I guess there will not be much pressure. > About [1], it's very old statement about resources. And Linux manages a > part of the problem for us here I think [2]: > > /* > * Chunk the readahead into 2 megabyte units, so that we don't pin too much > * memory at once. > */ > void force_page_cache_ra() Thanks for pointing out the actual code. Yes, it looks like the kernel is already doing that. I would like to do more testing when you forward vm_relation functions into pgfincore. > >> Q: posix_fadvise may not work exactly the way you think it does, or does > >> it ? > >> > >> > >> In details, and for the question: > >> > >> However, if instead you provide a real range, or the magic len=0 to > >> posix_fadvise, then blocks are "more" loaded according to effective vm > >> pressure (which is not the case on the previous example). > >> As a result only a small part of the relation might be loaded, and this > >> is probably not what end-users expect despite being probably a good > >> choice (you can still free cache beforehand to help the kernel). > > I think it's a matter of documenting well the feature, and if at all > possible, as usual, not let users be negatively impacted by default. 
> > > >> An example, below I'm using vm_relation_cachestat() which provides linux > >> cachestat output, and vm_relation_fadvise() to unload cache, and > >> pg_prewarm for the demo: > >> > >> # clear cache: (nr_cache is the number of file system pages in cache, > >> not postgres blocks) > >> > >> ``` > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 |0 > >> 32768 | 32768 |65536 |0 > >> 65536 | 32768 |65536 |0 > >> 98304 | 32768 |65536 |0 > >>131072 |1672 | 3344 |0 > >> ``` > >> > >> # load full relation with pg_prewarm (patched) > >> > >> ``` > >> postgres=# select pg_prewarm('foo','prefetch'); > >> pg_prewarm > >> > >> 132744 > >> (1 row) > >> ``` > >> > >> # Checking results: > >> > >> ``` > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 | 320 > >> 32768 | 32768 |65536 |0 > >> 65536 | 32768 |65536 |0 > >> 98304 | 32768 |65536 |0 > >>131072 |1672 | 3344 | 320 <-- segment 1 > >> > >> ``` > >> > >> # Load block by block and check: > >> > >> ``` > >> postgres=# select from generate_series(0, 132743) g(n), lateral > >> pg_prewarm('foo','prefetch', 'main', n, n); > >> postgres=# select block_start, block_count, nr_pages, nr_cache from > >> vm_relation_cachestat('foo',range:=1024*32); > >> block_start | block_count | nr_pages | nr_cache > >> -+-+--+-- > >> 0 | 32768 |65536 |65536 > >> 32768 | 32768 |65536 |65536 > >> 65536 | 32768 |65536 |65536 > >> 98304 | 32768 |65536 |65536 > >>131072 |1672 | 3344 | 3344 > >> > >> ``` > >> > >> The duration of the last example is also really significant: full > >> relation is 0.3ms and block by block is 1550ms
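To make "a single syscall per file, looping per segment" concrete, a rough standalone sketch of the two shapes being compared (plain C, not the pg_prewarm patch itself):

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* one advisory call for a whole range; the kernel chunks the readahead */
static int
prefetch_range(int fd, off_t start, off_t len)
{
	return posix_fadvise(fd, start, len, POSIX_FADV_WILLNEED);
}

/* the slow variant measured above: one syscall per 8kB block */
static int
prefetch_block_by_block(int fd, off_t start, off_t len, off_t blocksz)
{
	for (off_t off = start; off < start + len; off += blocksz)
	{
		int		rc = posix_fadvise(fd, off, blocksz, POSIX_FADV_WILLNEED);

		if (rc != 0)
			return rc;
	}
	return 0;
}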
Re: a wrong index choose when statistics is out of date
On Thu, 7 Mar 2024 at 23:40, Andy Fan wrote: > > David Rowley writes: > > If you don't want the planner to use the statistics for the column why > > not just do the following? > > Actually I didn't want the planner to ignore the statistics totally, I > want the planner to treat the "Const" which is probably missing from the > optimizer statistics as average, which is just like what we did for the > generic plan for the query below. I'm with Andrei on this one and agree with his "And it is just luck that you've got the right answer". I think we should fix the general problem of the planner not choosing the better index. I understand you've had a go at that before, but I didn't think fudging the costs was the right fix to coax the planner into the safer choice. I'm not personally interested in any bandaid fixes for this. I'd rather see us come up with a long-term solution that just makes things better. I also understand you're probably frustrated and just want to make something better. However, it's not like it's a new problem. The more general problem of the planner making risky choices predates both of our time spent working on PostgreSQL, so I don't think a hasty solution that fixes some small subset of the problem is that helpful. David
Re: Properly pathify the union planner
On Thu, Mar 7, 2024 at 7:16 PM David Rowley wrote: > On Thu, 15 Feb 2024 at 17:30, David Rowley wrote: > > > > On Tue, 6 Feb 2024 at 22:05, Richard Guo wrote: > > > I'm thinking that maybe it'd be better to move the work of sorting the > > > subquery's paths to the outer query level, specifically within the > > > build_setop_child_paths() function, just before we stick > SubqueryScanPath > > > on top of the subquery's paths. I think this is better because: > > > > > > 1. This minimizes the impact on subquery planning and reduces the > > > footprint within the grouping_planner() function as much as possible. > > > > > > 2. This can help avoid the aforementioned add_path() issue because the > > > two involved paths will be structured as: > > > > Yes, this is a good idea. I agree with both of your points. > > > v2 attached. > > If anyone else or if you want to take another look, let me know soon. > Otherwise, I'll assume that's the reviews over and I can take another > look again. Hi David, I would like to have another look, but it might take several days. Would that be too late? Thanks Richard
Re: [HACKERS] make async slave to wait for lsn to be replayed
Intro == The main purpose of the feature is to achieve read-your-writes-consistency, while using async replica for reads and primary for writes. In that case lsn of last modification is stored inside application. We cannot store this lsn inside database, since reads are distributed across all replicas and primary. Two implementations of one feature == We left two different solutions. Help me please to choose the best. 1) Classic (wait_classic_v7.patch) https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru Synopsis == advantages: multiple wait events, separate WAIT FOR statement disadvantages: new words in grammar WAIT FOR [ANY | ALL] event [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR [ANY | ALL] event [, ...]] event: LSN value TIMEOUT number_of_milliseconds timestamp 2) After style: Kyotaro and Freund (wait_after_within_v6.patch) https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru Synopsis == advantages: no new words in grammar disadvantages: a little harder to understand, fewer events to wait AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...] BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ AFTER lsn_event [ WITHIN delay_milliseconds ]] Examples == primary standby --- postgresql.conf recovery_min_apply_delay = 10s CREATE TABLE tbl AS SELECT generate_series(1,10) AS a; INSERT INTO tbl VALUES (generate_series(11, 20)); SELECT pg_current_wal_lsn(); BEGIN WAIT FOR LSN '0/3002AE8'; SELECT * FROM tbl; // read fresh insertions COMMIT; Rebased and ready for review. -- Ivan Kartyshov Postgres Professional: www.postgrespro.comdiff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml index 4a42999b18..657a217e27 100644 --- a/doc/src/sgml/ref/allfiles.sgml +++ b/doc/src/sgml/ref/allfiles.sgml @@ -188,6 +188,7 @@ Complete list of usable sgml source files in this directory. + diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml index 016b021487..b3af16c09f 100644 --- a/doc/src/sgml/ref/begin.sgml +++ b/doc/src/sgml/ref/begin.sgml @@ -21,13 +21,21 @@ PostgreSQL documentation -BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] +BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_for_event is: +WAIT FOR [ANY | ALL] event [, ...] + +and event is one of: +LSN lsn_value +TIMEOUT number_of_milliseconds +timestamp diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml index 74ccd7e345..1b54ed2084 100644 --- a/doc/src/sgml/ref/start_transaction.sgml +++ b/doc/src/sgml/ref/start_transaction.sgml @@ -21,13 +21,21 @@ PostgreSQL documentation -START TRANSACTION [ transaction_mode [, ...] ] +START TRANSACTION [ transaction_mode [, ...] ] wait_for_event where transaction_mode is one of: ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED } READ WRITE | READ ONLY [ NOT ] DEFERRABLE + +where wait_for_event is: +WAIT FOR [ANY | ALL] event [, ...] 
+ +and event is one of: +LSN lsn_value +TIMEOUT number_of_milliseconds +timestamp diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml new file mode 100644 index 00..26cae3ad85 --- /dev/null +++ b/doc/src/sgml/ref/wait.sgml @@ -0,0 +1,146 @@ + + + + + WAIT FOR + + + + WAIT FOR + 7 + SQL - Language Statements + + + + WAIT FOR + wait for the target LSN to be replayed or for specified time to pass + + + + +WAIT FOR [ANY | ALL] event [, ...] + +where event is one of: +LSN value +TIMEOUT number_of_milliseconds +timestamp + +WAIT FOR LSN 'lsn_number' +WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout +WAIT FOR LSN 'lsn_number', TIMESTAMP wait_time +WAIT FOR TIMESTAMP wait_time +WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time +WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time + + + + + Description + + + WAIT FOR provides a simple interprocess communication + mechanism to wait for timestamp, timeout or the target log sequence number + (LSN) on standby in PostgreSQL + databases with master-standby asynchronous replication. When run with the + LSN option, the WAIT FOR + command waits for the specified LSN to be replayed. + If no timestamp or timeout was specified, wait time is unlimited. + Waiting can be interrupted using Ctrl+C, or + by shutting down the postgres server. + + + + + + Parameters + + + +
Re: Shared detoast Datum proposal
On 3/7/24 08:33, Nikita Malakhov wrote: > Hi! > > Tomas, I thought about this issue - >> What if you only need the first slice? In that case decoding everything >> will be a clear regression. > And completely agree with you, I'm currently working on it and will post > a patch a bit later. Cool. I don't plan to work on improving my patch - it was merely a PoC, so if you're interested in working on that, that's good. > Another issue we have here - if we have multiple slices requested - > then we have to update cached slice from both sides, the beginning > and the end. > No opinion. I haven't thought about how to handle slices too much. > On update, yep, you're right >> Surely the update creates a new TOAST value, with a completely new TOAST >> pointer, new rows in the TOAST table etc. It's not updated in-place. So >> it'd end up with two independent entries in the TOAST cache. > >> Or are you interested just in evicting the old value simply to free the >> memory, because we're not going to need that (now invisible) value? That >> seems interesting, but if it's too invasive or complex I'd just leave it >> up to the LRU eviction (at least for v1). > Again, yes, we do not need the old value after it was updated and > it is better to take care of it explicitly. It's a simple and not invasive > addition to your patch. > OK > Could not agree with you about large values - it makes sense > to cache large values because the larger it is the more benefit > we will have from not detoasting it again and again. One way > is to keep them compressed, but what if we have data with very > low compression rate? > I'm not against caching large values, but I also think it's not difficult to construct cases where it's a losing strategy. > I also changed HASHACTION field value from HASH_ENTER to > HASH_ENTER_NULL to softly deal with case when we do not > have enough memory and added checks for if the value was stored > or not. > I'm not sure about this. HASH_ENTER_NULL is only being used in three very specific places (two are lock management, one is shmeminitstruct). This hash table is not unique in any way, so why not to use HASH_ENTER like 99% other hash tables? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
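For context on that last point, the only difference between the two is how dynahash reports running out of memory; a simplified sketch of the two calling patterns follows (the entry and key names are invented). Note also that, at least as I read dynahash.c, HASH_ENTER_NULL is asserted to be usable only with non-default (shared-memory) allocators, which may be part of why it shows up in so few places:

	ToastCacheEntry *entry;
	bool		found;

	/* usual pattern: out of memory simply raises an ERROR */
	entry = (ToastCacheEntry *) hash_search(toast_cache, &key,
											HASH_ENTER, &found);

	/* HASH_ENTER_NULL pattern: out of memory returns NULL instead */
	entry = (ToastCacheEntry *) hash_search(toast_cache, &key,
											HASH_ENTER_NULL, &found);
	if (entry == NULL)
		return;				/* just skip caching this value */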
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Thu, Mar 7, 2024 at 4:39 PM David Rowley wrote: > On Thu, 7 Mar 2024 at 19:09, Michał Kłeczek wrote: > > > > The following query: > > > > SELECT * FROM ( > > SELECT 2023 AS year, * FROM remote_table_1 > > UNION ALL > > SELECT 2022 AS year, * FROM remote_table_2 > > ) > > ORDER BY year DESC; > > > > yields the following remote query: > > > > SELECT [columns] FROM remote_table_1 ORDER BY 2023 DESC > > > > and subsequently fails remote execution. > > > > > > Not really sure where the problem is - the planner or postgres_fdw. > > I guess it is postgres_fdw not filtering out ordering keys. > > Interesting. I've attached a self-contained recreator for the casual > passerby. > > I think the fix should go in appendOrderByClause(). It's at that > point we look for the EquivalenceMember for the relation and can > easily discover if the em_expr is a Const. I think we can safely just > skip doing any ORDER BY stuff and not worry about if the > literal format of the const will appear as a reference to an ordinal > column position in the ORDER BY clause. > deparseSortGroupClause() calls deparseConst() with showtype = 1. appendOrderByClause() may want to do something similar for consistency. Or remove it from deparseSortGroupClause() as well? > > Something like the attached patch I think should work. > > I wonder if we need a test... > Yes. -- Best Wishes, Ashutosh Bapat
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 8:13 PM Tomas Vondra wrote: > On 3/7/24 06:18, Himanshu Upadhyaya wrote: Thanks Himanshu for the testing. > > On Wed, Mar 6, 2024 at 9:04 PM Tomas Vondra > > wrote: > >> > >> I'm pretty sure this is the correct & expected behavior. The second > >> query treats the value as string (because that's what should happen for > >> values in double quotes). > >> > >> ok, Then why does the below query provide the correct conversion, even if > > we enclose that in double quotes? > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "1234567890", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | 1234567890 > > (1 row) > > > > and for bigger input(string) it will leave as empty as below. > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "12345678901", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | > > (1 row) > > > > seems it is not something to do with data enclosed in double quotes but > > somehow related with internal casting it to integer and I think in case of > > bigger input it is not able to cast it to integer(as defined under COLUMNS > > as id int PATH 'lax $.id') > > > > ‘postgres[102531]=#’SELECT * FROM JSON_TABLE(jsonb '{ > > "id" : "12345678901", > > "FULL_NAME" : "JOHN DOE"}', > > '$' > > COLUMNS( > > name varchar(20) PATH 'lax $.FULL_NAME', > > id int PATH 'lax $.id' > > ) > >) > > ; > >name | id > > --+ > > JOHN DOE | > > (1 row) > > ) > > > > if it is not able to represent it to integer because of bigger input, it > > should error out with a similar error message instead of leaving it empty. > > > > Thoughts? > > > > Ah, I see! Yes, that's a bit weird. Put slightly differently: > > test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "20"}', > '$' COLUMNS(id int PATH '$.id')); > id > > 20 > (1 row) > > Time: 0.248 ms > test=# SELECT * FROM JSON_TABLE(jsonb '{"id" : "30"}', > '$' COLUMNS(id int PATH '$.id')); > id > > > (1 row) > > Clearly, when converting the string literal into int value, there's some > sort of error handling that realizes 3B overflows, and returns NULL > instead. I'm not sure if this is intentional. Indeed. This boils down to the difference in the cast expression chosen to convert the source value to int in the two cases. The case where the source value has no quotes, the chosen cast expression is a FuncExpr for function numeric_int4(), which has no way to suppress errors. When the source value has quotes, the cast expression is a CoerceViaIO expression, which can suppress the error. The default behavior is to suppress the error and return NULL, so the correct behavior is when the source value has quotes. I think we'll need either: * fix the code in 0001 to avoid getting numeric_int4() in this case, and generally cast functions that don't have soft-error handling support, in favor of using IO coercion. * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to suppress errors and fix downstream functions like numeric_int4() to comply by handling errors softly. I'm inclined to go with the 1st option as we already have the infrastructure in place -- input functions can all handle errors softly. For the latter, it uses numeric_int4() which doesn't support soft-error handling, so throws the error. 
-- Thanks, Amit Langote
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Thu, Mar 7, 2024 at 5:14 PM Kartyshov Ivan wrote: > > Intro > == > The main purpose of the feature is to achieve > read-your-writes-consistency, while using async replica for reads and > primary for writes. In that case lsn of last modification is stored > inside application. We cannot store this lsn inside database, since > reads are distributed across all replicas and primary. > > > Two implementations of one feature > == > We left two different solutions. Help me please to choose the best. > > > 1) Classic (wait_classic_v7.patch) > https://www.postgresql.org/message-id/3cc883048264c2e9af022033925ff8db%40postgrespro.ru > Synopsis > == > advantages: multiple wait events, separate WAIT FOR statement > disadvantages: new words in grammar > > > > WAIT FOR [ANY | ALL] event [, ...] > BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ WAIT FOR [ANY | ALL] event [, ...]] > event: > LSN value > TIMEOUT number_of_milliseconds > timestamp > > > > 2) After style: Kyotaro and Freund (wait_after_within_v6.patch) > https://www.postgresql.org/message-id/d3ff2e363af60b345f82396992595a03%40postgrespro.ru > Synopsis > == > advantages: no new words in grammar > disadvantages: a little harder to understand, fewer events to wait > > > > AFTER lsn_event [ WITHIN delay_milliseconds ] [, ...] > BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ AFTER lsn_event [ WITHIN delay_milliseconds ]] > START [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] > [ AFTER lsn_event [ WITHIN delay_milliseconds ]] > +1 for the second one not only because it avoids new words in grammar but also sounds to convey the meaning. I think you can explain in docs how this feature can be used basically how will one get the correct LSN value to specify. As suggested previously also pick one of the approaches (I would advocate the second one) and keep an option for the second one by mentioning it in the commit message. I hope to see more reviews/discussions or usage like how will users get the LSN value to be specified on the core logic of the feature at this stage. IF possible, state, how real-world applications could leverage this feature. -- With Regards, Amit Kapila.
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
Hi Nazir, On 07/03/2024 12:19, Nazir Bilal Yavuz wrote: On Wed, 6 Mar 2024 at 18:23, Cédric Villemain wrote: The behavior is 100% OK, and in fact it might a bad idea to prefetch block by block as the result is just to put more pressure on a system if it is already under pressure. Though there are use cases and it's nice to be able to do that too at this per page level. Yes, I do not know which one is more important, cache more blocks but create more pressure or create less pressure but cache less blocks. Also, pg_prewarm is designed to be run at startup so I guess there will not be much pressure. autowarm is designed for that purpose but pg_prewarm is free to use when neeed. About [1], it's very old statement about resources. And Linux manages a part of the problem for us here I think [2]: /* * Chunk the readahead into 2 megabyte units, so that we don't pin too much * memory at once. */ void force_page_cache_ra() Thanks for pointing out the actual code. Yes, it looks like the kernel is already doing that. I would like to do more testing when you forward vm_relation functions into pgfincore. I hope to be able to get back there next week max. An example, below I'm using vm_relation_cachestat() which provides linux cachestat output, and vm_relation_fadvise() to unload cache, and pg_prewarm for the demo: # clear cache: (nr_cache is the number of file system pages in cache, not postgres blocks) ``` postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 |0 32768 | 32768 |65536 |0 65536 | 32768 |65536 |0 98304 | 32768 |65536 |0 131072 |1672 | 3344 |0 ``` # load full relation with pg_prewarm (patched) ``` postgres=# select pg_prewarm('foo','prefetch'); pg_prewarm 132744 (1 row) ``` # Checking results: ``` postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 | 320 32768 | 32768 |65536 |0 65536 | 32768 |65536 |0 98304 | 32768 |65536 |0 131072 |1672 | 3344 | 320 <-- segment 1 ``` # Load block by block and check: ``` postgres=# select from generate_series(0, 132743) g(n), lateral pg_prewarm('foo','prefetch', 'main', n, n); postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32); block_start | block_count | nr_pages | nr_cache -+-+--+-- 0 | 32768 |65536 |65536 32768 | 32768 |65536 |65536 65536 | 32768 |65536 |65536 98304 | 32768 |65536 |65536 131072 |1672 | 3344 | 3344 ``` The duration of the last example is also really significant: full relation is 0.3ms and block by block is 1550ms! You might think it's because of generate_series or whatever, but I have the exact same behavior with pgfincore. I can compare loading and unloading duration for similar "async" work, here each call is from block 0 with len of 132744 and a range of 1 block (i.e. posix_fadvise on 8kB at a time). So they have exactly the same number of operations doing DONTNEED or WILLNEED, but distinct duration on the first "load": ``` postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED'); vm_relation_fadvise - (1 row) Time: 25.202 ms postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED'); vm_relation_fadvise - (1 row) Time: 1523.636 ms (00:01.524) <- not free ! 
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED'); vm_relation_fadvise - (1 row) Time: 24.967 ms ``` I confirm that there is a time difference between calling pg_prewarm by full relation and block by block, but IMO this is expected. When pg_prewarm is called by full relation, it does the initialization part just once but when it is called block by block, it does initialization for each call, right? Not sure what initialization is here exactly, in my example with WILLNEED/DONTNEED there are exactly the same code pattern and syscall request(s), just the flag is distinct, so initialization cost are expected to be very similar. Sorry, there was a miscommunication. I was talking about pg_prewarm's initialization, meaning if the pg_prewarm is called block by block (by using generate_series); it will make block_count times initial
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-05, Dean Rasheed wrote: > So I think RelationGetIndexAttrBitmap() should include deferrable PKs, I tried this, but it doesn't actually lead to a good place, because if we allow deferrable PKs to identify rows, then they are not useful to find the tuple to update when replicating. Consider the following case: $node_publisher->safe_psql('postgres', 'create table deferred_pk (id int primary key initially deferred, hidden int, value text)'); $node_subscriber->safe_psql('postgres', 'create table deferred_pk (id int primary key initially deferred, hidden int, value text)'); $node_subscriber->safe_psql('postgres', 'alter subscription tap_sub refresh publication'); $node_publisher->safe_psql('postgres', "insert into deferred_pk (id, hidden, value) values (1, 1, 'first')"); $node_publisher->wait_for_catchup('tap_sub'); $node_publisher->safe_psql('postgres', qq{ begin; insert into deferred_pk values (1, 2, 'conflicting'); update deferred_pk set value = value || ', updated' where id = 1 and hidden = 2; update deferred_pk set id = 3, value = value || ', updated' where hidden = 2; commit}); $node_publisher->wait_for_catchup('tap_sub'); my $pubdata = $node_publisher->safe_psql('postgres', 'select * from deferred_pk order by id'); my $subsdata = $node_subscriber->safe_psql('postgres', 'select * from deferred_pk order by id'); is($subsdata, $pubdata, "data is equal"); Here, the publisher's transaction first creates a new record with the same PK, which only works because the PK is deferred; then we update its payload column. When this is replicated, the row is identified by the PK ... but replication actually updates the other row, because it's found first: # Failed test 'data is equal' # at t/003_constraints.pl line 163. # got: '1|2|conflicting # 3|2|conflicting, updated, updated' # expected: '1|1|first # 3|2|conflicting, updated, updated' Actually, is that what happened here? I'm not sure, but clearly this is bogus. So I think the original developers of REPLICA IDENTITY had the right idea here (commit 07cacba983ef), and we mustn't change this aspect, because it'll lead to data corruption in replication. Using a deferred PK for DDL considerations seems OK, but it seems certain that for actual data replication it's going to be a disaster. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
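Stripped of the TAP machinery, the underlying hazard is simply that a deferred PK allows transient duplicates within a transaction, so at decode time the key may not identify a unique row. A minimal sketch:
```
CREATE TABLE deferred_pk (
    id    int PRIMARY KEY INITIALLY DEFERRED,
    value text
);

INSERT INTO deferred_pk VALUES (1, 'first');

BEGIN;
-- Allowed: uniqueness is only checked at commit time.
INSERT INTO deferred_pk VALUES (1, 'conflicting');
-- At this point two rows share id = 1, so "WHERE id = 1" is ambiguous,
-- which is exactly what trips up the apply worker on the subscriber.
UPDATE deferred_pk SET id = 3 WHERE value = 'conflicting';
COMMIT;  -- the constraint is verified here; no violation remains
```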
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On Thu, 7 Mar 2024 at 13:00, Alvaro Herrera wrote: > > So I think the original developers of REPLICA IDENTITY had the right > idea here (commit 07cacba983ef), and we mustn't change this aspect, > because it'll lead to data corruption in replication. Using a deferred > PK for DDL considerations seems OK, but it seems certain that for actual > data replication it's going to be a disaster. > Yes, that makes sense. If I understand correctly though, the replication code uses relation->rd_replidindex (not relation->rd_pkindex), although sometimes it's the same thing. So can we get away with making sure that RelationGetIndexList() doesn't set relation->rd_replidindex to a deferrable PK, while still allowing relation->rd_pkindex to be one? Regards, Dean
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 8:06 PM Amit Langote wrote: > > > Indeed. > > This boils down to the difference in the cast expression chosen to > convert the source value to int in the two cases. > > The case where the source value has no quotes, the chosen cast > expression is a FuncExpr for function numeric_int4(), which has no way > to suppress errors. When the source value has quotes, the cast > expression is a CoerceViaIO expression, which can suppress the error. > The default behavior is to suppress the error and return NULL, so the > correct behavior is when the source value has quotes. > > I think we'll need either: > > * fix the code in 0001 to avoid getting numeric_int4() in this case, > and generally cast functions that don't have soft-error handling > support, in favor of using IO coercion. > * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to > suppress errors and fix downstream functions like numeric_int4() to > comply by handling errors softly. > > I'm inclined to go with the 1st option as we already have the > infrastructure in place -- input functions can all handle errors > softly. not sure this is the right way. but attached patches solved this problem. Also, can you share the previous memory-hogging bug issue when you are free, I want to know which part I am missing. v42-0001-minor-refactor-SQL-JSON-query-functions-based.no-cfbot Description: Binary data v42-0002-minor-refactor-JOSN_TABLE-functions-based-on-.no-cfbot Description: Binary data
Re: remaining sql/json patches
Hi, I was experimenting with the v42 patches, and I think the handling of ON EMPTY / ON ERROR clauses may need some improvement. The grammar is currently defined like this: | json_behavior ON EMPTY_P json_behavior ON ERROR_P This means the clauses have to be defined exactly in this order, and if someone does NULL ON ERROR NULL ON EMPTY it results in syntax error. I'm not sure what the SQL standard says about this, but it seems other databases don't agree on the order. Is there a particular reason to not allow both orderings? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
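For concreteness, a sketch of the two spellings against the v42 grammar (JSON_VALUE shown; the behavior described is the patch's, not released PostgreSQL's):
```
-- Accepted: ON EMPTY must precede ON ERROR
SELECT JSON_VALUE(jsonb '{"a": 1}', '$.b' NULL ON EMPTY NULL ON ERROR);

-- Rejected with a syntax error under the v42 grammar, although some
-- other databases accept this ordering
SELECT JSON_VALUE(jsonb '{"a": 1}', '$.b' NULL ON ERROR NULL ON EMPTY);
```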
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 22:46 jian he wrote: > On Thu, Mar 7, 2024 at 8:06 PM Amit Langote > wrote: > > > > > > Indeed. > > > > This boils down to the difference in the cast expression chosen to > > convert the source value to int in the two cases. > > > > The case where the source value has no quotes, the chosen cast > > expression is a FuncExpr for function numeric_int4(), which has no way > > to suppress errors. When the source value has quotes, the cast > > expression is a CoerceViaIO expression, which can suppress the error. > > The default behavior is to suppress the error and return NULL, so the > > correct behavior is when the source value has quotes. > > > > I think we'll need either: > > > > * fix the code in 0001 to avoid getting numeric_int4() in this case, > > and generally cast functions that don't have soft-error handling > > support, in favor of using IO coercion. > > * fix FuncExpr (like CoerceViaIO) to respect SQL/JSON's request to > > suppress errors and fix downstream functions like numeric_int4() to > > comply by handling errors softly. > > > > I'm inclined to go with the 1st option as we already have the > > infrastructure in place -- input functions can all handle errors > > softly. > > not sure this is the right way. > but attached patches solved this problem. > > Also, can you share the previous memory-hogging bug issue > when you are free, I want to know which part I am missing. Take a look at the json_populate_type() call in ExecEvalJsonCoercion() or specifically compare the new way of passing its void *cache parameter with the earlier patches. >
Re: remaining sql/json patches
On 2024-Mar-07, Tomas Vondra wrote: > I was experimenting with the v42 patches, and I think the handling of ON > EMPTY / ON ERROR clauses may need some improvement. Well, the 2023 standard says things like ::= JSON_VALUE [ ] [ ON EMPTY ] [ ON ERROR ] which implies that if you specify it the other way around, it's a syntax error. > I'm not sure what the SQL standard says about this, but it seems other > databases don't agree on the order. Is there a particular reason to > not allow both orderings? I vaguely recall that trying to also support the other ordering leads to having more rules. Now maybe we do want that because of compatibility with other DBMSs, but frankly at this stage I wouldn't bother. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and lack of hesitasion in answering a lost soul's question, I just wished the rest of the mailing list could be like this." (Fotis) https://postgr.es/m/200606261359.k5qdxe2p004...@auth-smtp.hol.gr
Re: Add system identifier to backup manifest
On Wed, Mar 6, 2024 at 11:22 PM Amul Sul wrote: >> You are not changing silently the internals of get_controlfile(), so >> no objections here. The name of the new routine could be shorter, but >> being short of ideas what you are proposing looks fine by me. > > Could be get_controlfile_by_path() ? It could. I just thought this was clearer. I agree that it's a bit long, but I don't think this is worth bikeshedding very much. If at a later time somebody feels strongly that it needs to be changed, so be it. Right now, getting on with the business at hand is more important, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
Re: remaining sql/json patches
On Thu, Mar 7, 2024 at 23:14 Alvaro Herrera wrote: > On 2024-Mar-07, Tomas Vondra wrote: > > > I was experimenting with the v42 patches, and I think the handling of ON > > EMPTY / ON ERROR clauses may need some improvement. > > Well, the 2023 standard says things like > > ::= > JSON_VALUE > > [ ] > [ ON EMPTY ] > [ ON ERROR ] > > > which implies that if you specify it the other way around, it's a syntax > error. > > > I'm not sure what the SQL standard says about this, but it seems other > > databases don't agree on the order. Is there a particular reason to > > not allow both orderings? > > I vaguely recall that trying to also support the other ordering leads to > having more rules. Yeah, I think that was it. At one point, I removed rules supporting syntax that wasn’t documented. Now maybe we do want that because of compatibility > with other DBMSs, but frankly at this stage I wouldn't bother. +1. >
Re: POC, WIP: OR-clause support for indexes
Hi! On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov wrote: > On 5/3/2024 12:30, Andrei Lepikhov wrote: > > On 4/3/2024 09:26, jian he wrote: > ... and the new version of the patchset is attached. I made some revisions for the patchset. 1) Use hash_combine() to combine hash values. 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. One important issue I found. # create table t as (select i::int%100 i from generate_series(1,1) i); # analyze t; # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=200 width=4) Filter: (i = ANY ('{1,1}'::integer[])) (2 rows) # set enable_or_transformation = false; SET # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=100 width=4) Filter: (i = 1) (2 rows) We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. I suggest Andrei and/or Alena should implement making array values unique. -- Regards, Alexander Korotkov v20-0002-Teach-generate_bitmap_or_paths-to-build-BitmapOr.patch Description: Binary data v20-0001-Transform-OR-clauses-to-ANY-expression.patch Description: Binary data
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-07, Dean Rasheed wrote: > On Thu, 7 Mar 2024 at 13:00, Alvaro Herrera wrote: > > > > So I think the original developers of REPLICA IDENTITY had the right > > idea here (commit 07cacba983ef), and we mustn't change this aspect, > > because it'll lead to data corruption in replication. Using a deferred > > PK for DDL considerations seems OK, but it seems certain that for actual > > data replication it's going to be a disaster. > > Yes, that makes sense. If I understand correctly though, the > replication code uses relation->rd_replidindex (not > relation->rd_pkindex), although sometimes it's the same thing. So can > we get away with making sure that RelationGetIndexList() doesn't set > relation->rd_replidindex to a deferrable PK, while still allowing > relation->rd_pkindex to be one? Well, not really, because the logical replication code for some reason uses GetRelationIdentityOrPK(), which falls back to rd->pk_index (via RelationGetPrimaryKeyIndex) if rd_replindex is not set. Maybe we can add a flag RelationData->rd_ispkdeferred, so that RelationGetPrimaryKeyIndex returned InvalidOid for deferrable PKs; then logical replication would continue to not know about this PK, which perhaps is what we want. I'll do some testing with this. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: WIP Incremental JSON Parser
Some more observations as I make my way through the patch: In src/common/jsonapi.c, > +#define JSON_NUM_NONTERMINALS 6 Should this be 5 now? > + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), > + > chunk, size, is_last); > + > + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; > + > + if (res != expected) > + json_manifest_parse_failure(context, "parsing failed"); This leads to error messages like pg_verifybackup: error: could not parse backup manifest: parsing failed which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code. Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`. For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl: > - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); > - ok(!$stderr, "chunk size $size: no error output"); > + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); > + is($stderr, "", "chunk size $size: no error output"); This is particularly helpful when a test fails spuriously due to code coverage spray on stderr. Thanks, --Jacob [1] https://www.postgresql.org/message-id/CAOYmi%2BmSSY4SvOtVN7zLyUCQ4-RDkxkzmTuPEN%2Bt-PsB7GHnZA%40mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. Cool! I've attached the remaining patches for CI. I've made some minor changes in separate patches and drafted the commit message for tidstore patch. While reviewing the tidstore code, I thought that it would be more appropriate to place tidstore.c under src/backend/lib instead of src/backend/common/access since the tidstore is no longer implemented only for heap or other access methods, and it might also be used by executor nodes in the future. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v68-ART.tar.gz Description: GNU Zip compressed data
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Hello, I am not up to date with the last version of patch but I did a regular benchmark with version 11 and typical issue we have at the moment and the result are still very very good. [1] In term of performance improvement the last proposals could be a real game changer for 2 of our biggest databases. We hope that Postgres 17 will contain those improvements. Kind regards, Benoit [1] - https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4972955#gistcomment-4972955 __ Benoit Tigeot Le jeu. 7 mars 2024 à 15:36, Peter Geoghegan a écrit : > On Tue, Jan 23, 2024 at 3:22 PM Peter Geoghegan wrote: > > I could include something less verbose, mentioning a theoretical risk > > to out-of-core amcanorder routines that coevolved with nbtree, > > inherited the same SAOP limitations, and then never got the same set > > of fixes. > > Attached is v11, which now says something like that in the commit > message. Other changes: > > * Fixed buggy sorting of arrays using cross-type ORDER procs, by > recognizing that we need to consistently use same-type ORDER procs for > sorting and merging the arrays during array preprocessing. > > Obviously, when we sort, we compare array elements to other array > elements (all of the same type). This is true independent of whether > the query itself happens to use a cross type operator/ORDER proc, so > we will need to do two separate ORDER proc lookups in cross-type > scenarios. > > * No longer subscript the ORDER proc used for array binary searches > using a scankey subscript. Now there is an additional indirection that > works even in the presence of multiple redundant scan keys that cannot > be detected as such due to a lack of appropriate cross-type support > within an opfamily. > > This was subtly buggy before now. Requires a little more coordination > between array preprocessing and standard/primitive index scan > preprocessing, which isn't ideal but seems unavoidable. > > * Lots of code polishing, especially within _bt_advance_array_keys(). > > While _bt_advance_array_keys() still works in pretty much exactly the > same way as it did back in v10, there are now better comments. > Including something about why its recursive call to itself is > guaranteed to use a low, fixed amount of stack space, verified using > an assertion. That addresses a concern held by Matthias. > > Outlook > === > > This patch is approaching being committable now. Current plan is to > commit this within the next few weeks. > > All that really remains now is to research how we might integrate this > work with the recently added continuescanPrechecked/haveFirstMatch > stuff from Alexander Korotkov, if at all. I've put that off until now > because it isn't particularly fundamental to what I'm doing here, and > seems optional. > > I would also like to do more performance validation. Things like the > parallel index scan code could stand to be revisited once again. Plus > I should think about the overhead of array preprocessing when > btrescan() is called many times, from a nested loop join -- I should > have something to say about that concern (raised by Heikki at one > point) before too long. > > -- > Peter Geoghegan >
Re: RangeTblEntry.inh vs. RTE_SUBQUERY
On 03.03.24 11:02, Peter Eisentraut wrote: On 29.02.24 19:14, Tom Lane wrote: Peter Eisentraut writes: In nodes/parsenodes.h, it says both This *must* be false for RTEs other than RTE_RELATION entries. Well, that's true in the parser ... and also puts it under Fields valid in all RTEs: which are both wrong on opposite ends of the spectrum. I think it would make more sense to group inh under "Fields valid for a plain relation RTE" and then explain the exception for subqueries, like it is done for several other fields. Dunno. The adjacent "lateral" field is also used for only selected RTE kinds. The section is /* * Fields valid in all RTEs: */ Alias *alias; /* user-written alias clause, if any */ Alias *eref; /* expanded reference names */ bool lateral; /* subquery, function, or values is LATERAL? */ bool inh; /* inheritance requested? */ bool inFromCl; /* present in FROM clause? */ List *securityQuals; /* security barrier quals to apply, if any */ According to my testing, lateral is used for RTE_RELATION, RTE_SUBQUERY, RTE_FUNCTION, RTE_TABLEFUNC, RTE_VALUES, which is 5 out of 9 possible. So I think it might be okay to relabel that section (in actuality or mentally) as "valid in several/many/most RTEs". But I'm not sure what reason there would be for having inh there, which is better described as "valid for RTE_RELATION, but also borrowed by RTE_SUBQUERY", which is pretty much exactly what is the case for relid, relkind, etc. I have committed the patches for this discussion. Related discussion will continue at https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca...@eisentraut.org / https://commitfest.postgresql.org/47/4697/ .
Re: improve ssl error code, 2147483650
On 07/03/2024 02:12, David Zhang wrote: The SSL_CTX_load_verify_locations function in OpenSSL will return NULL if there is a system error, such as "No such file or directory" in this case: const char *ERR_reason_error_string(unsigned long e) { ERR_STRING_DATA d, *p = NULL; unsigned long l, r; if (!RUN_ONCE(&err_string_init, do_err_strings_init)) { return NULL; } /* * ERR_reason_error_string() can't safely return system error strings, * since openssl_strerror_r() needs a buffer for thread safety, and we * haven't got one that would serve any sensible purpose. */ if (ERR_SYSTEM_ERROR(e)) return NULL; That's pretty unfortunate. As typical with OpenSSL, this stuff is not very well documented, but I think we could do something like this in SSLerrmessage(): if (ERR_SYSTEM_ERROR(e)) errreason = strerror(ERR_GET_REASON(e)); ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only documentation I could find was in this one obscure place in the man pages: https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. But as a best-effort thing, it would still be better than "SSL error code 2147483650". It would be better to perform a simple SSL file check before passing the SSL file to OpenSSL APIs so that the system error can be captured and a meaningful message provided to the end user. That feels pretty ugly. I agree it would catch most of the common mistakes in practice, so maybe we should just hold our noses and do it anyway, if the above ERR_SYSTEM_ERROR() method doesn't work. It's sad that we cannot pass a file descriptor or in-memory copy of the file contents to those functions. -- Heikki Linnakangas Neon (https://neon.tech)
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. fairywren is complaining another build failure: [1931/2156] "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" FAILED: src/test/modules/test_radixtree/test_radixtree.dll "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0): undefined reference to `pg_popcount64' It looks like it requires a link with pgport_srv but I'm not sure. It seems that the recent commit 1f1d73a8b breaks CI, Windows - Server 2019, VS 2019 - Meson & ninja, too. Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Function and Procedure with same signature?
Hannu Krosing writes: > On Sat, Feb 10, 2024 at 12:38 AM Tom Lane wrote: >> Worth noting perhaps that this is actually required by the SQL >> standard: per spec, functions and procedures are both "routines" >> and share the same namespace, > Can you point me to a place in the standard where it requires all > kinds of ROUTINES to be using the same namespace ? [ digs around a bit... ] Well, the spec is vaguer about this than I thought. It is clear on one thing: 11.60 conformance rules include 2) Without Feature T341, “Overloading of SQL-invoked functions and SQL-invoked procedures”, conforming SQL language shall not contain a in which the schema identified by the explicit or implicit of the includes a routine descriptor whose routine name is . ("" means a CREATE FUNCTION or CREATE PROCEDURE statement.) That is, basic SQL doesn't allow you to have two routines with the same qualified name at all, whether they have different types and parameter lists or not. Now the above text is the entire definition of T341, and it doesn't say just what an implementation that claims feature T341 is expected to allow. Looking through the rest of 11.60, we find 9) u) The schema identified by the explicit or implicit of the shall not include a routine descriptor whose specific name is equivalent to or a user-defined type descriptor that includes a method specification descriptor whose specific name is equivalent to . which evidently is describing the base case (with no mention of T341). We also find 20) Case: a) If R is an SQL-invoked procedure, then S shall not include another SQL-invoked procedure whose is equivalent to RN and whose number of SQL parameters is PN. which is a weird halfway measure indeed, and there's no acknowledgement that this contradicts 9u. The other arm of the case is pretty impenetrable prose, but what it appears to be saying is that you can't create two functions of the same name and same number of parameters if that would create any call-time ambiguity, that is that a call could be written that might refer to either. So I guess these paras are meant to explain what should happen if T341 is implemented, but it's far from clear, and certainly their restrictions on overloading are much stronger than what we allow. If you look in 10.6 (which is what is referenced by 11.62 ) you find 4) If specifies ROUTINE, then there shall be exactly one SQL-invoked routine in the schema identified by SCN whose is RN such that, for all i, 1 (one) ≤ i ≤ the number of arguments, when the Syntax Rules of Subclause 9.25, “Data type identity”, are applied with the declared type of its i-th SQL parameter as TYPE1 and the i-th in the of MN as TYPE2, those Syntax Rules are satisfied. The identifies that SQL-invoked routine. It could be argued that this doesn't prohibit having both a function and a procedure with the same data type list, only that you can't write ROUTINE when trying to drop or alter one. But I think that's just motivated reasoning. The paragraphs for being FUNCTION or PROCEDURE are exactly like the above except they say "exactly one function" or "exactly one procedure". If you argue that this text means we must allow functions and procedures with the same parameter lists, then you are also arguing that we must allow multiple functions with the same parameter lists, and it's just the user's tough luck if they need to drop one of them. 
A related point is that our tolerance for overloading routine names isn't unlimited: we don't allow duplicate names with the same list of input parameters, even if the output parameters are different. This is not something that the SQL spec cares to address, but there are good ambiguity-avoidance reasons for it. I think limiting overloading so that a ROUTINE specification is unambiguous is also a good thing. I remain unexcited about changing our definition of this. "Oracle allows it" is not something that has any weight in my view: they have made a bunch of bad design decisions as well as good ones, and I think this is a bad one. regards, tom lane
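As a concrete illustration of the shared-namespace rule and the ambiguity overloading would create (object names are hypothetical):
```
CREATE FUNCTION do_work(x int) RETURNS int
    LANGUAGE sql AS 'SELECT $1';

-- Today this fails with an "already exists with same argument types"
-- error, because functions and procedures share one namespace in pg_proc:
CREATE PROCEDURE do_work(x int)
    LANGUAGE plpgsql AS $$ BEGIN NULL; END $$;

-- If both were allowed to coexist, a ROUTINE reference like this one
-- would become ambiguous:
DROP ROUTINE do_work(int);
```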
Re: un-revert the MAINTAIN privilege and the pg_maintain predefined role
On Tue, Mar 05, 2024 at 10:12:35AM -0600, Nathan Bossart wrote: > Thanks to Jeff's recent work with commits 2af07e2 and 59825d1, the issue > that led to the revert of the MAINTAIN privilege and the pg_maintain > predefined role (commit 151c22d) should now be resolved. Specifically, > there was a concern that roles with the MAINTAIN privilege could use > search_path tricks to run arbitrary code as the table owner. Jeff's work > prevents this by restricting search_path to a known safe value when running > maintenance commands. (This approach and others were discussed on the > lists quite extensively, and it was also brought up at the developer > meeting at FOSDEM [0] earlier this year.) > > Given this, I'd like to finally propose un-reverting MAINTAIN and > pg_maintain. I created a commitfest entry for this [1] a few weeks ago and > attached it to Jeff's search_path thread, but I figured it would be good to > create a dedicated thread for this, too. The attached patch is a straight > revert of commit 151c22d except for the following small changes: > > * The catversion bump has been removed for now. The catversion will need > to be bumped appropriately if/when this is committed. > > * The OID for the pg_maintain predefined role needed to be changed. The > original OID has been reused for something else since this feature was > reverted. > > * The change in AdjustUpgrade.pm needed to be updated to check for > "$old_version < 17" instead of "$old_version < 16". Given all of this code was previously reviewed and committed, I am planning to forge ahead and commit this early next week, provided no objections or additional feedback materialize. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
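For readers catching up on the thread, the user-facing surface being un-reverted is unchanged from the original commit (a sketch; the role and table names are hypothetical):
```
-- Allow a role to run VACUUM, ANALYZE, REINDEX, REFRESH MATERIALIZED VIEW,
-- CLUSTER, and LOCK TABLE on a single table:
GRANT MAINTAIN ON TABLE measurements TO maintenance_role;

-- Or grant the predefined role for those operations on all relations:
GRANT pg_maintain TO maintenance_role;
```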
Re: table inheritance versus column compression and storage settings
Hi Peter and Tom, > On Tue, Feb 20, 2024 at 3:51 PM Peter Eisentraut > wrote: > > > > > > I have reverted the patch for now (and re-opened the commitfest entry). > > > We should continue to work on this and see if we can at least try to get > > > the pg_dump test coverage suitable. > > > > > > The pg_dump problems arise because we throw an error when parents have conflicting compression and storage properties. The patch that got reverted changed this slightly by allowing a child to override its parents' properties even when they conflict. It still threw an error when the child didn't override and the parents conflicted. I guess MergeAttributes() raises an error when it encounters parents with conflicting properties because it cannot decide which of the conflicting properties the child should inherit. Instead, it could simply apply the DEFAULT properties when the parent properties conflict and the child doesn't override them. Thus, when compression conflicts, the child's compression would be set to the default, and when storage conflicts, it would be set to the type's default storage. The child's properties, when specified explicitly, would always override. This would solve all the pg_dump bugs we saw with the reverted patch, as well as the existing bug I reported earlier. This change would break backward compatibility, but I don't think anybody relies on an error being thrown when parent properties conflict. What do you think? -- Best Wishes, Ashutosh Bapat
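To make the proposal concrete, a sketch of the case in question (compression shown; storage behaves analogously, and lz4 assumes a build configured --with-lz4):
```
CREATE TABLE parent1 (val text COMPRESSION pglz);
CREATE TABLE parent2 (val text COMPRESSION lz4);

-- Today this errors out because the parents' compression methods conflict.
-- Under the proposal, it would succeed and child.val would simply get the
-- default compression method:
CREATE TABLE child () INHERITS (parent1, parent2);

-- An explicit specification on the child would always win, as with the
-- reverted patch:
CREATE TABLE child2 (val text COMPRESSION lz4) INHERITS (parent1, parent2);
```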
Re: autovectorize page checksum code included elsewhere
I don't think anything discussed in this thread is ready for v17, so I am going to punt it to v18. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] LockAcquireExtended improvement
On Thu, Feb 8, 2024 at 5:28 AM Jingxian Li wrote: > Based on your comments above, I improve the commit message and comment for > InsertSelfIntoWaitQueue in new patch. Well, I had a look at this patch today, and even after reading the new commit message, I couldn't really convince myself that it was correct. It may well be entirely correct, but I simply find it hard to tell. It would help if the comments had been adjusted a bit more, e.g. /* Skip the wait and just grant myself the lock. */ - GrantLock(lock, proclock, lockmode); - GrantAwaitedLock(); return PROC_WAIT_STATUS_OK; Surely this is not an acceptable change. The comments says "and just grant myself the lock" but the code no longer does that. But instead of just complaining, I decided to try writing a version of the patch that seemed acceptable to me. Here it is. I took a different approach than you. Instead of splitting up ProcSleep(), I just passed down the dontWait flag to WaitOnLock() and ProcSleep(). In LockAcquireExtended(), I moved the existing code that handles giving up in the don't-wait case from before the call to ProcSleep() to afterward. As far as I can see, the major way this could be wrong is if calling ProcSleep() with dontWait = true and having it fail to acquire the lock changes the state in some way that makes the cleanup code that I moved incorrect. I don't *think* that's the case, but I might be wrong. What do you think of this version? -- Robert Haas EDB: http://www.enterprisedb.com v3-0001-Allow-a-no-wait-lock-acquisition-to-succeed-in-mo.patch Description: Binary data
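For context, the dontWait path being reworked here is (among other callers) what backs user-visible lock requests that must not block, for example:
```
-- Both of these ultimately ask LockAcquireExtended() not to sleep if the
-- lock cannot be granted immediately (table name is hypothetical):
LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE NOWAIT;

SELECT * FROM accounts WHERE id = 42 FOR UPDATE NOWAIT;
```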
Re: speed up a logical replica setup
Hi, I decided to take a quick look on this patch today, to see how it works and do some simple tests. I've only started to get familiar with it, so I have only some comments / questions regarding usage, not on the code. It's quite possible I didn't understand some finer points, or maybe it was already discussed earlier in this very long thread, so please feel free to push back or point me to the past discussion. Also, some of this is rather opinionated, but considering I didn't see this patch before, my opinions may easily be wrong ... 1) SGML docs It seems the SGML docs are more about explaining how this works on the inside, rather than how to use the tool. Maybe that's intentional, but as someone who didn't work with pg_createsubscriber before I found it confusing and not very helpful. For example, the first half of the page is prerequisities+warning, and sure those are useful details, but prerequisities are checked by the tool (so I can't really miss this) and warnings go into a lot of details about different places where things may go wrong. Sure, worth knowing and including in the docs, but maybe not right at the beginning, before I learn how to even run the tool? Maybe that's just me, though. Also, I'm sure it's not the only part of our docs like this. Perhaps it'd be good to reorganize the content a bit to make the "how to use" stuff more prominent? 2) this is a bit vague ... pg_createsubscriber will check a few times if the connection has been reestablished to stream the required WAL. After a few attempts, it terminates with an error. What does "a few times" mean, and how many is "a few attempts"? Seems worth knowing when using this tool in environments where disconnections can happen. Maybe this should be configurable? 3) publication info For a while I was quite confused about which tables get replicated, until I realized the publication is FOR ALL TABLES. But I learned that from this thread, the docs say nothing about this. Surely that's an important detail that should be mentioned? 4) Is FOR ALL TABLES a good idea? I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm sure it won't work for a number of use cases. I know large databases it's common to create "work tables" (not necessarily temporary) as part of a batch job, but there's no need to replicate those tables. AFAIK that'd break this FOR ALL TABLES publication, because the tables will qualify for replication, but won't be present on the subscriber. Or did I miss something? I do understand that FOR ALL TABLES is the simplest approach, and for v1 it may be an acceptable limitation, but maybe it'd be good to also support restricting which tables should be replicated (e.g. blacklist or whitelist based on table/schema name?). BTW if I'm right and creating a table breaks the subscriber creation, maybe it'd be good to explicitly mention that in the docs. Note: I now realize this might fall under the warning about DDL, which says this: Executing DDL commands on the source server while running pg_createsubscriber is not recommended. If the target server has already been converted to logical replica, the DDL commands must not be replicated so an error would occur. But I find this confusing. Surely there are many DDL commands that have absolutely no impact on logical replication (like creating an index or view, various ALTER TABLE flavors, and so on). And running such DDL certainly does not trigger error, right? 
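For reference on point 4: the publication syntax needed for such a whitelist already exists in core; the open question is only whether pg_createsubscriber should expose it. A sketch, with hypothetical schema and table names:
```
-- Replicate only a known application schema instead of FOR ALL TABLES:
CREATE PUBLICATION createsub_pub FOR TABLES IN SCHEMA app;

-- Or, alternatively, an explicit whitelist of tables:
CREATE PUBLICATION createsub_pub FOR TABLE app.orders, app.customers;
```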
5) slot / publication / subscription name I find it somewhat annoying it's not possible to specify names for objects created by the tool - replication slots, publication and subscriptions. If this is meant to be a replica running for a while, after a while I'll have no idea what pg_createsubscriber_569853 or pg_createsubscriber_459548_2348239 was meant for. This is particularly annoying because renaming these objects later is either not supported at all (e.g. for replication slots), or may be quite difficult (e.g. publications). I do realize there are challenges with custom names (say, if there are multiple databases to replicate), but can't we support some simple formatting with basic placeholders? So we could specify --slot-name "myslot_%d_%p" or something like that? BTW what will happen if we convert multiple standbys? Can't they all get the same slot name (they all have the same database OID, and I'm not sure how much entropy the PID has)? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
DOCS: add helpful partitioning links
This patch adds a link to the "attach partition" command section (similar to the detach partition link above it) as well as a link to "create table like" as both commands contain additional information that users should review beyond what is laid out in this section. There's also a couple of wordsmiths in nearby areas to improve readability. Robert Treat https://xzilla.net improve-partition-links.patch Description: Binary data
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Quite an interesting patch, in my opinion. I've decided to work on it a bit: I did some refactoring (sorry) and added basic tests. I've also tried to take into account, as far as possible, the notes on the patch from Cédric Villemain. > and maybe better to go with FlushOneBuffer() ? It's a good idea, but I'm not sure at the moment; I'll try to dig deeper into it. At the very least, FlushOneBuffer() does not work for local buffers, so we have to decide whether pg_buffercache_invalidate should or should not work for local buffers. For now, I don't see why it should not. Maybe I'm missing something? -- Best regards, Maxim Orlov. v4-0001-Invalidate-Buffer-By-Bufnum.patch Description: Binary data
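A usage sketch, assuming the signature from the patch (one buffer id in, a boolean out indicating whether the buffer was actually evicted); the table name is hypothetical:
```
-- Evict every shared buffer currently caching blocks of my_table:
SELECT bufferid, pg_buffercache_invalidate(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('my_table');
```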
Re: improve ssl error code, 2147483650
Greetings, * Heikki Linnakangas (hlinn...@iki.fi) wrote: > On 07/03/2024 02:12, David Zhang wrote: > > The SSL_CTX_load_verify_locations function in OpenSSL will return NULL > > if there is a system error, such as "No such file or directory" in this > > case: > > > > const char *ERR_reason_error_string(unsigned long e) > > { > > ERR_STRING_DATA d, *p = NULL; > > unsigned long l, r; > > > > if (!RUN_ONCE(&err_string_init, do_err_strings_init)) { > > return NULL; > > } > > > > /* > > * ERR_reason_error_string() can't safely return system error strings, > > * since openssl_strerror_r() needs a buffer for thread safety, and we > > * haven't got one that would serve any sensible purpose. > > */ > > if (ERR_SYSTEM_ERROR(e)) > > return NULL; > > That's pretty unfortunate. As typical with OpenSSL, this stuff is not very > well documented, but I think we could do something like this in > SSLerrmessage(): > > if (ERR_SYSTEM_ERROR(e)) > errreason = strerror(ERR_GET_REASON(e)); > > ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only > documentation I could find was in this one obscure place in the man pages: > https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. > But as a best-effort thing, it would still be better than "SSL error code > 2147483650". Agreed that it doesn't seem well documented. I was trying to figure out what the 'right' answer here was myself and not having much success. If the above works, then +1 to that. > > It would be better to perform a simple SSL file check before passing the > > SSL file to OpenSSL APIs so that the system error can be captured and a > > meaningful message provided to the end user. > > That feels pretty ugly. I agree it would catch most of the common mistakes > in practice, so maybe we should just hold our noses and do it anyway, if the > above ERR_SYSTEM_ERROR() method doesn't work. Yeah, seems better to try and handle this the OpenSSL way ... if that's possible to do. > It's sad that we cannot pass a file descriptor or in-memory copy of the file > contents to those functions. Agreed. Thanks! Stephen signature.asc Description: PGP signature
Re: 035_standby_logical_decoding unbounded hang
On Thu, Mar 07, 2024 at 02:46:55PM +0500, Andrey M. Borodin wrote: > I’m not sure if it is connected, but so far many patches in CFbot keep > hanging in this test. For example [0]. > [0] https://cirrus-ci.com/task/5911176840740864?logs=check_world#L292 Relevant part: [22:03:10.292] stderr: [22:03:10.292] # poll_query_until timed out executing this query: [22:03:10.292] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0 [22:03:10.292] # expecting this output: [22:03:10.292] # t [22:03:10.292] # last actual query output: [22:03:10.292] # f [22:03:10.292] # with stderr: [22:03:10.292] # Tests were run but no plan was declared and done_testing() was not seen. [22:03:10.292] # Looks like your test exited with 255 just after 57. The injection_activeslot test got added after $SUBJECT, in 08a52ab151 (2024-03-06). It's now reverted in 65db0cfb4c (2024-03-07).
Re: Add last_commit_lsn to pg_stat_database
I've previously noted in "Add last commit LSN to pg_last_committed_xact()" [1] that it's not possible to monitor how many bytes of WAL a logical replication slot is behind (computing that is trivial for physical slots), because the slot doesn't need to replicate beyond the last commit. In some cases it's possible for the current WAL location to be far beyond the last commit. A few examples: - An idle server with checkpoint_timeout at a lower value (so lots of empty WAL advance). - Very large transactions: particularly insidious because committing a 1 GB transaction after a small transaction may show almost zero time lag even though quite a bit of data needs to be processed and sent over the wire (so time to replay is significantly different from current lag). - A cluster with multiple databases complicates matters further, because while physical replication is cluster-wide, the LSNs that matter for logical replication are database specific. Since we don't expose the most recent commit's LSN there's no way to say "the WAL is currently 1250, the last commit happened at 1000, the slot has flushed up to 800, therefore there are at most 200 bytes replication needs to read through to catch up". I'm not sure I fully understand the problem. What are you doing currently to measure the lag? If you look at pg_replication_slots today, confirmed_flush_lsn advances also when you do pg_switch_wal(), so looking at the diff between confirmed_flush_lsn and pg_current_wal_lsn() works, no? And on the other hand, even if you expose the database's last commit LSN, you can have a publication that includes only a subset of tables. Or commits that don't write to any table at all. So I'm not sure why the database's last commit LSN is relevant. Getting the last LSN that did something that needs to be replicated through the publication might be useful, but that's not what this patch does. -- Heikki Linnakangas Neon (https://neon.tech)
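To spell out the query Heikki is suggesting, using only existing functions and catalog columns:
```
-- Byte-based upper bound on how far each logical slot is behind the
-- current WAL insert position:
SELECT slot_name,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS bytes_behind
FROM pg_replication_slots
WHERE slot_type = 'logical';
```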
Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).
On 2024-Mar-07, Alvaro Herrera wrote: > Maybe we can add a flag RelationData->rd_ispkdeferred, so that > RelationGetPrimaryKeyIndex returned InvalidOid for deferrable PKs; then > logical replication would continue to not know about this PK, which > perhaps is what we want. I'll do some testing with this. This seems to work okay. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ “Cuando no hay humildad las personas se degradan” (A. Christie) >From 8e3f913ca6a5bdbdbccac8d750ee9a6b6889f9aa Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Thu, 7 Mar 2024 16:44:35 +0100 Subject: [PATCH] Admit deferrable PKs into rd_pkindex, but flag them ... and don't return them as replica identity. --- src/backend/utils/cache/relcache.c| 26 ++- src/include/utils/rel.h | 1 + src/test/regress/expected/publication.out | 10 +++ .../regress/expected/replica_identity.out | 5 src/test/regress/sql/publication.sql | 9 +++ src/test/regress/sql/replica_identity.sql | 4 +++ 6 files changed, 48 insertions(+), 7 deletions(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 4d339ee795..0460796cf5 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -4761,6 +4761,7 @@ RelationGetIndexList(Relation relation) char replident = relation->rd_rel->relreplident; Oid pkeyIndex = InvalidOid; Oid candidateIndex = InvalidOid; + bool pkdeferred = false; MemoryContext oldcxt; /* Quick exit if we already computed the list. */ @@ -4802,12 +4803,12 @@ RelationGetIndexList(Relation relation) result = lappend_oid(result, index->indexrelid); /* - * Non-unique, non-immediate or predicate indexes aren't interesting - * for either oid indexes or replication identity indexes, so don't - * check them. + * Non-unique or predicate indexes aren't interesting for either oid + * indexes or replication identity indexes, so don't check them. + * Deferred ones are not useful for replication identity either; but + * we do include them if they are PKs. */ if (!index->indisunique || - !index->indimmediate || !heap_attisnull(htup, Anum_pg_index_indpred, NULL)) continue; @@ -4832,7 +4833,13 @@ RelationGetIndexList(Relation relation) if (index->indisprimary && (index->indisvalid || relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)) + { pkeyIndex = index->indexrelid; + pkdeferred = !index->indimmediate; + } + + if (!index->indimmediate) + continue; if (!index->indisvalid) continue; @@ -4854,7 +4861,8 @@ RelationGetIndexList(Relation relation) oldlist = relation->rd_indexlist; relation->rd_indexlist = list_copy(result); relation->rd_pkindex = pkeyIndex; - if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex)) + relation->rd_ispkdeferred = pkdeferred; + if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex) && !pkdeferred) relation->rd_replidindex = pkeyIndex; else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex)) relation->rd_replidindex = candidateIndex; @@ -4957,7 +4965,8 @@ RelationGetStatExtList(Relation relation) /* * RelationGetPrimaryKeyIndex -- get OID of the relation's primary key index * - * Returns InvalidOid if there is no such index. + * Returns InvalidOid if there is no such index, or if the primary key is + * DEFERRABLE. */ Oid RelationGetPrimaryKeyIndex(Relation relation) @@ -4972,7 +4981,7 @@ RelationGetPrimaryKeyIndex(Relation relation) Assert(relation->rd_indexvalid); } - return relation->rd_pkindex; + return relation->rd_ispkdeferred ? 
InvalidOid : relation->rd_pkindex; } /* @@ -5198,6 +5207,9 @@ RelationGetIndexPredicate(Relation relation) * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that * we can include system attributes (e.g., OID) in the bitmap representation. * + * Deferred indexes are considered for the primary key, but not for replica + * identity. + * * Caller had better hold at least RowExclusiveLock on the target relation * to ensure it is safe (deadlock-free) for us to take locks on the relation's * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index b3ea2b2042..51f876b5b5 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -63,6 +63,7 @@ typedef struct RelationData bool rd_isvalid; /* relcache entry is valid */ bool rd_indexvalid; /* is rd_indexlist valid? (also rd_pkindex and * rd_replidindex) */ + bool rd_ispkdeferred; bool rd_statvalid; /* is rd_statlist valid? */ /*-- diff --git a/src/test/regress/expected/publication.out b/src/test/regress/expected/publication.out index 16361a91f9..0c5521d2aa 100644 --- a/src/test/regress/expected/publication.out +++ b/src/test/regress/expected/p
Re: Popcount optimization using AVX512
On Tue, Mar 05, 2024 at 04:52:23PM +, Amonson, Paul D wrote: > Noted I will try to do the "reduced" bottom-posting. I might slip up > occasionally because it's an Intel habit. No worries. > Is there a way to make Outlook do the leading ">" in a reply for the > previous message? I do not know, sorry. I personally use mutt for the lists. > BTW: Created the commit-fest submission. Thanks. I intend to provide a more detailed review shortly, as I am aiming to get this one committed for v17. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: improve ssl error code, 2147483650
Stephen Frost writes: > * Heikki Linnakangas (hlinn...@iki.fi) wrote: >> That's pretty unfortunate. As typical with OpenSSL, this stuff is not very >> well documented, but I think we could do something like this in >> SSLerrmessage(): >> >> if (ERR_SYSTEM_ERROR(e)) >> errreason = strerror(ERR_GET_REASON(e)); >> >> ERR_SYSTEM_ERROR only exists in OpenSSL 3.0 and above, and the only >> documentation I could find was in this one obscure place in the man pages: >> https://www.openssl.org/docs/man3.2/man3/BIO_dgram_get_local_addr_enable.html. >> But as a best-effort thing, it would still be better than "SSL error code >> 2147483650". > Agreed that it doesn't seem well documented. I was trying to figure out > what the 'right' answer here was myself and not having much success. If > the above works, then +1 to that. My reaction as well --- I was just gearing up to test this idea, unless one of you are already on it? regards, tom lane
Re: Potential stack overflow in incremental base backup
On Wed, Mar 6, 2024 at 12:44 AM Thomas Munro wrote: > I'll have to move that sucker onto the heap (we banned C99 variable > length arrays and we don't use nonstandard alloca()), but I think the > coding in master is already a bit dodgy: that's 131072 * > sizeof(BlockNumber) = 512kB with default configure options, but > --with-segsize X could create a stack variable up to 16GB, > corresponding to segment size 32TB (meaning no segmentation at all). > That wouldn't work. Shouldn't we move it to the heap? See attached > draft patch. > > Even on the heap, 16GB is too much to assume we can allocate during a > base backup. I don't claim that's a real-world problem for > incremental backup right now in master, because I don't have any > evidence that anyone ever really uses --with-segsize (do they?), but > if we make it an initdb option it will be more popular and this will > become a problem. Hmm. I don't see a problem with moving it from the stack to the heap. I don't believe I had any particular reason for wanting it on the stack specifically. I'm not sure what to do about really large segment sizes. As long as the allocation here is < ~100MB, it's probably fine, both because (1) 100MB isn't an enormously large allocation these days and (2) if you have a good reason to increase the segment size by a factor of 256 or more, you're probably running on a big machine, and then you should definitely have 100MB to burn. However, a 32TB segment size is another thing altogether. We could avoid transposing relative block numbers to absolute block numbers whenever start_blkno is 0, but that doesn't do a whole lot when the segment size is 8TB or 16TB rather than 32TB; plus, in such cases, the relative block number array is also going to be gigantic. Maybe what we need to do is figure out how to dynamically size the arrays in such cases, so that you only make them as big as required for the file you actually have, rather than as big as the file could theoretically be. But I think that's really only necessary if we're actually going to get rid of the idea of segmented relations altogether, which I don't think is happening at least for v17, and maybe not ever. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Popcount optimization using AVX512
On 2024-Mar-04, Amonson, Paul D wrote: > > -#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62)) > > +#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31)) > > > > IME this means that the autoconf you are using has been patched. A > > quick search on the mailing lists seems to indicate that it might be > > specific to Debian [1]. > > I am not sure what the ask is here? I made changes to the > configure.ac and ran autoconf2.69 to get builds to succeed. Do you > have a separate feedback here? So what happens here is that autoconf-2.69 as shipped by Debian contains some patches on top of the one released by GNU. We use the latter, so if you run Debian's, then the generated configure script will contain the differences coming from Debian's version. Really, I don't think this is very important as a review point, because if the configure.ac file is changed in the patch, it's best for the committer to run autoconf on their own, using a pristine GNU autoconf; the configure file in the submitted patch is not relevant, only configure.ac matters. What committers do (or should do) is keep an install of autoconf-2.69 straight from GNU. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Potential stack overflow in incremental base backup
On Wed, Mar 6, 2024 at 6:29 AM Alvaro Herrera wrote: > On 2024-Mar-06, Thomas Munro wrote: > > Even on the heap, 16GB is too much to assume we can allocate during a > > base backup. I don't claim that's a real-world problem for > > incremental backup right now in master, because I don't have any > > evidence that anyone ever really uses --with-segsize (do they?), but > > if we make it an initdb option it will be more popular and this will > > become a problem. Hmm. > > Would it work to use a radix tree from the patchset at > https://postgr.es/m/canwcazb43znrk03bzftnvrafhzngzh26sjc0ep-sj8+w20v...@mail.gmail.com > ? Probably not that much, because we actually send the array to the client very soon after we construct it: push_to_sink(sink, &checksum_ctx, &header_bytes_done, incremental_blocks, sizeof(BlockNumber) * num_incremental_blocks); This is hard to do without materializing the array somewhere, so I don't think an alternate representation is the way to go in this instance. -- Robert Haas EDB: http://www.enterprisedb.com
Re: why there is not VACUUM FULL CONCURRENTLY?
On 2024-Feb-16, Antonin Houska wrote: > BTW, I'm failing to understand why cluster_rel() has no argument of the > BufferAccessStrategy type. According to buffer/README, the criterion for using > a specific strategy is that the page "is unlikely to be needed again > soon". Specifically for cluster_rel(), the page will *definitely* not be used > again (unless the VACUUM FULL/CLUSTER command fails): BufferTag contains the > relation file number and the old relation file is eventually dropped. > > Am I missing anything? No, that's just an oversight. Access strategies are newer than that cluster code. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith) https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html
Re: Popcount optimization using AVX512
On Thu, Mar 07, 2024 at 06:53:12PM +0100, Alvaro Herrera wrote: > Really, I don't think this is very important as a review point, because > if the configure.ac file is changed in the patch, it's best for the > committer to run autoconf on their own, using a pristine GNU autoconf; > the configure file in the submitted patch is not relevant, only > configure.ac matters. Agreed. I didn't intend for this to be a major review point, and I apologize for the extra noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: improve ssl error code, 2147483650
David Zhang writes: > When configuring SSL on the Postgres server side with the following > information: > ssl = on > ssl_ca_file = 'root_ca.crt' > ssl_cert_file = 'server-cn-only.crt' > ssl_key_file = 'server-cn-only.key' > If a user makes a mistake, for example, accidentally using 'root_ca.crl' > instead of 'root_ca.crt', Postgres will report an error like the one below: > FATAL: could not load root certificate file "root_ca.crl": SSL error > code 2147483650 Interestingly, this works fine for me on RHEL8 (with openssl-1.1.1k): 2024-03-07 12:57:53.432 EST [547522] FATAL: F: could not load root certificate file "foo.bar": No such file or directory 2024-03-07 12:57:53.432 EST [547522] LOCATION: be_tls_init, be-secure-openssl.c:306 I do reproduce your problem on Fedora 39 with openssl-3.1.1. So this seems to be a regression on OpenSSL's part. Maybe they'll figure out how to fix it sometime; that seems to be another good argument for not pre-empting their error handling. regards, tom lane
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Hello! > I'm not a fan of this approach. Changing visibility and cleanup > semantics to only benefit R/CIC sounds like a pain to work with in > essentially all visibility-related code. I'd much rather have to deal > with another index AM, even if it takes more time: the changes in > semantics will be limited to a new plug in the index AM system and a > behaviour change in R/CIC, rather than behaviour that changes in all > visibility-checking code. Technically, this does not affect the visibility logic, only the cleanup semantics. All visibility-related code remains untouched. But yes, still an inelegant and a little strange-looking option. At the same time, perhaps it could be made more elegant - for example, by adding to ComputeXidHorizonsResult, as a first-class citizen, a list of blocks to clear for particular relations. > But regardless of second scan snapshots, I think we can worry about > that part at a later moment: The first scan phase is usually the most > expensive and takes the most time of all phases that hold snapshots, > and in the above discussion we agreed that we can already reduce the > time that a snapshot is held during that phase significantly. Sure, it > isn't great that we have to scan the table again with only a single > snapshot, but generally phase 2 doesn't have that much to do (except > when BRIN indexes are involved) so this is likely less of an issue. > And even if it is, we would still have reduced the number of > long-lived snapshots by half. Hmm, but it looks like we don't have the infrastructure to "update" the xmin that is propagated to the horizon after the first snapshot in a transaction is taken. One option I know of is to reuse the approach of d9d076222f5b94a85e0e318339cfc44b8f26022d [1]. But if this is the case, then there is no point in re-taking the snapshot during the first phase - just apply this "if" only for the first phase - and you're done. Do you know of a less hacky way? Or is this a reasonable way to go? [1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
Re: Add system identifier to backup manifest
On Thu, Mar 7, 2024 at 9:16 AM Robert Haas wrote: > It could. I just thought this was clearer. I agree that it's a bit > long, but I don't think this is worth bikeshedding very much. If at a > later time somebody feels strongly that it needs to be changed, so be > it. Right now, getting on with the business at hand is more important, > IMHO. Here's a new version of the patch set, rebased over my version of 0001 and with various other corrections: * Tidy up grammar in documentation. * In manifest_process_version, the test checked whether the manifest version == 1, but the comment talked about versions >= 2. Make the comment match the code. * In load_backup_manifest, avoid changing the existing placement of a variable declaration. * Rename verify_system_identifier to verify_control_file because if we were verifying multiple things about the control file we'd still want to only read it once. * Tweak the coding of verify_backup_file and verify_control_file to avoid repeated path construction. * Remove saw_system_identifier_field. This looks like it's trying to enforce a rule that the system identifier must immediately follow the version, but we don't insist on anything like that for files or wal ranges, so there seems to be no reason to do it here. * Remove bogus "unrecognized top-level field" test from 005_bad_manifest.pl. The JSON included here doesn't include any unrecognized top-level field, so the fact that we were getting that error message was wrong. After removing saw_system_identifier_field, we no longer get the wrong error message, so the test started failing. * Remove "expected system identifier" test from 005_bad_manifest.pl. This was basically a test that saw_system_identifier_field was working. * Header comment adjustment for json_manifest_finalize_system_identifier. The last sentence was cut-and-pasted from somewhere that it made sense to here, where it doesn't. There's only ever one system identifier. I plan to commit this, barring objections. -- Robert Haas EDB: http://www.enterprisedb.com v11-0001-Expose-new-function-get_controlfile_by_exact_pat.patch Description: Binary data v11-0002-Add-the-system-identifier-to-backup-manifests.patch Description: Binary data
Re: improve ssl error code, 2147483650
I wrote: > Stephen Frost writes: >> Agreed that it doesn't seem well documented. I was trying to figure out >> what the 'right' answer here was myself and not having much success. If >> the above works, then +1 to that. > My reaction as well --- I was just gearing up to test this idea, > unless one of you are already on it? I've confirmed that this: diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c index e12b1cc9e3..47eee4b59d 100644 --- a/src/backend/libpq/be-secure-openssl.c +++ b/src/backend/libpq/be-secure-openssl.c @@ -1363,6 +1363,10 @@ SSLerrmessage(unsigned long ecode) errreason = ERR_reason_error_string(ecode); if (errreason != NULL) return errreason; +#ifdef ERR_SYSTEM_ERROR + if (ERR_SYSTEM_ERROR(ecode)) + return strerror(ERR_GET_REASON(ecode)); +#endif snprintf(errbuf, sizeof(errbuf), _("SSL error code %lu"), ecode); return errbuf; } seems to be enough to fix the problem on OpenSSL 3.1.1. The #ifdef is needed to avoid compile failure against OpenSSL 1.1.1 --- but that version doesn't have the problem, so we don't need to sweat. This could probably do with a comment, and we need to propagate the fix into libpq's copy of the function too. Barring objections, I'll take care of that and push it later today. regards, tom lane
Re: improve ssl error code, 2147483650
> On 7 Mar 2024, at 20:58, Tom Lane wrote: > > I wrote: >> Stephen Frost writes: >>> Agreed that it doesn't seem well documented. I was trying to figure out >>> what the 'right' answer here was myself and not having much success. If >>> the above works, then +1 to that. > >> My reaction as well --- I was just gearing up to test this idea, >> unless one of you are already on it? > > I've confirmed that this: > > diff --git a/src/backend/libpq/be-secure-openssl.c > b/src/backend/libpq/be-secure-openssl.c > index e12b1cc9e3..47eee4b59d 100644 > --- a/src/backend/libpq/be-secure-openssl.c > +++ b/src/backend/libpq/be-secure-openssl.c > @@ -1363,6 +1363,10 @@ SSLerrmessage(unsigned long ecode) > errreason = ERR_reason_error_string(ecode); > if (errreason != NULL) > return errreason; > +#ifdef ERR_SYSTEM_ERROR > + if (ERR_SYSTEM_ERROR(ecode)) > + return strerror(ERR_GET_REASON(ecode)); > +#endif > snprintf(errbuf, sizeof(errbuf), _("SSL error code %lu"), ecode); > return errbuf; > } > > seems to be enough to fix the problem on OpenSSL 3.1.1. The #ifdef > is needed to avoid compile failure against OpenSSL 1.1.1 --- but that > version doesn't have the problem, so we don't need to sweat. This was introduced in OpenSSL 3.0.0 so that makes sense. Pre-3.0.0 versions truncate system error codes that fall outside the range 1..127, reserving the rest for OpenSSL-specific errors. To capture the full possible range of system errors, the code is no longer truncated and the ERR_SYSTEM_FLAG flag is set instead, which can be tested for with the macro used here. > This could probably do with a comment, and we need to propagate > the fix into libpq's copy of the function too. Barring objections, > I'll take care of that and push it later today. LGTM. -- Daniel Gustafsson
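To make the arithmetic concrete, here is a small standalone sketch (not part of the patch) that decodes the reported value using the same macros; it assumes OpenSSL 3.x headers and a platform where errno 2 is ENOENT, in which case it prints "No such file or directory".

/*
 * Standalone illustration of the decoding above: 2147483650 is
 * 0x80000002, i.e. ERR_SYSTEM_FLAG plus errno 2 (ENOENT on typical
 * systems).  Requires OpenSSL >= 3.0 for ERR_SYSTEM_ERROR().
 */
#include <stdio.h>
#include <string.h>
#include <openssl/err.h>

int
main(void)
{
    unsigned long ecode = 2147483650UL;    /* value from the report */

#ifdef ERR_SYSTEM_ERROR
    if (ERR_SYSTEM_ERROR(ecode))
    {
        printf("%s\n", strerror(ERR_GET_REASON(ecode)));
        return 0;
    }
#endif
    printf("SSL error code %lu\n", ecode);
    return 0;
}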
Re: POC, WIP: OR-clause support for indexes
Hi! On 07.03.2024 17:51, Alexander Korotkov wrote: Hi! On Tue, Mar 5, 2024 at 9:59 AM Andrei Lepikhov wrote: > On 5/3/2024 12:30, Andrei Lepikhov wrote: > > On 4/3/2024 09:26, jian he wrote: > ... and the new version of the patchset is attached. I made some revisions for the patchset. 1) Use hash_combine() to combine hash values. 2) Upper limit the number of array elements by MAX_SAOP_ARRAY_SIZE. 3) Better save the original order of clauses by putting hash entries and untransformable clauses to the same list. A lot of differences in regression tests output have gone. Thank you for your changes. I agree with them. One important issue I found. # create table t as (select i::int%100 i from generate_series(1,1) i); # analyze t; # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=200 width=4) Filter: (i = ANY ('{1,1}'::integer[])) (2 rows) # set enable_or_transformation = false; SET # explain select * from t where i = 1 or i = 1; QUERY PLAN - Seq Scan on t (cost=0.00..189.00 rows=100 width=4) Filter: (i = 1) (2 rows) We don't make array values unique. That might make query execution performance somewhat worse, and also makes selectivity estimation worse. I suggest Andrei and/or Alena should implement making array values unique. I have corrected this and some spelling mistakes. The unique_any_elements_change.no-cfbot file contains changes. While I was correcting the test results caused by such changes, I noticed that the same behavior was when converting the IN expression, and this can be seen in the result of the regression test: EXPLAIN (COSTS OFF) SELECT unique2 FROM onek2 WHERE stringu1 IN ('A', 'A') AND (stringu1 = 'A' OR stringu1 = 'A'); QUERY PLAN --- Bitmap Heap Scan on onek2 Recheck Cond: (stringu1 < 'B'::name) Filter: ((stringu1 = ANY ('{A,A}'::name[])) AND (stringu1 = 'A'::name)) -> Bitmap Index Scan on onek2_u2_prtl (4 rows) -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index baeb83e58b..e9813ef6ab 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -312,8 +312,8 @@ TransformOrExprToANY(ParseState *pstate, List *args, int location) if (unlikely(found)) { - entry->consts = lappend(entry->consts, const_expr); - entry->exprs = lappend(entry->exprs, orqual); + entry->consts = list_append_unique(entry->consts, const_expr); + entry->exprs = list_append_unique(entry->exprs, orqual); } else { @@ -352,7 +352,6 @@ TransformOrExprToANY(ParseState *pstate, List *args, int location) entry = (OrClauseGroupEntry *) lfirst(lc); Assert(list_length(entry->consts) > 0); - Assert(list_length(entry->exprs) == list_length(entry->consts)); if (list_length(entry->consts) == 1 || list_length(entry->consts) > MAX_SAOP_ARRAY_SIZE) diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index 7b721bac71..606a4399f9 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -1915,16 +1915,16 @@ SELECT count(*) FROM tenk1 EXPLAIN (COSTS OFF) SELECT count(*) FROM tenk1 WHERE hundred = 42 AND (thousand < 42 OR thousand < 99 OR 43 > thousand OR 42 > thousand); -QUERY PLAN --- + QUERY PLAN +--- Aggregate -> Bitmap Heap Scan on tenk1 - Recheck Cond: ((hundred = 42) AND (thousand < ANY ('{42,99,43,42}'::integer[]))) + Recheck Cond: ((hundred = 42) AND (thousand < ANY ('{42,99,43}'::integer[]))) -> BitmapAnd -> 
Bitmap Index Scan on tenk1_hundred Index Cond: (hundred = 42) -> Bitmap Index Scan on tenk1_thous_tenthous - Index Cond: (thousand < ANY ('{42,99,43,42}'::integer[])) + Index Cond: (thousand < ANY ('{42,99,43}'::integer[])) (8 rows) EXPLAIN (COSTS OFF) diff --git a/src/test/regress/expected/select.out b/src/test/regress/expected/select.out index 0ebaf002e8..f4f5493c43 100644 --- a/src/test/r
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:46 PM Matthias van de Meent wrote: > On Wed, 6 Mar 2024 at 01:50, Peter Geoghegan wrote: > > I think that there is supposed to be a closing parenthesis here? So > > "... (such as those described in ") might > > perform...". > > Correct. > > > FWM, if that's what you meant. > > WFM, yes? Then we're in agreement on this. > > At one point Heikki suggested that I just get rid of > > BTScanOpaqueData.arrayKeyData (which has been there for as long as > > nbtree had native support for SAOPs), and use > > BTScanOpaqueData.keyData exclusively instead. I've finally got around > > to doing that now. > > I'm not sure if it was worth the reduced churn when the changes for > the primary optimization are already 150+kB in size; every "small" > addition increases the context needed to review the patch, and it's > already quite complex. I agree that the patch is quite complex, especially relative to its size. > > Note that we no longer need to have an independent representation of > > so->qual_okay for array keys (the convention of setting > > so->numArrayKeys to -1 for unsatisfiable array keys is no longer > > required). There is no longer any need for a separate pass to carry > > over the contents of BTScanOpaqueData.arrayKeyData to > > BTScanOpaqueData.keyData, which was confusing. > > I wasn't very confused by it, but sure. > > > Are you still interested in working directly on the preprocessing > > stuff? > > If you mean my proposed change to merging two equality AOPs, then yes. > I'll try to fit it in tomorrow with the rest of the review. I didn't just mean that stuff. I was also suggesting that you could join the project directly (not just as a reviewer). If you're interested, you could do general work on the preprocessing of arrays. Fancier array-specific preprocessing. For example, something that can transform this: select * from foo where a in (1,2,3) and a < 3; Into this: select * from foo where a in (1,2); Or, something that can just set qual_okay=false, given a query such as: select * from foo where a in (1,2,3) and a < 5; This is clearly doable by reusing the binary search code during preprocessing. The first example transformation could work via a binary search for the constant "3", followed by a memmove() to shrink the array in-place (plus the inequality itself would need to be fully eliminated). The second example would work a little like the array merging thing that the patch has already, except that there'd only be one array involved (there wouldn't be a pair of arrays). > > select * from multi_test where a in (1,99, 182, 183, 184) and a < 183; > > > > Maybe we need to do better with that. What do you think? > > Let me come back to that when I'm done reviewing the final part of nbtutils. Even if this case doesn't matter (which I now doubt), it's probably easier to just get it right than it would be to figure out if we can live without it. > > void > > _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir) > > @@ -519,159 +888,1163 @@ _bt_start_array_keys(IndexScanDesc scan, > > ScanDirection dir) > > BTScanOpaque so = (BTScanOpaque) scan->opaque; > > inti; > > > > +Assert(so->numArrayKeys); > > +Assert(so->qual_ok); > > Has the requirement for a known scan direction been removed, or should > this still have an Assert(dir != NoMovementScanDirection)? I agree that such an assertion is worth having. Added that locally. > I'm not sure that comment is correct, at least it isn't as clear as > can be. Maybe something more in the line of the following? 
> +pstate.prechecked = false;/* prechecked didn't cover HIKEY */ I agree that that's a little better than what I had. > +++ b/src/backend/access/nbtree/nbtutils.c > > @@ -272,7 +319,32 @@ _bt_preprocess_array_keys(IndexScanDesc scan) > > +elemtype = cur->sk_subtype; > > +if (elemtype == InvalidOid) > > +elemtype = rel->rd_opcintype[cur->sk_attno - 1]; > > Should we Assert() that this elemtype matches the array element type > ARR_ELEMTYPE(arrayval) used to deconstruct the array? Yeah, good idea. > This is not "a" previous equality key, but "the" previous equality > operator array scan key. > Do we want to expend some additional cycles for detecting duplicate > equality array key types in interleaved order like =int[] =bigint[] > =int[]? I don't think it would be very expensive considering the > limited number of cross-type equality operators registered in default > PostgreSQL, so a simple loop that checks matching element types > starting at the first array key scankey for this attribute should > suffice. We could even sort the keys by element type if we wanted to > fix any O(n) issues for this type of behaviour (though this is > _extremely_ unlikely to become a performance issue). Yes, I think that we should probably have this (though likely wouldn't bother sorting the scan keys themselves). You'd need to look-up cross-type operators for this
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Wed, Mar 6, 2024 at 4:51 PM Matthias van de Meent wrote: > To clarify, what I mean here is that merging the changes of both the > SAOPs changes and the removal of arrayKeyData seems to increase the > patch size and increases the maximum complexity of each component > patch's review. Removing arrayKeyData probably makes the patch very slightly smaller, actually. But even if it's really the other way around, I'd still like to get rid of it as part of the same commit as everything else. It just makes sense that way. > Separate patches may make this more reviewable, or not, but no comment > was given on why it is better to merge the changes into a single > patch. Fair enough. Here's why: The original SAOP design (commit 9e8da0f7, "Teach btree to handle ScalarArrayOpExpr quals natively") added a layer of indirection between scan->keyData (input scan keys) and so->keyData (output scan keys): it added another scan key array, so->arrayKeyData. There was array-specific preprocessing in _bt_preprocess_array_keys, that happened before the first primitive index scan even began -- that transformed our true input scan keys (scan->keyData) into a copy of the array with limited amounts of array-specific preprocessing already performed (so->arrayKeyData). This made a certain amount of sense at the time, because _bt_preprocess_keys was intended to be called once per primitive index scan. Kind of like the inner side of a nested loop join's inner index scan, where we call _bt_preprocess_keys once per inner-side scan/btrescan call. (Actually, Tom's design has us call _bt_preprocess once per primitive index scan per btrescan call, which might matter in those rare cases where the inner side of a nestloop join had SAOP quals.) What I now propose to do is to just call _bt_preprocess_keys once per btrescan (actually, it's still called once per primitive index scan, but all calls after the first are just no-ops after v12 of the patch). This makes sense because SAOP array constants aren't like nestloop joins with an inner index scans, in one important way: we really can see everything up-front. We can see all of the array elements, and operate on whole arrays as necessary during preprocessing (e.g., performing the array merging thing I added to _bt_preprocess_array_keys). It's not like the next array element is only visible to prepocessing only after the outer side of a nestloop join runs, and next calls btrescan -- so why treat it like that? Conceptually, "WHERE a = 1" is almost the same thing as "WHERE a = any('{1,2}')", so why not use essentially the same approach to preprocessing in both cases? We don't need to copy the input keys into so->arrayKeyData, because the indirection (which is a bit like a fake nested loop join) doesn't buy us anything. v13 of the patch doesn't quite 100% eliminate so->arrayKeyData. While it removes the arrayKeyData field from the BTScanOpaqueData struct, we still use a temporary array (accessed via a pointer that's just a local variable) that's also called arrayKeyData. And that also stores array-preprocessing-only copies of the input scan keys. That happens within _bt_preprocess_keys. So we do "still need arrayKeyData", but we only need it for a brief period at the start of the index scan. It just doesn't make any sense to keep it around for longer than that, in a world where _bt_preprocess_keys "operates directly on arrays". That only made sense (a bit of sense) back when _bt_preprocess_keys was subordinate to _bt_preprocess_array_keys, but it's kinda the other way around now. 
We could probably even get rid of this remaining limited form of arrayKeyData, but that doesn't seem like it would add much. -- Peter Geoghegan
Re: Popcount optimization using AVX512
As promised... > +# Check for Intel AVX512 intrinsics to do POPCNT calculations. > +# > +PGAC_AVX512_POPCNT_INTRINSICS([]) > +if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f]) > +fi > +AC_SUBST(CFLAGS_AVX512_POPCNT) I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are other instructions required that need -mavx512f, then we might need to expand the test. > 13 files changed, 657 insertions(+), 119 deletions(-) I still think it's worth breaking this change into at least 2 patches. In particular, I think there's an opportunity to do the refactoring into pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These changes are likely straightforward, and getting them out of the way early would make it easier to focus on the more interesting changes. IMHO there are a lot of moving parts in this patch. > +#undef HAVE__GET_CPUID_COUNT > + > +/* Define to 1 if you have immintrin. */ > +#undef HAVE__IMMINTRIN Is this missing HAVE__CPUIDEX? > uint64 > -pg_popcount(const char *buf, int bytes) > +pg_popcount_slow(const char *buf, int bytes) > { > uint64 popcnt = 0; > > -#if SIZEOF_VOID_P >= 8 > +#if SIZEOF_VOID_P == 8 > /* Process in 64-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(8, buf)) > { > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes) > > buf = (const char *) words; > } > -#else > +#elif SIZEOF_VOID_P == 4 > /* Process in 32-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(4, buf)) > { Apologies for harping on this, but I'm still not seeing the need for these SIZEOF_VOID_P changes. While it's unlikely that this makes any practical difference, I see no reason to more strictly check SIZEOF_VOID_P here. > +/* Process any remaining bytes */ > +while (bytes--) > +popcnt += pg_number_of_ones[(unsigned char) *buf++]; > +return popcnt; > +#else > +return pg_popcount_slow(buf, bytes); > +#endif /* USE_AVX512_CODE */ nitpick: Could we call pg_popcount_slow() in a common section for these "remaining bytes?" > +#if defined(_MSC_VER) > +pg_popcount_indirect = pg_popcount512_fast; > +#else > +pg_popcount = pg_popcount512_fast; > +#endif These _MSC_VER sections are interesting. I'm assuming this is the workaround for the MSVC linking issue you mentioned above. I haven't looked too closely, but I wonder if the CRC32C code (see src/include/port/pg_crc32c.h) is doing something different to avoid this issue. Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned through this thread and didn't see any recent benchmark results for the latest form of the patch. I think it's worth verifying that we are still seeing the expected improvements. [0] https://postgr.es/m/202402071953.5c4z7t6kl7ts%40alvherre.pgsql -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
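For what it's worth, the dispatch-on-first-call pattern under discussion looks roughly like the sketch below; the names and the always-fallback "choose" step are placeholders, not the patch itself. A real version would probe CPUID/XGETBV and install an AVX-512 implementation when the hardware and OS support it.

/*
 * Rough sketch of function-pointer dispatch for pg_popcount(); names and
 * the trivial chooser are illustrative only.
 */
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t pg_popcount_choose(const char *buf, int bytes);
static uint64_t pg_popcount_slow(const char *buf, int bytes);

/* first call goes through the chooser, which installs the real target */
static popcount_fn pg_popcount_impl = pg_popcount_choose;

static uint64_t
pg_popcount_slow(const char *buf, int bytes)
{
    uint64_t    popcnt = 0;

    while (bytes--)
    {
        unsigned char c = (unsigned char) *buf++;

        while (c)               /* per-byte popcount, kept simple here */
        {
            popcnt += c & 1;
            c >>= 1;
        }
    }
    return popcnt;
}

static uint64_t
pg_popcount_choose(const char *buf, int bytes)
{
    /* placeholder: always fall back; real code checks CPU support here */
    pg_popcount_impl = pg_popcount_slow;
    return pg_popcount_impl(buf, bytes);
}

uint64_t
pg_popcount(const char *buf, int bytes)
{
    return pg_popcount_impl(buf, bytes);
}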
Re: postgres_fdw test timeouts
On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote: > So I would not say that it's a dominant failure for now, and given that > 04a09ee lives in master only, maybe we can save two commits (Revert + > Revert of revert) while moving to a more persistent solution. I just checked in on this one to see whether we needed to create an "open item" for v17. While I'm not seeing the failures anymore, the patch that Alexander claimed should fix it [0] doesn't appear to have been committed, either. Perhaps this was fixed by something else... [0] https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value
On Mon, Feb 05, 2024 at 04:28:23PM +0900, Yugo NAGATA wrote: > On Thu, 1 Feb 2024 17:59:56 +0800 > jian he wrote: >> v6 patch looks good. > > Thank you for your review and updating the status to RwC! I think this one needs a (pretty trivial) rebase. I spent a few minutes testing it out and looking at the code, and it seems generally reasonable to me. Do you think it's worth adding something like a pg_column_toast_num_chunks() function that returns the number of chunks for the TOASTed value, too? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I could not find any explanation of the following behaviour in docs - Our documentation for CREATE TABLE says: CREATE TABLE also automatically creates a data type that represents the composite type corresponding to one row of the table. Therefore, tables cannot have the same name as any existing data type in the same schema. But these composite tables are only sometimes there hannuk=# CREATE TABLE pair(a int, b int); CREATE TABLE hannuk=# INSERT INTO pair VALUES(1,2); INSERT 0 1 hannuk=# select pg_typeof(p) from pair as p; pg_typeof --- pair hannuk=# select pg_typeof(pg_typeof(p)) from pair as p; pg_typeof --- regtype # first case where I can not use the table-defined type hannuk=# create table anoter_pair of pair; ERROR: type pair is not a composite type # the type definitely is there as promised hannuk=# create type pair as (a int, b int); ERROR: type "pair" already exists # and I can create similar type wit other name and use it to create table hannuk=# create type pair2 as (a int, b int); CREATE TYPE hannuk=# create table anoter_pair of pair2; CREATE TABLE # and i can even use it in LIKE hannuk=# CREATE TABLE pair3(like pair2); CREATE TABLE # the type is present in pg_type with type 'c' for Composite hannuk=# select typname, typtype from pg_type where typname = 'pair'; typname | typtype -+- pair| c (1 row) # and I can add comment to the type hannuk=# COMMENT ON TYPE pair is 'A Shroedingers type'; COMMENT # but \dT does not show it (second case) hannuk=# \dT pair List of data types Schema | Name | Description +--+- (0 rows) --- Hannu
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi, In "Re: Make COPY format extendable: Extract COPY TO format implementations" on Thu, 7 Mar 2024 15:32:01 +0900, Michael Paquier wrote: > While on it, here are some profiles based on HEAD and v17 with the > previous tests (COPY TO /dev/null, COPY FROM data sent to the void). > ... > > So, in short, and that's not really a surprise, there is no effect > once we use the dispatching with the routines only when a format would > want to plug-in with the APIs, but a custom format would still have a > penalty of a few percents for both if bottlenecked on CPU. Thanks for sharing these profiles! I agree with you. This shows that the v17 approach doesn't affect the current text/csv/binary implementations. (The v17 approach just adds 2 new structs, Copy{From,To}Rountine, without changing the current text/csv/binary implementations.) Can we push the v17 patch and proceed following implementations? Could someone (especially a PostgreSQL committer) take a look at this for double-check? Thanks, -- kou
Re: improve ssl error code, 2147483650
Daniel Gustafsson writes: > On 7 Mar 2024, at 20:58, Tom Lane wrote: >> This could probably do with a comment, and we need to propagate >> the fix into libpq's copy of the function too. Barring objections, >> I'll take care of that and push it later today. > LGTM. Done so far as be-secure-openssl.c and fe-secure-openssl.c are concerned. But I noticed that src/common/cryptohash_openssl.c and src/common/hmac_openssl.c have their own, rather half-baked versions of SSLerrmessage. I didn't do anything about that in the initial patch, because it's not clear to me that those routines would ever see system-errno-based errors, plus their comments claim that returning NULL isn't terribly bad. But if we want to do something about it, I don't think that maintaining 3 copies of the code is the way to go. Maybe push be-secure-openssl.c's version into src/common? regards, tom lane
Re: Confine vacuum skip logic to lazy_scan_skip
On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > > I made some further changes. I kept them as separate commits for easier > > review, see the commit messages for details. Any thoughts on those changes? > > I've given some inline feedback on most of the extra patches you added. > Short answer is they all seem fine to me except I have a reservations > about 0008 because of the number of blkno variables flying around. I > didn't have a chance to rebase these into my existing changes today, so > either I will do it tomorrow or, if you are feeling like you're on a > roll and want to do it, that also works! Attached v7 contains all of the changes that you suggested plus some additional cleanups here and there. > > I feel heap_vac_scan_get_next_block() function could use some love. Maybe > > just some rewording of the comments, or maybe some other refactoring; not > > sure. But I'm pretty happy with the function signature and how it's called. I've cleaned up the comments on heap_vac_scan_next_block() in the first couple patches (not so much in the streaming read user). Let me know if it addresses your feelings or if I should look for other things I could change. I will say that now all of the variable names are *very* long. I didn't want to remove the "state" from LVRelState->next_block_state. (In fact, I kind of miss the "get". But I had to draw the line somewhere.) I think without "state" in the name, next_block sounds too much like a function. Any ideas for shortening the names of next_block_state and its members or are you fine with them? > I was wondering if we should remove the "get" and just go with > heap_vac_scan_next_block(). I didn't do that originally because I didn't > want to imply that the next block was literally the sequentially next > block, but I think maybe I was overthinking it. > > Another idea is to call it heap_scan_vac_next_block() and then the order > of the words is more like the table AM functions that get the next block > (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to > be too similar to those since this isn't a table AM callback. I've done a version of this. > > From 27e431e8dc69bbf09d831cb1cf2903d16f177d74 Mon Sep 17 00:00:00 2001 > > From: Heikki Linnakangas > > Date: Wed, 6 Mar 2024 20:58:57 +0200 > > Subject: [PATCH v6 6/9] Move vmbuffer back to a local varible in > > lazy_scan_heap() > > > > It felt confusing that we passed around the current block, 'blkno', as > > an argument to lazy_scan_new_or_empty() and lazy_scan_prune(), but > > 'vmbuffer' was accessed directly in the 'scan_state'. > > > > It was also a bit vague, when exactly 'vmbuffer' was valid. Calling > > heap_vac_scan_get_next_block() set it, sometimes, to a buffer that > > might or might not contain the VM bit for 'blkno'. But other > > functions, like lazy_scan_prune(), assumed it to contain the correct > > buffer. That was fixed up visibilitymap_pin(). But clearly it was not > > "owned" by heap_vac_scan_get_next_block(), like the other 'scan_state' > > fields. > > > > I moved it back to a local variable, like it was. Maybe there would be > > even better ways to handle it, but at least this is not worse than > > what we have in master currently. > > I'm fine with this. 
I did it the way I did (grouping it with the > "next_unskippable_block" in the skip struct), because I think that this > vmbuffer is always the buffer containing the VM bit for the next > unskippable block -- which sometimes is the block returned by > heap_vac_scan_get_next_block() and sometimes isn't. > > I agree it might be best as a local variable but perhaps we could retain > the comment about it being the block of the VM containing the bit for the > next unskippable block. (Honestly, the whole thing is very confusing). In 0001-0004 I've stuck with only having the local variable vmbuffer in lazy_scan_heap(). In 0006 (introducing pass 1 vacuum streaming read user) I added a vmbuffer back to the next_block_state (while also keeping the local variable vmbuffer in lazy_scan_heap()). The vmbuffer in lazy_scan_heap() contains the block of the VM containing visi information for the next unskippable block or for the current block if its visi information happens to be in the same block of the VM as either 1) the next unskippable block or 2) the most recently processed heap block. Streaming read vacuum separates this visibility check in heap_vac_scan_next_block() from the main loop of lazy_scan_heap(), so we can't just use a local variable anymore. Now the local variable vmbuffer in lazy_scan_heap() will only already contain the block with the visi information for the to-be-processed block if it happens to be in the same VM block as the most recently processed heap block. That means potentially more VM fetches. However, by adding a vmbuffer to next_block_state, the callback may be able to avoid extra VM fetches from one i
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote: > > It looks like it requires a link with pgport_srv but I'm not sure. It > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > 2019, VS 2019 - Meson & ninja, too. Unfortunately, none of the Windows animals happened to run both after the initial commit and before removing the (seemingly useless on our daily platfoms) link. I'll confirm on my own CI branch in a few minutes.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 10:04 AM John Naylor wrote: > > On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote: > > > > It looks like it requires a link with pgport_srv but I'm not sure. It > > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server > > 2019, VS 2019 - Meson & ninja, too. > > Unfortunately, none of the Windows animals happened to run both after > the initial commit and before removing the (seemingly useless on our > daily platfoms) link. I'll confirm on my own CI branch in a few > minutes. Yesterday I've confirmed the something like the below fixes the problem happened in Windows CI: --- a/src/test/modules/test_radixtree/meson.build +++ b/src/test/modules/test_radixtree/meson.build @@ -12,6 +12,7 @@ endif test_radixtree = shared_module('test_radixtree', test_radixtree_sources, + link_with: host_system == 'windows' ? pgport_srv : [], kwargs: pg_test_mod_args, ) test_install_libs += test_radixtree But I'm not sure it's the right fix especially because I guess it could raise "AddressSanitizer: odr-violation" error on Windows. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: Glad you shared before I went and did it. > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', >test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], I don't see any similar coding elsewhere, so that leaves me wondering if we're missing something. On the other hand, maybe no test modules use files in src/port ... >kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree > > But I'm not sure it's the right fix especially because I guess it > could raise "AddressSanitizer: odr-violation" error on Windows. Well, it's now at zero definitions that it can see, so I imagine it's possible that adding the above would not cause more than one. In any case, we might not know since as far as I can tell the MSVC animals don't have address sanitizer. I'll look around some more, and if I don't get any revelations, I guess we should go with the above.
Re: Patch: Add parse_type Function
Hello Hackers, On Feb 25, 2024, at 13:00, David E. Wheeler wrote: >> postgres=# SELECT to_regtypemod('timestamp(-4)'); >> ERROR: syntax error at or near "-" >> LINE 1: SELECT to_regtypemod('timestamp(-4)'); >> ^ >> CONTEXT: invalid type name "timestamp(-4)" >> >> postgres=# SELECT to_regtypemod('text(-4)'); >> ERROR: type modifier is not allowed for type "text" > > Yeah, there was quite a bit of discussion of this issue back in September[1]. > >> This behaviour is mentioned in the documentation, so I'd say it is ok. > > This is my attempt to make it clearer that it can return an error, but I > don’t love the wording TBH. I’ve rebased the patch and, in an attempt to clarify this behavior, added a couple of examples to the docs for to_regtype. Updated patch attached. Best, David v10-0001-Add-to_regtypemod-SQL-function.patch Description: Binary data
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Mon, Mar 04, 2024 at 05:46:56PM +0900, Michael Paquier wrote: > > Since InvalidOid is already taken, I guess you might need to introduce a > > boolean flag, like set_relam, indicating that the statement has an > > ACCESS METHOD clause. > > Yes, I don't see an alternative. The default needs a different field > to be tracked down to the execution. The data structure you used (defaultAccessMethod) allows this, which is intended to be prohibited: postgres=# ALTER TABLE a SET access method default, SET access method default; ALTER TABLE As you wrote it, you pass the "defaultAccessMethod" bool to ATExecSetAccessMethodNoStorage(), which seems odd. Why not just pass the target amoid as newAccessMethod ? When I fooled with this on my side, I called it "chgAccessMethod" to follow "chgPersistence". I think "is default" isn't the right data structure. Attached a relative patch with my version. Also: I just realized that instead of adding a bool, we could test (tab->rewrite & AT_REWRITE_ACCESS_METHOD) != 0 +-- Default and AM set in in clause are the same, relam should be set. in in? -- Justin >From 8aca5e443ebb41b7de1ba18b519a33a97a41a897 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 4 Mar 2024 17:42:04 +0900 Subject: [PATCH 1/2] Allow specifying access method of partitioned tables ..to be inherited by partitions See also: ca4103025dfe26eaaf6a500dec9170fbb176eebc 8586bf7ed8889f39a59dd99b292014b73be85342 ebfe2dbd6b624e2a428e14b7ee9322cc096f63f7 - prevent DROP AM Authors: Justin Pryzby, Soumyadeep Chakraborty --- doc/src/sgml/catalogs.sgml | 5 +- doc/src/sgml/ref/alter_table.sgml | 8 ++ doc/src/sgml/ref/create_table.sgml | 4 + src/backend/commands/tablecmds.c| 183 +--- src/backend/utils/cache/lsyscache.c | 22 +++ src/backend/utils/cache/relcache.c | 7 + src/bin/pg_dump/pg_dump.c | 3 +- src/bin/pg_dump/t/002_pg_dump.pl| 35 + src/include/catalog/pg_class.h | 4 +- src/include/utils/lsyscache.h | 1 + src/test/regress/expected/create_am.out | 158 +++- src/test/regress/sql/create_am.sql | 91 +++- 12 files changed, 487 insertions(+), 34 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 0ae97d1adaa..e4fcacb610a 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -1989,8 +1989,9 @@ SCRAM-SHA-256$:&l If this is a table or an index, the access method used (heap, - B-tree, hash, etc.); otherwise zero (zero occurs for sequences, - as well as relations without storage, such as views) + B-tree, hash, etc.); otherwise zero. Zero occurs for sequences, + as well as relations without storage, such as views. Partitioned + tables may use a non-zero value. diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 96e3d776051..779c8b31226 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -737,6 +737,14 @@ WITH ( MODULUS numeric_literal, REM DEFAULT changes the access method of the table to . + + When applied to a partitioned table, there is no data to rewrite, but any + partitions created afterwards will use that access method unless + overridden by a USING clause. Using + DEFAULT switches the partitioned table to rely on + the value of default_table_access_method set + when new partitions are created, instead. 
+ diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 4cbaaccaf7c..b79081a5ec1 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -1330,6 +1330,10 @@ WITH ( MODULUS numeric_literal, REM method is chosen for the new table. See for more information. + + When creating a partition, the table access method is the access method + of its partitioned table, if set. + diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7014be80396..cf595329b6c 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -184,6 +184,7 @@ typedef struct AlteredTableInfo bool verify_new_notnull; /* T if we should recheck NOT NULL */ int rewrite; /* Reason for forced rewrite, if any */ Oid newAccessMethod; /* new access method; 0 means no change */ + bool defaultAccessMethod; /* true if SET ACCESS METHOD DEFAULT */ Oid newTableSpace; /* new tablespace; 0 means no change */ bool chgPersistence; /* T if SET LOGGED/UNLOGGED is used */ char newrelpersistence; /* if above is true */ @@ -589,6 +590,8 @@ static ObjectAddress ATExecClusterOn(Relation rel, const char *indexName, LOCKMODE lockmode); static void ATExecDropCluster(Relation rel, LOCKMODE lockmode); static v
Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY
On Fri, 8 Mar 2024 at 00:54, Ashutosh Bapat wrote: > > On Thu, Mar 7, 2024 at 4:39 PM David Rowley wrote: >> I think the fix should go in appendOrderByClause(). It's at that >> point we look for the EquivalenceMember for the relation and can >> easily discover if the em_expr is a Const. I think we can safely just >> skip doing any ORDER BY stuff and not worry about if the >> literal format of the const will appear as a reference to an ordinal >> column position in the ORDER BY clause. > > deparseSortGroupClause() calls deparseConst() with showtype = 1. > appendOrderByClause() may want to do something similar for consistency. Or > remove it from deparseSortGroupClause() as well? The fix could also be to use deparseConst() in appendOrderByClause() and have that handle Const EquivalenceMember instead. I'd rather just skip them. To me, that seems less risky than ensuring deparseConst() handles all Const types correctly. Also, as far as adjusting GROUP BY to eliminate Consts, I don't think that's material for a bug fix. If we want to consider doing that, that's for master only. >> I wonder if we need a test... > > Yes. I've added two of those in the attached. I also changed the way the delimiter stuff works as the exiting code seems to want to avoid having a bool flag to record if we're adding the first item. The change I'm making means the bool flag is now required, so we may as well use that flag to deal with the delimiter append too. David diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 8fc66fa11c..a30f00179e 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -3912,13 +3912,12 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, { ListCell *lcell; int nestlevel; - const char *delim = " "; StringInfo buf = context->buf; + boolgotone = false; /* Make sure any constants in the exprs are printed portably */ nestlevel = set_transmission_modes(); - appendStringInfoString(buf, " ORDER BY"); foreach(lcell, pathkeys) { PathKey*pathkey = lfirst(lcell); @@ -3951,6 +3950,25 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, em_expr = em->em_expr; + /* +* If the member is a Const expression then we needn't add it to the +* ORDER BY clause. This can happen in UNION ALL queries where the +* union child targetlist has a Const. Adding these would be wasteful, +* but also, for INT columns, an integer literal will be seen as an +* ordinal column position rather than a value to sort by, so we must +* skip these. +*/ + if (IsA(em_expr, Const)) + continue; + + if (!gotone) + { + appendStringInfoString(buf, " ORDER BY "); + gotone = true; + } + else + appendStringInfoString(buf, ", "); + /* * Lookup the operator corresponding to the strategy in the opclass. 
* The datatype used by the opfamily is not necessarily the same as @@ -3965,7 +3983,6 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, pathkey->pk_strategy, em->em_datatype, em->em_datatype, pathkey->pk_opfamily); - appendStringInfoString(buf, delim); deparseExpr(em_expr, context); /* @@ -3975,7 +3992,6 @@ appendOrderByClause(List *pathkeys, bool has_final_sort, appendOrderBySuffix(oprid, exprType((Node *) em_expr), pathkey->pk_nulls_first, context); - delim = ", "; } reset_transmission_modes(nestlevel); } diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c355e8f3f7..f3339b4ac8 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -919,6 +919,44 @@ EXPLAIN (VERBOSE, COSTS OFF) Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" (6 rows) +-- Ensure we don't push ORDER BY expressions which are Consts at the UNION +-- child level to the foreign server. +EXPLAIN (VERBOSE, COSTS OFF) +SELECT * FROM ( +SELECT 1 AS type,c1 FROM ft1 +UNION ALL +SELECT 2 AS type,c1 FROM ft2 +) ORDER BY type,c1; + QUERY PLAN +- + Merge Append + Sort Key: (1), ft1.c1 + -> Foreign Scan on public.ft1 + Output: 1, ft1.c1 + Remote SQL: SELECT "C 1" FROM "
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > Yesterday I've confirmed the something like the below fixes the > problem happened in Windows CI: > > --- a/src/test/modules/test_radixtree/meson.build > +++ b/src/test/modules/test_radixtree/meson.build > @@ -12,6 +12,7 @@ endif > > test_radixtree = shared_module('test_radixtree', >test_radixtree_sources, > + link_with: host_system == 'windows' ? pgport_srv : [], >kwargs: pg_test_mod_args, > ) > test_install_libs += test_radixtree pgport_srv is for backend, shared libraries should be using pgport_shlib Further, the top level meson.build has: # all shared libraries not part of the backend should depend on this frontend_shlib_code = declare_dependency( include_directories: [postgres_inc], link_with: [common_shlib, pgport_shlib], sources: generated_headers, dependencies: [shlib_code, os_deps, libintl], ) ...but the only things that declare needing frontend_shlib_code are in src/interfaces/. In any case, I'm trying it in CI branch with pgport_shlib now.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 9:53 AM John Naylor wrote: > > On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote: > > Yesterday I've confirmed the something like the below fixes the > > problem happened in Windows CI: > > > > --- a/src/test/modules/test_radixtree/meson.build > > +++ b/src/test/modules/test_radixtree/meson.build > > @@ -12,6 +12,7 @@ endif > > > > test_radixtree = shared_module('test_radixtree', > >test_radixtree_sources, > > + link_with: host_system == 'windows' ? pgport_srv : [], > >kwargs: pg_test_mod_args, > > ) > > test_install_libs += test_radixtree > > pgport_srv is for backend, shared libraries should be using pgport_shlib > In any case, I'm trying it in CI branch with pgport_shlib now. That seems to work, so I'll push that just to get things green again.
Re: Properly pathify the union planner
On Fri, 8 Mar 2024 at 00:39, Richard Guo wrote: > I would like to have another look, but it might take several days. > Would that be too late? Please look. Several days is fine. I'd like to start looking on Monday or Tuesday next week. Thanks David
Re: Synchronizing slots from primary to standby
On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu) wrote: > > > Attach the V108 patch set which addressed above and Peter's comments. > I also removed the check for "*" in guc check hook. > Pushed with minor modifications. I'll keep an eye on BF. BTW, one thing that we should try to evaluate a bit more is the traversal of slots in StandbySlotsHaveCaughtup() where we verify if all the slots mentioned in standby_slot_names have received the required WAL. Even if the standby_slot_names list is short, the total number of slots can be much larger, which can lead to an increase in CPU usage during traversal. There is an optimization that allows us to cache ss_oldest_flush_lsn and ensures that we don't need to traverse the slots each time, so it may not be hit frequently, but there is still a chance. I see it is possible to further optimize this area by caching the position of each slot mentioned in standby_slot_names in the replication_slots array, but I am not sure whether it is worth it. -- With Regards, Amit Kapila.
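To make the caching idea concrete, here is a toy sketch; the structures below are simplified stand-ins, not the real ReplicationSlot shared-memory data structures. Each name in standby_slot_names is resolved to its array index once, and later calls only revalidate and check the cached entries.

/*
 * Toy sketch of caching slot positions for standby_slot_names checks.
 * ToySlot and the fixed-size arrays are illustrative stand-ins only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLOTS           128
#define MAX_STANDBY_SLOTS   8

typedef struct ToySlot
{
    char        name[64];
    bool        in_use;
    uint64_t    flush_lsn;
} ToySlot;

static ToySlot slots[MAX_SLOTS];

/* cached array positions of the named standby slots; -1 = not built yet */
static int  cached_idx[MAX_STANDBY_SLOTS];
static int  ncached = -1;

static bool
standby_slots_have_caught_up(char *const *names, int nnames,
                             uint64_t wait_lsn)
{
    if (ncached < 0)
    {
        /* one full scan over all slots builds the cache */
        ncached = 0;
        for (int n = 0; n < nnames; n++)
            for (int i = 0; i < MAX_SLOTS; i++)
                if (slots[i].in_use && strcmp(slots[i].name, names[n]) == 0)
                    cached_idx[ncached++] = i;
    }

    /* later calls touch only the few cached entries */
    for (int n = 0; n < ncached; n++)
    {
        ToySlot    *s = &slots[cached_idx[n]];

        /* a dropped/recreated slot would require rebuilding the cache */
        if (!s->in_use || s->flush_lsn < wait_lsn)
            return false;
    }
    return true;
}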
Re: WIP Incremental JSON Parser
On 2024-03-07 Th 10:28, Jacob Champion wrote: Some more observations as I make my way through the patch: In src/common/jsonapi.c, +#define JSON_NUM_NONTERMINALS 6 Should this be 5 now? Yep. + res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem), + chunk, size, is_last); + + expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE; + + if (res != expected) + json_manifest_parse_failure(context, "parsing failed"); This leads to error messages like pg_verifybackup: error: could not parse backup manifest: parsing failed which I would imagine is going to lead to confused support requests in the event that someone does manage to corrupt their manifest. Can we make use of json_errdetail() and print the line and column numbers? Patch 0001 over at [1] has one approach to making json_errdetail() workable in frontend code. Looks sound on a first look. Maybe we should get that pushed ASAP so we can take advantage of it. Top-level scalars like `false` or `12345` do not parse correctly if the chunk size is too small; instead json_errdetail() reports 'Token "" is invalid'. With small chunk sizes, json_errdetail() additionally segfaults on constructions like `[tru]` or `12zz`. Ugh. Will investigate. For my local testing, I'm carrying the following diff in 001_test_json_parser_incremental.pl: - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds"); - ok(!$stderr, "chunk size $size: no error output"); + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds"); + is($stderr, "", "chunk size $size: no error output"); This is particularly helpful when a test fails spuriously due to code coverage spray on stderr. Makes sense, thanks. I'll have a fresh patch set soon which will also take care of the bitrot. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
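In case it helps while poking at the small-chunk failures, a minimal driver for feeding input in fixed-size chunks might look like the sketch below. It leans on the pg_parse_json_incremental() call shape quoted earlier in the thread (an in-progress API, so details may differ), and the wrapper name is made up.

/*
 * Sketch of a chunked-parse driver, based on the incremental API as used
 * above: pg_parse_json_incremental(lex, sem, chunk, size, is_last).
 */
#include "postgres_fe.h"
#include "common/jsonapi.h"

static bool
parse_in_chunks(JsonLexContext *lex, JsonSemAction *sem,
                char *json, size_t len, size_t chunk_size)
{
    size_t      off = 0;

    while (off < len)
    {
        size_t      n = (chunk_size < len - off) ? chunk_size : len - off;
        bool        is_last = (off + n == len);
        JsonParseErrorType res;

        res = pg_parse_json_incremental(lex, sem, json + off, n, is_last);

        /* partial chunks should report "incomplete", the final one success */
        if (res != (is_last ? JSON_SUCCESS : JSON_INCOMPLETE))
            return false;       /* caller can report json_errdetail(res, lex) */

        off += n;
    }
    return true;
}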
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada wrote: > > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith wrote: > > > > 4a. > > The comment in simplehash.h says > > * The following parameters are only relevant when SH_DEFINE is defined: > > * - SH_KEY - ... > > * - SH_EQUAL(table, a, b) - ... > > * - SH_HASH_KEY(table, key) - ... > > * - SH_STORE_HASH - ... > > * - SH_GET_HASH(tb, a) - ... > > > > So maybe it is nicer to reorder the #defines in that same order? > > > > SUGGESTION: > > +#define SH_PREFIX bh_nodeidx > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > > +#define SH_KEY_TYPE bh_node_type > > +#define SH_SCOPE extern > > +#ifdef FRONTEND > > +#define SH_RAW_ALLOCATOR pg_malloc0 > > +#endif > > > > +#define SH_DEFINE > > +#define SH_KEY key > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > > +#define SH_HASH_KEY(tb, key) \ > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > > +#include "lib/simplehash.h" > > I'm really not sure it helps increase readability. For instance, for > me it's readable if SH_DEFINE and SH_DECLARE come to the last before > #include since it's more obvious whether we want to declare, define or > both. Other simplehash.h users also do so. > OK. > > 5. > > + * > > + * If 'indexed' is true, we create a hash table to track of each node's > > + * index in the heap, enabling to perform some operations such as removing > > + * the node from the heap. > > */ > > binaryheap * > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > + bool indexed, void *arg) > > > > BEFORE > > ... enabling to perform some operations such as removing the node from the > > heap. > > > > SUGGESTION > > ... to help make operations such as removing nodes more efficient. > > > > But these operations literally require the indexed binary heap as we > have an assertion: > > void > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > { > bh_nodeidx_entry *ent; > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > Assert(heap->bh_indexed); > I didn’t quite understand -- the operations mentioned are "operations such as removing the node", but binaryheap_remove_node() also removes a node from the heap. So I still felt the comment wording of the patch is not quite correct. Now, if the removal of a node from an indexed heap can *only* be done using binaryheap_remove_node_ptr() then: - the other removal functions (binaryheap_remove_*) probably need some comments to make sure nobody is tempted to call them directly for an indexed heap. - maybe some refactoring and assertions are needed to ensure those *cannot* be called directly for an indexed heap. > > 7b. > > IMO the parameters would be better the other way around (e.g. 'index' > > before the 'node') because that's what the assignments look like: > > > > > > heap->bh_nodes[heap->bh_size] = d; > > > > becomes: > > bh_set_node(heap, heap->bh_size, d); > > > > I think it assumes heap->bh_nodes is an array. But if we change it in > the future, it will no longer make sense. I think it would make more > sense if we define the parameters in an order like "we set the 'node' > at 'index'". What do you think? YMMV. The patch code is also OK by me if you prefer it. // And, here are some review comments for v8-0002. == 1. delete_nodeidx +/* + * Remove the node's index from the hash table if the heap is indexed. 
+ */ +static bool +delete_nodeidx(binaryheap *heap, bh_node_type node) +{ + if (!binaryheap_indexed(heap)) + return false; + + return bh_nodeidx_delete(heap->bh_nodeidx, node); +} 1a. In v8 this function was changed to now return bool, so, I think the function comment should explain the meaning of that return value. ~ 1b. I felt the function body is better expressed positively: "If this then do that", instead of "If not this then do nothing otherwise do that" SUGGESTION if (binaryheap_indexed(heap)) return bh_nodeidx_delete(heap->bh_nodeidx, node); return false; ~~~ 2. +static void +replace_node(binaryheap *heap, int index, bh_node_type new_node) +{ + bool found PG_USED_FOR_ASSERTS_ONLY; + + /* Quick return if not necessary to move */ + if (heap->bh_nodes[index] == new_node) + return; + + /* + * Remove overwritten node's index. The overwritten node's position must + * have been tracked, if enabled. + */ + found = delete_nodeidx(heap, heap->bh_nodes[index]); + Assert(!binaryheap_indexed(heap) || found); + + /* + * Replace it with the given new node. This node's position must also be + * tracked as we assume to replace the node by the existing node. + */ + found = set_node(heap, new_node, index); + Assert(!binaryheap_indexed(heap) || found); +} 2a. /Remove overwritten/Remove the overwritten/ /replace the node by the existing node/replace the node with the existing node/ ~ 2b. It might be helpful to declare another local var... bh_node_type
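To make the indexed-heap mechanics under review easier to follow, here is a conceptual sketch (not the patch text) of how the bh_nodeidx hash table lets a removal skip the linear scan over bh_nodes[], and how the mapping is kept in sync when a node is stored at a position. The 'index' field on bh_nodeidx_entry and the exact generated helper names are assumptions on my part.

#include "postgres.h"
#include "lib/binaryheap.h"

/*
 * Remove 'node' from an indexed heap.  Instead of scanning bh_nodes[] for
 * the node, consult the hash table that maps a node to its current array
 * position.  (In the patch, binaryheap_remove_node() itself keeps the
 * mapping up to date for any nodes it moves while sifting.)
 */
static void
remove_node_via_index(binaryheap *heap, bh_node_type node)
{
    bh_nodeidx_entry *ent;

    Assert(binaryheap_indexed(heap));

    /* O(1) lookup of the node's position in bh_nodes[]. */
    ent = bh_nodeidx_lookup(heap->bh_nodeidx, node);
    Assert(ent != NULL);

    /* Remove at that position; sifting restores the heap property. */
    binaryheap_remove_node(heap, ent->index);
}

/*
 * Keep the mapping in sync whenever a node is (re)stored at a position,
 * as set_node() does in the patch: insert-or-update the entry for 'node'.
 */
static void
track_node_position(binaryheap *heap, bh_node_type node, int index)
{
    bool        found;
    bh_nodeidx_entry *ent;

    ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found);
    ent->index = index;
}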
Re: remaining sql/json patches
I looked at the documentation again. One more change for JSON_QUERY:

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3e58ebd2..0c49b321 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18715,8 +18715,8 @@ ERROR: jsonpath array subscript is out of bounds
 be of type jsonb.
-The ON EMPTY clause specifies the behavior if the
-path_expression yields no value at all; the
+The ON EMPTY clause specifies the behavior if applying the
+path_expression to the context_item yields no value at all; the
 default when ON EMPTY is not specified is to return a null value.
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
On 2024-03-08 01:12 +0100, Hannu Krosing wrote:
> I could not find any explanation of the following behaviour in docs -
>
> Our documentation for CREATE TABLE says:
>
> CREATE TABLE also automatically creates a data type that represents
> the composite type corresponding to one row of the table. Therefore,
> tables cannot have the same name as any existing data type in the same
> schema.
>
> But these composite tables are only sometimes there

There's a distinction between stand-alone composite types created with
CREATE TYPE and those created implicitly via CREATE TABLE. The former is
also called "free-standing" in the docs for pg_type.typrelid[1].

> hannuk=# CREATE TABLE pair(a int, b int);
> CREATE TABLE
>
> hannuk=# INSERT INTO pair VALUES(1,2);
> INSERT 0 1
>
> hannuk=# select pg_typeof(p) from pair as p;
>  pg_typeof
> -----------
>  pair
>
> hannuk=# select pg_typeof(pg_typeof(p)) from pair as p;
>  pg_typeof
> -----------
>  regtype
>
> # first case where I can not use the table-defined type
>
> hannuk=# create table anoter_pair of pair;
> ERROR: type pair is not a composite type

That error message is simply misleading. What gets checked here is that
type "pair" was created with CREATE TYPE. The attached patch fixes the
error message and also documents that requirement.

check_of_type() already addresses this limitation:

/*
 * check_of_type
 *
 * Check whether a type is suitable for CREATE TABLE OF/ALTER TABLE OF.  If it
 * isn't suitable, throw an error.  Currently, we require that the type
 * originated with CREATE TYPE AS.  We could support any row type, but doing so
 * would require handling a number of extra corner cases in the DDL commands.
 * (Also, allowing domain-over-composite would open up a can of worms about
 * whether and how the domain's constraints should apply to derived tables.)
 */

Not sure what those corner cases are, but table inheritance is one of them:

I played around with typeOk in check_of_type() to also accept the composite
types implicitly created by CREATE TABLE:

    typeOk = (typeRelation->rd_rel->relkind == RELKIND_COMPOSITE_TYPE ||
              typeRelation->rd_rel->relkind == RELKIND_RELATION);

With that creating typed tables of parent and child works as expected:

CREATE TABLE parent (a int);
CREATE TABLE child (b int) INHERITS (parent);
CREATE TABLE of_parent OF parent;
CREATE TABLE of_child OF child;

\d parent
              Table "public.parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Number of child tables: 1 (Use \d+ to list them.)

\d of_parent
             Table "public.of_parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Typed table of type: parent

\d child
              Table "public.child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Inherits: parent

\d of_child
             Table "public.of_child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Typed table of type: child

But adding columns to parent does not change the typed tables:

ALTER TABLE parent ADD c int;

\d parent
              Table "public.parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 c      | integer |           |          |
Number of child tables: 1 (Use \d+ to list them.)
\d of_parent
             Table "public.of_parent"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
Typed table of type: parent

\d child
              Table "public.child"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Inherits: parent

\d of_child
             Table "public.of_child"
 Column |  Type   | Collation | Nullable | Default
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I wrote: > The attached patch fixes the error message and also documents that > requirement. On second thought, adding a separate error message doesn't really make sense. The attached v2 is a simpler patch that instead modifies the existing error message. -- Erik
Re: Synchronizing slots from primary to standby
On Fri, Mar 8, 2024 at 2:33 PM Amit Kapila wrote:
> On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu) wrote:
> >
> > Attach the V108 patch set which addressed above and Peter's comments.
> > I also removed the check for "*" in guc check hook.
> >
> Pushed with minor modifications. I'll keep an eye on BF.
>
> BTW, one thing that we should try to evaluate a bit more is the
> traversal of slots in StandbySlotsHaveCaughtup() where we verify if
> all the slots mentioned in standby_slot_names have received the
> required WAL. Even if the standby_slot_names list is short the total
> number of slots can be much larger which can lead to an increase in
> CPU usage during traversal. There is an optimization that allows to
> cache ss_oldest_flush_lsn and ensures that we don't need to traverse
> the slots each time so it may not hit frequently but still there is a
> chance. I see it is possible to further optimize this area by caching
> the position of each slot mentioned in standby_slot_names in
> replication_slots array but not sure whether it is worth.
>

I tried to test this by configuring a large number of logical slots, while
making sure the standby slots are at the end of the array, and checking
whether these searches caused any performance hit in logical replication.

Setup:
1. 1 primary server configured with 3 servers in standby_slot_names, 1 extra
logical slot (not configured for failover) + 1 logical subscriber configured
as failover + 3 physical standbys (all configured to sync logical slots)
2. Same as above, but with 100 extra logical slots (not configured for
failover)
3. Same as above, but with 500 extra logical slots (not configured for
failover)

In the three setups, the 3 standby_slot_names are compared with lists of 2,
101, and 501 slots respectively. I ran pgbench for 15 minutes for all 3
setups:

Case 1: Average TPS - 8.143399 TPS
Case 2: Average TPS - 8.187462 TPS
Case 3: Average TPS - 8.190611 TPS

I see no degradation in performance; the differences are well within the
run-to-run variation seen.

Nisha also did some performance tests to record the lag introduced by
traversing a large number of slots in StandbySlotsHaveCaughtup(). The tests
logged the time at the start and end of the XLogSendLogical() call (which
eventually calls WalSndWaitForWal() --> StandbySlotsHaveCaughtup()) and
calculated the total time taken by this function during the load run for
different total slot counts.

Setup:
-- one primary with 3 standbys and one subscriber with one active
subscription
-- hot_standby_feedback=off and sync_replication_slots=false
-- made sure the standby slots remain at the end of the
ReplicationSlotCtl->replication_slots array, to measure the worst-case
scenario for the standby slot search in StandbySlotsHaveCaughtup()

pgbench was run for 15 min. Here is the data:

Case 1: with 1 logical slot, standby_slot_names having 3 slots
Run1: 626.141642 secs
Run2: 631.930254 secs

Case 2: with 100 logical slots, standby_slot_names having 3 slots
Run1: 629.38332 secs
Run2: 630.548432 secs

Case 3: with 500 logical slots, standby_slot_names having 3 slots
Run1: 629.910829 secs
Run2: 627.924183 secs

There was no degradation in performance seen.
Thanks Nisha for helping with the testing. regards, Ajin Cherian Fujitsu Australia
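Regarding the position-caching idea quoted at the top of this mail, a rough sketch of what it could look like is below. The StandbySlotPos struct, its field names, and the assumption that the caller holds ReplicationSlotControlLock are purely illustrative on my part; this is not a benchmarked proposal.

#include "postgres.h"
#include "replication/slot.h"

/* Hypothetical per-name cache of where a standby slot sits in the array. */
typedef struct StandbySlotPos
{
    char        name[NAMEDATALEN];
    int         cached_index;   /* index into replication_slots[], or -1 */
} StandbySlotPos;

/*
 * Find the named slot, trying the cached position first and falling back
 * to the usual linear scan on a miss.  Caller is assumed to hold
 * ReplicationSlotControlLock in shared mode.
 */
static ReplicationSlot *
lookup_standby_slot(StandbySlotPos *pos)
{
    if (pos->cached_index >= 0)
    {
        ReplicationSlot *slot =
            &ReplicationSlotCtl->replication_slots[pos->cached_index];

        if (slot->in_use &&
            strcmp(NameStr(slot->data.name), pos->name) == 0)
            return slot;        /* cache hit, no traversal needed */
    }

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[i];

        if (slot->in_use &&
            strcmp(NameStr(slot->data.name), pos->name) == 0)
        {
            pos->cached_index = i;      /* remember for next time */
            return slot;
        }
    }

    pos->cached_index = -1;
    return NULL;
}

Whether the extra bookkeeping is worth it is exactly the open question above, given that the existing ss_oldest_flush_lsn optimization already avoids most traversals and the numbers here show no measurable regression.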
Re: CREATE TABLE creates a composite type corresponding to the table row, which is and is not there
I wrote: > The attached v2 is a simpler patch that instead modifies the existing > error message. Forgot to attach v2. -- Erik >From 9ab6b625e8f93ae2957144ec7f0ba32cf8a3bb5b Mon Sep 17 00:00:00 2001 From: Erik Wienhold Date: Fri, 8 Mar 2024 04:21:56 +0100 Subject: [PATCH v2] Document that typed tables rely on CREATE TYPE CREATE TABLE OF accepts only composite types that were created with CREATE TYPE. Clarify that also in the error message. --- doc/src/sgml/ref/create_table.sgml| 2 ++ src/backend/commands/tablecmds.c | 2 +- src/test/regress/expected/typed_table.out | 2 +- 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml index 4cbaaccaf7..889496cf0a 100644 --- a/doc/src/sgml/ref/create_table.sgml +++ b/doc/src/sgml/ref/create_table.sgml @@ -254,6 +254,8 @@ WITH ( MODULUS numeric_literal, REM schema-qualified). A typed table is tied to its type; for example the table will be dropped if the type is dropped (with DROP TYPE ... CASCADE). + Expects the composite type to have been created with + . diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7014be8039..0f43f349dc 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -6979,7 +6979,7 @@ check_of_type(HeapTuple typetuple) if (!typeOk) ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), -errmsg("type %s is not a composite type", +errmsg("type %s is not a composite type created with CREATE TYPE", format_type_be(typ->oid; } diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out index 2e47ecbcf5..9edd744f6a 100644 --- a/src/test/regress/expected/typed_table.out +++ b/src/test/regress/expected/typed_table.out @@ -89,7 +89,7 @@ drop cascades to function get_all_persons() drop cascades to table persons2 drop cascades to table persons3 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used -ERROR: type stuff is not a composite type +ERROR: type stuff is not a composite type created with CREATE TYPE DROP TABLE stuff; -- implicit casting CREATE TYPE person_type AS (id int, name text); -- 2.44.0
Re: ALTER TABLE SET ACCESS METHOD on partitioned tables
On Thu, Mar 07, 2024 at 08:02:00PM -0600, Justin Pryzby wrote: > As you wrote it, you pass the "defaultAccessMethod" bool to > ATExecSetAccessMethodNoStorage(), which seems odd. Why not just pass > the target amoid as newAccessMethod ? + /* +* Check that the table access method exists. +* Use the access method, if specified, otherwise (when not specified) 0 +* for partitioned tables or the configured default AM. +*/ + if (amname != NULL) + amoid = get_table_am_oid(amname, false); + else if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) + amoid = 0; + else + amoid = get_table_am_oid(default_table_access_method, false); .. While using this flow to choose the AM oid, that's neater than the versions I could come up with, pretty cool. > When I fooled with this on my side, I called it "chgAccessMethod" to > follow "chgPersistence". I think "is default" isn't the right data > structure. > > Attached a relative patch with my version. Thanks. I've applied the patch to add the DEFAULT clause for now, to ease the work on this patch. > Also: I just realized that instead of adding a bool, we could test > (tab->rewrite & AT_REWRITE_ACCESS_METHOD) != 0 Hmm. I've considered that, but the boolean style is able to do the work, while being more consistent, so I'm OK with what you are doing in your 0002. > +-- Default and AM set in in clause are the same, relam should be set. > > in in? Oops, fixed. I have spent more time reviewing the whole and the tests (I didn't see much value in testing the DEFAULT clause twice for the partitioned table case and there is a test in d61a6cad6418), tweaked a few comments and the documentation, did an indentation and a commit message draft. How does that look to you? The test coverage and the semantics do what we want them to do, so that looks rather reasonable here. A second or even third pair of eyes would not hurt. -- Michael From 4114c7944c78dbb6eb85320434d942daa77fa4d8 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Fri, 8 Mar 2024 13:24:14 +0900 Subject: [PATCH v4] Allow specifying access method for partitioned tables Partitioned tables gain support for CREATE TABLE .. USING, where specifying an access method can serve as a hint when creating partitions on it. This includes support for ALTER TABLE .. SET ACCESS METHOD, where specifying DEFAULT is equivalent to resetting the partitioned table's relam to 0. Specifying a non-DEFAULT access method will set its relam to not be 0. The historical default of using pg_class.relam as InvalidOid is preserved, corresponding to the case where USING clause is not used. In this case, like previous releases, a partition is created with the access method set by default_table_access_method if its CREATE TABLE query does not specify an access method with a USING clause. The relcache of partitioned tables is not changed: rd_tableam is not set, even if a partitioned table has a relam set. Authors: Justin Pryzby, Soumyadeep Chakraborty Reviewed-by: Michael Paquier, and a few others.. 
Discussion: https://postgr.es/m/ --- src/include/catalog/pg_class.h | 4 +- src/include/utils/lsyscache.h | 1 + src/backend/commands/tablecmds.c| 172 src/backend/utils/cache/lsyscache.c | 22 +++ src/backend/utils/cache/relcache.c | 7 + src/bin/pg_dump/pg_dump.c | 3 +- src/bin/pg_dump/t/002_pg_dump.pl| 35 + src/test/regress/expected/create_am.out | 158 +- src/test/regress/sql/create_am.sql | 91 - doc/src/sgml/catalogs.sgml | 5 +- doc/src/sgml/ref/alter_table.sgml | 8 ++ doc/src/sgml/ref/create_table.sgml | 4 + 12 files changed, 471 insertions(+), 39 deletions(-) diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h index 3b7533e7bb..0fc2c093b0 100644 --- a/src/include/catalog/pg_class.h +++ b/src/include/catalog/pg_class.h @@ -219,7 +219,9 @@ MAKE_SYSCACHE(RELNAMENSP, pg_class_relname_nsp_index, 128); /* * Relation kinds with a table access method (rd_tableam). Although sequences * use the heap table AM, they are enough of a special case in most uses that - * they are not included here. + * they are not included here. Likewise, partitioned tables can have an access + * method defined so that their partitions can inherit it, but they do not set + * rd_tableam; hence, this is handled specially outside of this macro. */ #define RELKIND_HAS_TABLE_AM(relkind) \ ((relkind) == RELKIND_RELATION || \ diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h index e4a200b00e..35a8dec2b9 100644 --- a/src/include/utils/lsyscache.h +++ b/src/include/utils/lsyscache.h @@ -139,6 +139,7 @@ extern char get_rel_relkind(Oid relid); extern bool get_rel_relispartition(Oid relid); extern Oid get_rel_tablespace(Oid relid); extern char get_re
Re: Patch: Add parse_type Function
Hi David, On 2024-03-08 02:37 +0100, David E. Wheeler wrote: > I’ve rebased the patch and, in an attempt to clarify this behavior, > added a couple of examples to the docs for to_regtype. Updated patch > attached. On your latest addition: > +). Failure to extract a valid potential > +type name results in an error. For example: > + > +SELECT to_regtype('party'); > + to_regtype > + > + > + > +However, if the extracted name is not known to the system, this > function > +will return NULL. For example: > + > +SELECT to_regtype('interval nonesuch'); > +ERROR: syntax error at or near "nonesuch" > +LINE 1: select to_regtype('interval nonesuch'); > + ^ > +CONTEXT: invalid type name "interval nonesuch" > + I think you need to swap the examples. The text mentions the error case first and the NULL case second. -- Erik
Re: Synchronizing slots from primary to standby
On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian wrote: > >> Pushed with minor modifications. I'll keep an eye on BF. >> >> BTW, one thing that we should try to evaluate a bit more is the >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if >> all the slots mentioned in standby_slot_names have received the >> required WAL. Even if the standby_slot_names list is short the total >> number of slots can be much larger which can lead to an increase in >> CPU usage during traversal. There is an optimization that allows to >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse >> the slots each time so it may not hit frequently but still there is a >> chance. I see it is possible to further optimize this area by caching >> the position of each slot mentioned in standby_slot_names in >> replication_slots array but not sure whether it is worth. >> >> > > I tried to test this by configuring a large number of logical slots while > making sure the standby slots are at the end of the array and checking if > there was any performance hit in logical replication from these searches. > Thanks Ajin and Nisha. We also plan: 1) Redoing XLogSendLogical time-log related test with 'sync_replication_slots' enabled. 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a large number of slots. 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report when there are a large number of slots to traverse. thanks Shveta
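For reference, the XLogSendLogical() time logging mentioned above (and used for Nisha's earlier numbers) could be collected with something as simple as the following. This is an illustrative instrumentation hack, not the actual test patch: the accumulator variable is hypothetical and the wrapper is assumed to live in walsender.c next to XLogSendLogical().

#include "postgres.h"
#include "portability/instr_time.h"

/* Hypothetical accumulator for the total time spent in XLogSendLogical(). */
static instr_time total_logical_send_time;

static void
timed_XLogSendLogical(void)
{
    instr_time  start,
                elapsed;

    INSTR_TIME_SET_CURRENT(start);
    XLogSendLogical();          /* the walsender callback being measured */
    INSTR_TIME_SET_CURRENT(elapsed);

    INSTR_TIME_SUBTRACT(elapsed, start);
    INSTR_TIME_ADD(total_logical_send_time, elapsed);

    /* The real test logged the total once at the end of the run. */
    elog(DEBUG1, "XLogSendLogical cumulative time: %.3f ms",
         INSTR_TIME_GET_MILLISEC(total_logical_send_time));
}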
Re: Missing LWLock protection in pgstat_reset_replslot()
On Thu, Mar 07, 2024 at 11:30:55AM +0530, shveta malik wrote:
> It slightly improves the chances. pgstat_fetch_replslot is only
> called from pg_stat_get_replication_slot(), I thought it might be
> better to acquire lock before we call pgstat_fetch_replslot and
> release once we are done copying the results for that particular slot.
> But I also understand that it will not prevent someone from dropping
> that slot at a later stage when the rest of the calls of
> pg_stat_get_replication_slot() are still pending.

I doubt that there will be more callers of pgstat_fetch_replslot() in the
near future, but at least we would be a bit safer with these internal IDs
when manipulating the slots, when considered in isolation of this API call.

> So I am okay with the current one.

Okay, noted. Let's give a couple of days to others, in case there are more
comments. (Patch looked OK here after a second look this afternoon.)
--
Michael
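For what it's worth, the caller-side approach shveta describes above would look roughly like this. Treat it as a sketch only: the exact pgstat_fetch_replslot() signature and the choice of ReplicationSlotControlLock as the guarding lock are my assumptions, not the committed design.

#include "postgres.h"
#include "pgstat.h"
#include "replication/slot.h"
#include "storage/lwlock.h"

/*
 * Copy the stats entry for 'slotname' while holding the lock that keeps
 * the slot from being dropped concurrently.  Returns false if there is no
 * entry for that slot.
 */
static bool
fetch_replslot_stats(NameData slotname, PgStat_StatReplSlotEntry *out)
{
    PgStat_StatReplSlotEntry *entry;
    bool        found = false;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    entry = pgstat_fetch_replslot(slotname);
    if (entry != NULL)
    {
        *out = *entry;          /* copy out while still protected */
        found = true;
    }

    LWLockRelease(ReplicationSlotControlLock);

    return found;
}

As noted above, this only protects a single fetch; a later call in the same pg_stat_get_replication_slot() scan could still observe the slot being dropped.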