Re: improve ssl error code, 2147483650

2024-07-28 Thread Peter Eisentraut

On 25.07.24 11:36, Daniel Gustafsson wrote:

On 24 Jul 2024, at 15:32, Peter Eisentraut  wrote:

On 25.06.24 16:21, Tom Lane wrote:

Peter Eisentraut  writes:

On 21.06.24 16:53, Tom Lane wrote:

Most of libpq gets at strerror_r via SOCK_STRERROR for Windows
portability.  Is that relevant here?

Looking inside the OpenSSL code, it makes no efforts to translate
between winsock error codes and standard error codes, so I don't think
our workaround/replacement code needs to do that either.

Fair enough.

Btw., our source code comments say something like
"ERR_reason_error_string randomly refuses to map system errno values."
The reason it doesn't is exactly that it can't do it while maintaining
thread-safety.

Ah.  Do you want to improve that comment while you're at it?


Here is a patch that fixes the strerror() call and updates the comments a bit.
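
(For illustration only, not the attached patch: a minimal sketch of the
kind of thread-safe fallback being discussed, assuming the POSIX variant
of strerror_r() and a caller-saved errno. All names here are made up.)

#include <string.h>
#include <openssl/err.h>

static const char *
ssl_error_reason(unsigned long ecode, int save_errno,
                 char *buf, size_t buflen)
{
    /* Prefer OpenSSL's own reason string when it has one. */
    const char *reason = ERR_reason_error_string(ecode);

    if (reason != NULL)
        return reason;

    /* No OpenSSL string; report errno via the thread-safe strerror_r(). */
    if (save_errno != 0 && strerror_r(save_errno, buf, buflen) == 0)
        return buf;

    return "unknown SSL error";
}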


LGTM.


This ought to be backpatched like the original fix; ideally for the next minor 
releases in about two weeks.


Agreed.


done





Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alena Rybakina

On 27.07.2024 13:56, Alexander Korotkov wrote:

On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
  wrote:

To be honest, I have found a big problem in this patch - we try to perform the 
transformation every time we examine a column:

for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ...

}

I have fixed it and moved the transformation before going through the loop.

What makes you think there is a problem?


To be honest, I was bothered by the fact that we need to go through 
expressions several times even though they obviously will not fit under 
other conditions.
Just yesterday I thought that it would be worthwhile to create a list of 
candidates - expressions that did not fit because the column did not 
match the index (!match_index_to_operand(nconst_expr, indexcol, index)).


Another problem, related to the first one, is that the boolexpr could 
contain expressions referring to different operands, for example, both x 
and y. That is, we may have the problem that the optimal "ANY" 
expression may not be used, because the expression with x may come 
earlier and the loop may end earlier.


alena@postgres=# create table b (x int, y int);
CREATE TABLE
alena@postgres=# insert into b select id, id from 
generate_series(1,1000) as id;

INSERT 0 1000
alena@postgres=# create index x_idx on b(x);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or 
x=6 or x = 7 or x=8 or x=9;

  QUERY PLAN
---
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 
8) OR (x = 9))

(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 
7 or x=8 or x=9 or y=1;

  QUERY PLAN
---
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 
9) OR (y = 1))

(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 
7 or x=8 or x=9;

   QUERY PLAN

 Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
   Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
(2 rows)

Furthermore, expressions can be stored in a different order.
For example, first comes an "AND" expr, and then a group of "OR" exprs 
which we could convert to an "ANY" expr, but we won't do this because 
we will exit the loop early, according to this condition:


if (!IsA(sub_rinfo->clause, OpExpr))
           return NULL;

or it may occur due to other conditions.

alena@postgres=# create index x_y_idx on b(x,y);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE

alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4 
or x=5 or x=6 or x = 7 or x=8 or x=9;

 QUERY PLAN
-
 Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
   Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR 
(x = 7) OR (x = 8) OR (x = 9))

(2 rows)

For these reasons, I tried to keep both transformations together for 
each column and to analyze for each expr separately 
which method would be optimal.



Do you have a test case
illustrating a slow planning time?

No, I didn't have time to measure it and sorry for that. I'll do it.

When v27 performs the transformation for a particular column, it just
stops at the first unmatched OR entry.  So,
match_orclause_to_indexcol() examines just the first OR entry for all
the columns except at most one.  So, the check
match_orclause_to_indexcol() does is not much slower than what the
other match_*_to_indexcol() functions do.

I actually think this could help performance in many cases, not hurt
it.  At least we get rid of O(n^2) complexity over the number of OR
entries, which could be very many.


I agree with you that there is an overhead and your patch fixes this 
problem, but the optimizer needs a good ordering of expressions to 
apply it.


I think we can try to move the transformation to another place where 
there is already a loop pass, save both options ("OR" expr and 
"ANY" expr) in one place (through BoolExpr, like the find_duplicate_ors 
function does), and teach the optimizer to determine which option is better, 
for example, as is done now in the match_orclause_to_indexcol() function.


What do you think about it?

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Protocol question regarding Portal vs Cursor

2024-07-28 Thread Dave Cramer
On Sat, 27 Jul 2024 at 19:06, Tatsuo Ishii  wrote:

> > Yes, sorry, I should have said one can not create a with hold portal
> using
> > the BIND command
>
> Ok.
>
> It would be possible to add a new parameter to the BIND command to
> create such a portal. But it needs some changes to the existing
> protocol definition and requires protocol version up, which is a major
> pain.
>

I'm trying to add WITH HOLD to the JDBC driver, and currently I would have to:
1) rewrite the query
2) issue a new query ... declare .. and bind variables to that statement
3) execute fetch

vs

1) bind variables to the statement
2) execute fetch

The second can be done much lower in the code.

However as you mentioned this would require a new protocol version which is
unlikely to happen.

Dave


Re: Fix overflow in pg_size_pretty

2024-07-28 Thread David Rowley
On Sun, 28 Jul 2024 at 16:30, Joseph Koshakow  wrote:
> Attached is an updated patch with your approach. I removed the 0 from
> the negative case because I think it was unnecessary, but happy to add
> it back in if I missed something.

I made a few adjustments and pushed this.  I did keep the 0 - part as
some compilers don't seem to like not having the 0.  e.g. MSVC gives:

../src/backend/utils/adt/dbsize.c(578): warning C4146: unary minus
operator applied to unsigned type, result still unsigned
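
(As an aside, a standalone sketch of the pattern, with made-up names and
not the committed dbsize.c code: casting to unsigned first avoids the
signed overflow on INT64_MIN, and writing "0 - x" rather than "-x" keeps
MSVC's C4146 warning quiet.)

#include <stdint.h>

static uint64_t
absolute_size(int64_t size)
{
    if (size < 0)
        return 0 - (uint64_t) size;     /* well-defined even for INT64_MIN */
    return (uint64_t) size;
}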

I thought a bit about whether it was really worth keeping the regression
test or not and in the end decided it was likely worthwhile keeping
it, so I expanded it slightly to cover both PG_INT64_MIN and
PG_INT64_MAX values. It looks slightly less like we're earmarking the
fact that there was a bug that way, and also seems to be of some
additional value.

PG15 did see quite a significant rewrite of the pg_size_pretty code.
The bug does still exist in PG14 and earlier, but on looking at what
it would take to fix it there I got a bit unexcited at the risk to
reward ratio of adjusting that code and just left it alone. I've
backpatched only as far as PG15. I'm sure someone else will feel I
should have done something else there, but that's the judgement call I
made.

Thanks for the patch.

David




Re: Pluggable cumulative statistics

2024-07-28 Thread Michael Paquier
On Sat, Jul 27, 2024 at 03:49:42PM +0200, Dmitry Dolgov wrote:
> Agree, looks good. I've tried to quickly sketch out such a fixed
> statistic for some another extension, everything was fine and pretty
> straightforward.

That's my hope.  Thanks a lot for the feedback.

> One question, why don't you use
> pgstat_get_custom_shmem_data & pgstat_get_custom_snapshot_data outside
> of the injection points code? There seems to be a couple of possible
> places in pgstats itself.

Because these two helper routines are only able to fetch the fixed
data area in the snapshot and the control shmem structures for the
custom kinds, not the in-core ones.  We could, but the current code is
OK as well.  My point was just to ease the pluggability effort.

I would like to apply this new infrastructure stuff and move on to the
problems related to the scalability of pg_stat_statements.  So, are
there any objections with all that?
--
Michael




Re: pg_attribute.atttypmod for interval type

2024-07-28 Thread Chapman Flack
On 07/27/24 00:32, Tom Lane wrote:
> Interval typmods include a fractional-seconds-precision field as well
> as a bitmask indicating the allowed interval fields (per the SQL
> standard's weird syntax such as INTERVAL DAY TO SECOND).  Looking at
> the source code for intervaltypmodout() might be helpful:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/timestamp.c;h=69fe7860ede062fc8be42e7545b35e69c3e068c4;hb=HEAD#l1136

Also, for this kind of question, an overview of a type modifier's
contents can be found in the javadoc for the WIP PL/Java 1.7, which is
intended to model such things accurately.[0]

The model is aimed at the logical level, that is, to represent
what information is present in the typmod, the precise semantics, what
combinations are allowed/disallowed, and so on, but not the way PostgreSQL
physically packs the bits. So, for this case, what you would find there
is essentially what Tom already said, about what's logically present; it
doesn't save you the effort of looking in the PostgreSQL source if you want
to independently implement unpacking the bits.

For possible future typmod questions, it may serve as a quick way to
get that kind of logical-level description at moments when Tom is away
from the keyboard.

Regards,
-Chap


[0]
https://tada.github.io/pljava/preview1.7/pljava-api/apidocs/org.postgresql.pljava/org/postgresql/pljava/adt/Timespan.Interval.Modifier.html

I just noticed a nit in that javadoc: it says the field combination
must be "one of the named constants in this interface" but you don't
find them in the Interval.Modifier interface; they're in the containing
interface Interval itself.




Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-28 Thread Masahiko Sawada
On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman
 wrote:
>
> On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada  wrote:
> >
> > +   CREATE TABLE ${table1}(col1 int)
> > +   WITH (autovacuum_enabled=false, fillfactor=10);
> > +   INSERT INTO $table1 VALUES(7);
> > +   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> > +   CREATE INDEX on ${table1}(col1);
> > +   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> > +   INSERT INTO $table1 VALUES(7);
> >
> > These queries make sense to me; these make the radix tree wide and use
> > more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
> > The $table1 has 18182 blocks and the statistics of radix tree shows:
> >
> > max_val = 65535
> > num_keys = 18182
> > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
> >
> > Which means that the height of the tree is 2 and we use the maximum
> > size node for all nodes except for 1 node.
>
> Do you have some kind of tool that prints this out for you? That would
> be really handy.

You can add '#define RT_DEBUG' for the radix tree used in TidStore and
then call RT_STATS (e.g., local_ts_stats()).

>
> > I don't have any great idea to substantially reduce the total number
> > of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> > to make garbage tuples (although I'm not sure it's okay for this
> > test). Which reduces the amount of WAL records from 11MB to 4MB and
> > would reduce the time to catch up. But I'm not sure how much it would
> > help. There might be ideas to trigger a two-round index vacuum with
> > fewer tuples but if the tests are too optimized for the current
> > TidStore, we will have to re-adjust them if the TidStore changes in
> > the future. So I think it's better and reliable to allow
> > maintenance_work_mem to be a lower value or use injection points
> > somehow.
>
> I think we can make improvements in overall time on master and 17 with
> the examples John provided later in the thread. However, I realized
> you are right about using a DELETE instead of an UPDATE. At some point
> in my development, I needed the UPDATE to satisfy some other aspect of
> the test. But that is no longer true. A DELETE works just as well as
> an UPDATE WRT the dead items and, as you point out, much less WAL is
> created and replay is much faster.
>
> I also realized I forgot to add 043_vacuum_horizon_floor.pl to
> src/test/recovery/meson.build in 16. I will post a patch here this
> weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
> test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
> to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
> to save myself any weekend buildfarm embarrassment.
>
> As for 17 and master, I'm going to try out John's examples and see if
> it seems like it will be fast enough to commit to 17/master without
> lowering the maintenance_work_mem lower bound.

+1. Thanks.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-07-28 Thread Junwang Zhao
Hi Sutou,

On Wed, Jul 24, 2024 at 4:31 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In <9172d4eb-6de0-4c6d-beab-8210b7a22...@enterprisedb.com>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Mon, 22 Jul 2024 14:36:40 +0200,
>   Tomas Vondra  wrote:
>
> > Thanks for the summary/responses. I still think it'd be better to post a
> > summary as a separate message, not as yet another post responding to
> > someone else. If I was reading the thread, I would not have noticed this
> > is meant to be a summary. I'd even consider putting a "THREAD SUMMARY"
> > title on the first line, or something like that. Up to you, of course.
>
> It makes sense. I'll do it as a separated e-mail.
>
> > My suggestions would be to maintain this as a series of patches, making
> > incremental changes, with the "more complex" or "more experimental"
> > parts larger in the series. For example, I can imagine doing this:
> >
> > 0001 - minimal version of the patch (e.g. current v17)
> > 0002 - switch existing formats to the new interface
> > 0003 - extend the interface to add bits needed for columnar formats
> > 0004 - add DML to create/alter/drop custom implementations
> > 0005 - minimal patch with extension adding support for Arrow
> >
> > Or something like that. The idea is that we still have a coherent story
> > of what we're trying to do, and can discuss the incremental changes
> > (easier than looking at a large patch). It's even possible to commit
> > earlier parts before the later parts are quite cleanup up for commit.
> > And some changes changes may not be even meant for commit (e.g. the
> > extension) but as guidance / validation for the earlier parts.
>
> OK. I attach the v18 patch set:
>
> 0001: add a basic feature (Copy{From,To}Routine)
>   (same as the v17 but it's based on the current master)
> 0002: use Copy{From,To}Rountine for the existing formats
>   (this may not be committed because there is a
>   profiling related concern)
> 0003: add support for specifying custom format by "COPY
>   ... WITH (format 'my-format')"
>   (this also has a test)
> 0004: export Copy{From,To}StateData
>   (but this isn't enough to implement custom COPY
>   FROM/TO handlers as an extension)
> 0005: add opaque member to Copy{From,To}StateData and export
>   some functions to read the next data and flush the buffer
>   (we can implement a PoC Apache Arrow COPY FROM/TO
>   handler as an extension with this)
>
> https://github.com/kou/pg-copy-arrow is a PoC Apache Arrow
> COPY FROM/TO handler as an extension.
>
>
> Notes:
>
> * 0002: We use "static inline" and "constant argument" for
>   optimization.
> * 0002: This hides NextCopyFromRawFields() in a public
>   header because it's not used in PostgreSQL and we want to
>   use "static inline" for it. If it's a problem, we can keep
>   it and create an internal function for "static inline".
> * 0003: We use "CREATE FUNCTION" to register a custom COPY
>   FROM/TO handler. It's the same approach as tablesample.
> * 0004 and 0005: We can mix them but this patch set split
>   them for easy to review. 0004 just moves the existing
>   codes. It doesn't change the existing codes.
> * PoC: I provide it as a separated repository instead of a
>   patch because an extension exists as a separated project
>   in general. If it's a problem, I can provide it as a patch
>   for contrib/.
> * This patch set still has minimal Copy{From,To}Routine. For
>   example, custom COPY FROM/TO handlers can't process their
>   own options with this patch set. We may add more callbacks
>   to Copy{From,To}Routine later based on real world use-cases.
>
> > Unfortunately, there's not much information about what exactly the tests
> > did, context (hardware, ...). So I don't know, really. But if you share
> > enough information on how to reproduce this, I'm willing to take a look
> > and investigate.
>
> Thanks. Here is related information based on the past
> e-mails from Michael:
>
> * Use -O2 for optimization build flag
>   ("meson setup --buildtype=release" may be used)
> * Use tmpfs for PGDATA
> * Disable fsync
> * Run on scissors (what is "scissors" in this context...?)
>   
> https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c
> * Unlogged table may be used
> * Use a table that has 30 integer columns (*1)
> * Use 5M rows (*2)
> * Use '/dev/null' for COPY TO (*3)
> * Use blackhole_am for COPY FROM (*4)
>   https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
> * perf is used but used options are unknown (sorry)
>
> (*1) This SQL may be used to create the table:
>
> CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
> RETURNS VOID AS
> $func$
> DECLARE
>   query text;
> BEGIN
>   query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
>   FOR i IN 1..num_cols LOOP
> query := query || 'a_' || i::text || ' int default 1';
> IF i != num_cols THEN
>   query := query 

Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-28 Thread Alexander Cheshev
Hi Emanuel,

> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>

We have the following grammar:

notify_channel:
   ColId
   { $$ = $1; }
   | notify_channel '.' ColId
   { $$ = psprintf("%s.%s", $1, $3); }

And ColId is truncated in core scanner:

 ident = downcase_truncate_identifier(yytext, yyleng, true);

So each level is truncated independently. For this reason we observe the
behaviour which you described above.

> Another observation, probably not strictly related to this patch itself but
> the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason on why isn't in the specs?
>

I believe that is simply because not all functionality is covered by tests.
But I have noticed a very interesting test, "channel name too long":

SELECT
pg_notify('notify_async_channel_name_too_long__','sample_message1');
ERROR:  channel name too long

But the behaviour is inconsistent with NOTIFY command:

NOTIFY notify_async_channel_name_too_long__
NOTICE:  identifier
"notify_async_channel_name_too_long__" will be
truncated to ...

> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
>
Well, I believe that we can forbid too-long channel names altogether. That
provides consistent behaviour among the different ways we can send
notifications, and I agree with you that "there won't be possible to notify
it anyway". I created a patch for that and attached it to the email. In the
patch I relocated truncation from the core scanner to the parser. And as the
same core scanner is also used in plpgsql, I added three lines of code to its
scanner to basically truncate too-long identifiers there. Here is an
example of the new behaviour:

-- Should fail. Too long channel names
NOTIFY notify_async_channel_name_too_long_._;
ERROR:  channel name too long
LISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
UNLISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long

Regards,
Alexander Cheshev


On Sun, 21 Jul 2024 at 21:36, Emanuel Calvo <3man...@gmail.com> wrote:

>
> Hi Alexander,
>
> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>
> When the outer level exceeds the 64 characters limitation, the outer level
> of the
> channel name is truncated, but leaves the inner levels in the channel name
> due
> that isn't parsed in the same way.
>
> Also, even if the outer level isn't truncated, it is allowed to add
> channels names
> that exceeds the allowed identifier size.
>
> It can be reproduced just by:
>
>   # LISTEN a.a.a.a.a.lot.of.levels..; -- this doesn't fail at LISTEN,
> but fails in NOTIFY due to channel name too long
>
> In the following, the outer level is truncated, but it doesn't cut out the
> inner levels. This leaves
> listening channels that cannot receive any notifications in the queue:
>
>   # LISTEN
> notify_async_channel_name_too_long.a.a.
> ...
>   NOTICE: identifier  will be truncated
>
>   # select substring(c.channel,0,66), length(c.channel) from
> pg_listening_channels() c(channel) where length(c.channel) > 64;
>   substring |
> notify_async_channel_name_too_long_.a
>   length| 1393
>
>
> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
> Another observation, probably not strictly related to this patch itself
> but the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason on why isn't in the specs?
>
>
> Regards,
>
>
> El lun, 15 jul 2024 a las 12:59, Alexander Cheshev (<
> alex.ches...@gmail.com>) escribió:
>
>> Hi Emanuel,
>>
>> Changed implementation of the function Exec_UnlistenCommit. v2 of the
>> patch contained a bug in the function Exec_UnlistenCommit (added a test case
>> for that) an

Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
>> A recent buildfarm test failure [1] showed that the
>> intra-grant-inplace-db.spec test added with 0844b3968 may fail
>> on a slow machine

>> But as the test going to be modified by the inplace110-successors-v8.patch
>> and the modified test (with all three latest patches applied) passes
>> reliably in the same conditions, maybe this failure doesn't deserve a
>> deeper exploration.

> Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> separate change to stabilize the test.  One flake in three weeks is low enough
> to make that okay.

It's now up to three similar failures in the past ten days: in
addition to

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

I see

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37

Is it time to worry yet?  If this were HEAD only, I'd not be too
concerned; but two of these three are on allegedly-stable branches.
And we have releases coming up fast.

(BTW, I don't think taipan qualifies as a slow machine.)

regards, tom lane




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread Tom Lane
David Rowley  writes:
> I ended up fixing that another way as the above seems to be casting
> away the const for those variables. Instead, I changed the signature
> of the function to:
> static void get_memory_context_name_and_ident(MemoryContext context,
> const char **const name,  const char **const ident);
> which I think takes into account for the call site variables being
> defined as "const char *".

I did not check the history to see quite what happened here,
but Coverity thinks the end result is rather confused,
and I agree:

*** CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/mcxtfuncs.c: 58 in 
get_memory_context_name_and_ident()
52  *ident = context->ident;
53 
54  /*
55   * To be consistent with logging output, we label dynahash contexts with
56   * just the hash table name as with MemoryContextStatsPrint().
57   */
>>> CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
>>> Null-checking "ident" suggests that it may be null, but it has already 
>>> been dereferenced on all paths leading to the check.
58  if (ident && strcmp(*name, "dynahash") == 0)
59  {
60  *name = *ident;
61  *ident = NULL;
62  }
63 }

It is not clear to me exactly which of these pointers should be
presumed to be possibly-null, but certainly testing ident after
storing through it is pretty pointless.  Maybe what was intended
was

-   if (ident && strcmp(*name, "dynahash") == 0)
+   if (*name && strcmp(*name, "dynahash") == 0)

?

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Noah Misch
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop 
> > a
> > separate change to stabilize the test.  One flake in three weeks is low 
> > enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.




Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
>> Is it time to worry yet?  If this were HEAD only, I'd not be too
>> concerned; but two of these three are on allegedly-stable branches.
>> And we have releases coming up fast.

> I don't know; neither decision feels terrible to me.

Yeah, same here.  Obviously, it'd be better to spend effort on getting
the bug fix committed than to spend effort on some cosmetic
workaround.

The fact that the failure is in the isolation tests not the core
regression tests reduces my level of concern somewhat about shipping
it this way.  I think that packagers typically run the core tests
not check-world during package verification, so they won't hit this.

regards, tom lane




Re: UUID v7

2024-07-28 Thread Andrey M. Borodin


> On 24 Jul 2024, at 04:09, Sergey Prokhorenko  
> wrote:
> 
> Implementations MAY alter the actual timestamp.

Hmm… looks like we slightly misinterpreted words about clock source.
Well, that’s great, let’s get offset back.
PFA version accepting offset interval.
It works like this:
postgres=# select uuidv7(interval '-2 months');
 018fc02f-0996-7136-aeb4-8936b5a516a1


postgres=# select uuid_extract_timestamp(uuidv7(interval '-2 months'));
 2024-05-28 22:11:15.71+05

What do you think?


Best regards, Andrey Borodin.


v26-0001-Implement-UUID-v7.patch
Description: Binary data


Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Dean Rasheed
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed  wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06  -- HEAD
> call rate=3.883459e+06  -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06  -- HEAD
> call rate=3.7965685e+06  -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06   -- HEAD
> call rate=3.39948e+06 -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs
>

I spent some more time hacking on this, to try to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation, which is very
slow -- every compiler/platform I looked at on godbolt.org turns this
into a call to a slow function like __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.
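
(For readers unfamiliar with the trick, here is a small self-contained
illustration. It is not the patch's actual code or constants; it shows
division by NBASE = 10000 of a 32-bit value, while the patch applies the
same idea to NBASE^2 with 64-bit operands.)

#include <stdint.h>

/*
 * x / 10000 computed with a multiply and a shift instead of a division.
 * 3518437209 = ceil(2^45 / 10000); the approximation error is small
 * enough that the result is exact for every uint32_t input.
 */
static inline uint32_t
div10000(uint32_t x)
{
    return (uint32_t) (((uint64_t) x * UINT64_C(3518437209)) >> 45);
}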

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_small() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might be different (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice though, this
should be extremely unlikely, due to the fact that any difference
would have to propagate through MUL_GUARD_DIGITS NBASE digits (8
decimal digits), and it doesn't show up in any of the existing tests.
IMO a very small chance of different results on different platforms is
probably acceptable, but it wouldn't be acceptable to make the
threshold a runtime configurable parameter that could lead to
different results on the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build


Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
--+--+---+---+--
1 |1 |  5.973068e+06 |  6.873059e+06 | +15.07%
2 |2 |  5.646598e+06 | 6.6456665e+06 | +17.69%
3 |3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
4 |4 |  5.545298e+06 | 6.9701605e+06 | +25.69%
5 |5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
6 |6 | 4.9276335e+06 |  6.478698e+06 | +31.48%
7 |7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
8 |8 |  4.898536e+06 | 5.1599975e+06 | +5.34%
9 |9 |  4.537171e+06 |  4.830987e+06 | +6.48%
   10 |   10 |  4.308139e+06 | 4.6029265e+06 | +6.84%
   11 |   11 |  4.092612e+06 |   4.37966e+06 | +7.01%
   12 |   12 | 3.9345035e+06 |  4.213998e+06 | +7.10%
   13 |   13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
   14 |   14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
   15 |   15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
   16 |   16 |  3.657139e+06 | 3.9067975e+06 | +6.83%
   17 |   17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
   18 |   18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
   19 |   19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
   20 |   20 | 2.9012185e+06 | 3.0496352e+06 

Re: why is pg_upgrade's regression run so slow?

2024-07-28 Thread Andrew Dunstan



On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours.

Interesting.  Maybe meson is over-aggressively trying to run these
test suites in parallel?




Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
HEAD for fairywren and drongo - fairywren has just gone green, and I 
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Speed up collation cache

2024-07-28 Thread Jeff Davis
On Sun, 2024-07-28 at 00:14 +0200, Andreas Karlsson wrote:
> But even without that extra optimization I think this patch is worth 
> merging and the patch is small, simple and clean and easy to
> understand 
> and a just a clear speed up. Feels like a no brainer. I think that it
> is 
> ready for committer.

Committed, thank you.

> And then we can discuss after committing if an additional cache of
> the 
> last locale is worth it or not.

Yeah, I'm holding off on that until refactoring in the area settles,
and we'll see if it's still worth it.

Regards,
Jeff Davis





Re: Pluggable cumulative statistics

2024-07-28 Thread Dmitry Dolgov
> On Sun, Jul 28, 2024 at 10:20:45PM GMT, Michael Paquier wrote:
> I would like to apply this new infrastructure stuff and move on to the
> problems related to the scability of pg_stat_statements.  So, are
> there any objections with all that?

So far I've got nothing against :)




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread David Rowley
On Mon, 29 Jul 2024 at 04:31, Tom Lane  wrote:
> It is not clear to me exactly which of these pointers should be
> presumed to be possibly-null, but certainly testing ident after
> storing through it is pretty pointless.  Maybe what was intended
> was
>
> -   if (ident && strcmp(*name, "dynahash") == 0)
> +   if (*name && strcmp(*name, "dynahash") == 0)

It should be *ident. I just missed adding the pointer dereference when
moving that code to a function.

Thanks for the report. I'll fix shortly.

David




Re: pg_upgrade failing for 200+ million Large Objects

2024-07-28 Thread Tom Lane
I wrote:
> Alexander Korotkov  writes:
>> J4F, I have an idea to count the number of ';' signs and use it for
>> transaction size counter, since it is as upper bound estimate of
>> number of SQL commands :-)

> Hmm ... that's not a completely silly idea.  Let's keep it in
> the back pocket in case we can't easily reduce the number of
> SQL commands in some cases.

After poking at this for awhile, we can fix Justin's example
case by avoiding repeated UPDATEs on pg_attribute, so I think
we should do that.  It seems clearly a win, with no downside
other than a small increment of complexity in pg_dump.

However, that's probably not sufficient to mark this issue
as closed.  It seems likely that there are other patterns
that would cause backend memory bloat.  One case that I found
is tables with a lot of inherited constraints (not partitions,
but old-style inheritance).  For example, load the output of
this Perl script into a database:

-
for (my $i = 0; $i < 100; $i++)
{
print "CREATE TABLE test_inh_check$i (\n";
for (my $j = 0; $j < 1000; $j++)
{
print "a$j float check (a$j > 10.2),\n";
}
print "b float);\n";
print "CREATE TABLE test_inh_check_child$i() 
INHERITS(test_inh_check$i);\n";
}
-

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in
ruleutils.c, and pg_upgrade is worse --- and leaks memory too in
HEAD/v17.  The slowness was there before, so I think the lack of
field complaints indicates that this isn't a real-world use case.
Still, it's bad if pg_upgrade fails when it would not have before,
and there may be other similar issues.

So I'm forced to the conclusion that we'd better make the transaction
size adaptive as per Alexander's suggestion.
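
(For illustration, the semicolon-counting heuristic could look something
like this trivial sketch; the helper name is made up and this is not the
attached patch:)

#include <string.h>

/* Upper-bound estimate of the number of SQL commands in a TOC entry. */
static int
estimate_command_count(const char *defn)
{
    int         count = 0;

    for (const char *p = defn; (p = strchr(p, ';')) != NULL; p++)
        count++;

    return (count > 0) ? count : 1;     /* assume at least one command */
}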

In addition to the patches attached, I experimented with making
dumpTableSchema fold all the ALTER TABLE commands for a single table
into one command.  That's do-able without too much effort, but I'm now
convinced that we shouldn't.  It would break the semicolon-counting
hack for detecting that tables like these involve extra work.
I'm also not very confident that the backend won't have trouble with
ALTER TABLE commands containing hundreds of subcommands.  That's
something we ought to work on probably, but it's not a project that
I want to condition v17 pg_upgrade's stability on.

Anyway, proposed patches attached.  0001 is some trivial cleanup
that I noticed while working on the failed single-ALTER-TABLE idea.
0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
and 0003 is Alexander's suggestion.

regards, tom lane

From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 13:02:27 -0400
Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command.

The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema
all use ONLY, except for this one.  I think it's not a live bug
because we don't permit foreign tables to have children, but
it's still inconsistent.  I happened across this while refactoring
dumpTableSchema to merge all its ALTERs into one; while I'm not
certain we should actually do that, this seems like good cleanup.
---
 src/bin/pg_dump/pg_dump.c| 2 +-
 src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b8b1888bd3..7cd6a7fb97 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			if (tbinfo->relkind == RELKIND_FOREIGN_TABLE &&
 tbinfo->attfdwoptions[j][0] != '\0')
 appendPQExpBuffer(q,
-  "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n"
+  "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n"
   "%s\n"
   ");\n",
   qualrelname,
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d3dd8784d6..5bcc2244d5 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1154,7 +1154,7 @@ my %tests = (
 
 	'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => {
 		regexp => qr/^
-			\QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
+			\QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
 			\s+\Qcolumn_name 'col1'\E\n
 			\Q);\E\n
 			/xm,
-- 
2.43.5

From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 16:19:48 -0400
Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for
 binary upgrade.

Avoid issuing a separate SQL UPDATE command for each column when
directly manipulating pg_attribute contents in binary upgrade mode.
With the separate updates, we triggered a relcache invalidation with
each update.  For a table with N columns, that causes O(N^2) relcache
bloat in txn_size mode because the ta

Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Joel Jacobson
On Sun, Jul 28, 2024, at 21:18, Dean Rasheed wrote:
> Attachments:
> * v3-0002-Optimise-numeric-multiplication-using-base-NBASE-.patch
> * v3-0001-Extend-mul_var_short-to-5-and-6-digit-inputs.patch

Very nice.

I've done some initial benchmarks on my Intel Core i9-14900K machine.

To reduce noise, I've isolated a single CPU core, specifically CPU core id 31, 
to not get any work scheduled by the kernel:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-116-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv 
ro quiet splash isolcpus=31 intel_pstate=disable vt.handoff=7

Then, I've used sched_setaffinity() from <sched.h> to ensure the benchmark runs 
on CPU core id 31.
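
(For reference, a minimal sketch of that kind of pinning, assuming Linux
and glibc; not the actual benchmark harness:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single CPU core before benchmarking. */
static int
pin_to_cpu(int cpu)
{
    cpu_set_t   set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
    {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int
main(void)
{
    if (pin_to_cpu(31) != 0)
        return 1;
    /* ... run the benchmark loop here ... */
    return 0;
}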

I've also fixed the CPU frequency to 3.20 GHz:

$ sudo cpufreq-info -c 31
...
  current CPU frequency is 3.20 GHz (asserted by call to hardware).

I've benchmarked each (var1ndigits, var2ndigits) 10 times per commit, in random 
order.

I've benchmarked all commits after "SQL/JSON: Various improvements to SQL/JSON 
query function docs"
which is the parent commit to "Optimise numeric multiplication for short 
inputs.",
including the two patches.

I've benchmarked each commit affecting numeric.c, and each such commit's parent 
commit, for comparison.

The accum_change column shows the accumulative percentage change since the 
baseline commit (SQL/JSON: Various improvements).

There is at least single digit percentage noise in the measurements,
which is apparent since the rate fluctuates even between commits
for cases we know can't be affected by the change.
Still, even with this noise level, the improvements are very visible and 
consistent.

ndigits|rate| accum_change |summary
---++--+
 (1,1) |  1.702e+07 |  | SQL/JSON: Various improvements
 (1,1) |  2.201e+07 | +29.32 % | Optimise numeric multiplicatio
 (1,1) |  2.268e+07 | +33.30 % | Use diff's --strip-trailing-cr
 (1,1) |  2.228e+07 | +30.92 % | Improve the numeric width_buck
 (1,1) |  2.195e+07 | +29.01 % | Add missing pointer dereferenc
 (1,1) |  2.241e+07 | +31.68 % | Extend mul_var_short() to 5 an
 (1,1) |  2.130e+07 | +25.17 % | Optimise numeric multiplicatio
 (1,2) |  1.585e+07 |  | SQL/JSON: Various improvements
 (1,2) |  2.227e+07 | +40.49 % | Optimise numeric multiplicatio
 (1,2) |  2.140e+07 | +35.00 % | Use diff's --strip-trailing-cr
 (1,2) |  2.227e+07 | +40.51 % | Improve the numeric width_buck
 (1,2) |  2.183e+07 | +37.75 % | Add missing pointer dereferenc
 (1,2) |  2.241e+07 | +41.41 % | Extend mul_var_short() to 5 an
 (1,2) |  2.223e+07 | +40.26 % | Optimise numeric multiplicatio
 (1,3) |  1.554e+07 |  | SQL/JSON: Various improvements
 (1,3) |  2.155e+07 | +38.70 % | Optimise numeric multiplicatio
 (1,3) |  2.140e+07 | +37.74 % | Use diff's --strip-trailing-cr
 (1,3) |  2.139e+07 | +37.66 % | Improve the numeric width_buck
 (1,3) |  2.234e+07 | +43.76 % | Add missing pointer dereferenc
 (1,3) |  2.142e+07 | +37.83 % | Extend mul_var_short() to 5 an
 (1,3) |  2.066e+07 | +32.97 % | Optimise numeric multiplicatio
 (1,4) |  1.450e+07 |  | SQL/JSON: Various improvements
 (1,4) |  2.113e+07 | +45.70 % | Optimise numeric multiplicatio
 (1,4) |  2.121e+07 | +46.30 % | Use diff's --strip-trailing-cr
 (1,4) |  2.115e+07 | +45.85 % | Improve the numeric width_buck
 (1,4) |  2.166e+07 | +49.37 % | Add missing pointer dereferenc
 (1,4) |  2.053e+07 | +41.56 % | Extend mul_var_short() to 5 an
 (1,4) |  2.085e+07 | +43.82 % | Optimise numeric multiplicatio
 (1,8) |  1.440e+07 |  | SQL/JSON: Various improvements
 (1,8) |  1.963e+07 | +36.38 % | Optimise numeric multiplicatio
 (1,8) |  2.018e+07 | +40.19 % | Use diff's --strip-trailing-cr
 (1,8) |  2.045e+07 | +42.05 % | Improve the numeric width_buck
 (1,8) |  1.998e+07 | +38.79 % | Add missing pointer dereferenc
 (1,8) |  1.953e+07 | +35.68 % | Extend mul_var_short() to 5 an
 (1,8) |  1.992e+07 | +38.36 % | Optimise numeric multiplicatio
 (1,16)|  9.444e+06 |  | SQL/JSON: Various improvements
 (1,16)|  1.235e+07 | +30.75 % | Optimise numeric multiplicatio
 (1,16)|  1.232e+07 | +30.47 % | Use diff's --strip-trailing-cr
 (1,16)|  1.239e+07 | +31.18 % | Improve the numeric width_buck
 (1,16)|  1.222e+07 | +29.35 % | Add missing pointer dereferenc
 (1,16)|  1.220e+07 | +29.14 % | Extend mul_var_short() to 5 an
 (1,16)|  1.271e+07 | +34.54 % | Optimise numeric multiplicatio
 (1,32)|  5.790e+06 |  | SQL/JS

Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alexander Korotkov
On Sun, Jul 28, 2024 at 12:59 PM Alena Rybakina
 wrote:
> On 27.07.2024 13:56, Alexander Korotkov wrote:
> > On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
> >   wrote:
> >> To be honest, I have found a big problem in this patch - we try to perform 
> >> the transformation every time we examine a column:
> >>
> >> for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ...
> >>
> >> }
> >>
> >> I have fixed it and moved the transformation before going through the loop.
> > What makes you think there is a problem?
>
> To be honest, I was bothered by the fact that we need to go through
> expressions several times that obviously will not fit under other
> conditions.
> Just yesterday I thought that it would be worthwhile to create a list of
> candidates - expressions that did not fit because the column did not
> match the index (!match_index_to_operand(nconst_expr, indexcol, index)).

I admit that this area probably could use some optimization,
especially for the case of many clauses and many indexes.  But in the
scope of this patch, I think this is enough to not make things worse
in this area.

> Another problem that is related to the first one that the boolexpr could
> contain expressions referring to different operands, for example, both x
> and y. that is, we may have the problem that the optimal "ANY"
> expression may not be used, because the expression with x may come
> earlier and the loop may end earlier.
>
> alena@postgres=# create table b (x int, y int);
> CREATE TABLE
> alena@postgres=# insert into b select id, id from
> generate_series(1,1000) as id;
> INSERT 0 1000
> alena@postgres=# create index x_idx on b(x);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
> alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or
> x=6 or x = 7 or x=8 or x=9;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x =
> 8) OR (x = 9))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9 or y=1;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x =
> 9) OR (y = 1))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9;
> QUERY PLAN
> 
>   Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
> Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
> (2 rows)
>
> Furthermore expressions can be stored in a different order.
> For example, first comes "AND" expr, and then group of "OR" expr, which
> we can convert to "ANY" expr, but we won't do this due to the fact that
> we will exit the loop early, according to this condition:
>
> if (!IsA(sub_rinfo->clause, OpExpr))
> return NULL;
>
> or it may occur due to other conditions.
>
> alena@postgres=# create index x_y_idx on b(x,y);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
>
> alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4
> or x=5 or x=6 or x = 7 or x=8 or x=9;
>   QUERY PLAN
> -
>   Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
> Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR
> (x = 7) OR (x = 8) OR (x = 9))
> (2 rows)
>
> Because of these reasons, I tried to save this and that transformation
> together for each column and try to analyze for each expr separately
> which method would be optimal.

Yes, with v27 of the patch, optimization wouldn't work in these cases.
However, you are using quite a small table.  If you use a larger
table or disable sequential scans, there would be bitmap plans to
handle these queries.  So, v27 doesn't make the situation worse. It
just doesn't optimize all that it could potentially optimize and
that's OK.

I've written a separate 0002 patch to address this.  Now, before
generation of paths for bitmap OR, similar OR entries are grouped
together.  When considering a group of similar entries, they are
considered both together and one-by-one.  Ideally we could consider
more sophisticated grouping, but that seems fine for now.  You can
check how this patch handles the cases of above.

Also, 0002 addresses the issue of duplicated bitmap scan conditions in
different forms. During generate_bitmap_or_paths() we need to exclude
the considered condition for other clauses.  It couldn't simply be
filtered out at a later stage as usual, because it could reach the
index in another form.

> > Do you have a test ca

Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-07-28 Thread Anton A. Melnikov



On 19.06.2024 21:06, Peter Geoghegan wrote:

On Wed, Jun 19, 2024 at 1:39 PM Alvaro Herrera  wrote:

FWIW I don't think HEAP_XMAX_INVALID as purely a hint.
HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on
its own; but as far as I recall, the INVALID flags must persist once
set.


Seems we disagree on some pretty fundamental things in this area, then.


To resolve this situation, it seems necessary to agree on what
a "hint bit" exactly means and how to use it.

For example, in this way:

1) Definition. The "hint bit" if it is set represents presence of the property 
of some object (e.g. tuple).
The value of a hint bit can be derived again at any time. So it is acceptable 
for a hint
bit to be lost during some operations.

2) Purpose. (It has already been defined by Yura Sokolov in one of the previous 
letters.)
Some work (e.g. CPU + memory usage) must be done to check the property of some 
object.
Checking the hint bit, if it is set, saves this work.
So the purpose of the hint bit is optimization.

3) Use. By default, code that needs to check some property of an object
must first check the corresponding hint bit. If the hint is set, conclude that
the property is present. If the hint is not set, do the work to check the
property of the object and set the hint bit if that property is present.
Non-standard behavior is also allowed, where the hint bit is ignored and the
work of checking the property is performed unconditionally for some reason.
In that case the code must contain a comment with an explanation of the reason.
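
To make the intended usage pattern concrete, here is a generic sketch
(illustrative C only; all names are made up and this is not PostgreSQL
code):

#include <stdbool.h>
#include <stdint.h>

#define OBJ_PROPERTY_HINT   0x01        /* hypothetical hint flag */

typedef struct Object
{
    uint8_t     flags;
    int         value;                  /* stands in for real object state */
} Object;

/* stands in for the real, costly property check */
static bool
expensive_property_check(const Object *obj)
{
    return obj->value > 0;
}

static bool
property_is_present(Object *obj)
{
    if (obj->flags & OBJ_PROPERTY_HINT)     /* hint set: skip the work */
        return true;

    if (expensive_property_check(obj))      /* do the costly check */
    {
        obj->flags |= OBJ_PROPERTY_HINT;    /* cache it; may be lost later */
        return true;
    }
    return false;
}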

And maybe for clarity, explicitly say that some bit is a hint right in its 
definition?
For instance, use HEAP_XMIN_COMMITTED_HINT instead of HEAP_XMIN_COMMITTED.


Remarks and concerns are gratefully welcome.
 


With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Simplify create_merge_append_path a bit for clarity

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 1:28 PM Paul A Jungwirth
 wrote:
> Is there a reason you don't want to remove the required_outer
> parameter altogether? I guess because it is such a common pattern to
> pass it?

I think it's best to keep this parameter unchanged to maintain
consistency with other functions that create path nodes in pathnode.c.

> Do you think it is worth keeping this assertion?:
>
> -
> -/* All child paths must have same parameterization */
> -Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>
> I understand any failure would trigger one of the prior asserts
> instead, but it does communicate an extra requirement, and there is no
> cost.

I don't think it's a good idea to keep this Assert: with this change
it becomes redundant.

> But I'd be fine with committing this patch as-is.

I've pushed this patch.  Thanks for review.

Thanks
Richard




Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread shveta malik
On Fri, Jul 26, 2024 at 5:11 PM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 3:28 PM shveta malik  wrote:
> >
> > On Tue, Jul 23, 2024 at 10:35 AM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Jul 9, 2024 at 12:39 AM John H  wrote:
> > > >
> > > > > Out of curiosity, did you compare with 
> > > > > standby_slot_names_from_syncrep set to off
> > > > > and standby_slot_names not empty?
> > > >
> > > > I didn't think 'standby_slot_names' would impact TPS as much since
> > > > it's not grabbing the SyncRepLock but here's a quick test.
> > > > Writer with 5 synchronous replicas, 10 pg_recvlogical clients and
> > > > pgbench all running from the same server.
> > > >
> > > > Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5
> > > >
> > > > Result with: standby_slot_names =
> > > > 'replica_1,replica_2,replica_3,replica_4,replica_5'
> > > >
> > > > latency average = 5.600 ms
> > > > latency stddev = 2.854 ms
> > > > initial connection time = 5.503 ms
> > > > tps = 714.148263 (without initial connection time)
> > > >
> > > > Result with: standby_slot_names_from_syncrep = 'true',
> > > > synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > >
> > > > latency average = 5.740 ms
> > > > latency stddev = 2.543 ms
> > > > initial connection time = 4.093 ms
> > > > tps = 696.776249 (without initial connection time)
> > > >
> > > > Result with nothing set:
> > > >
> > > > latency average = 5.090 ms
> > > > latency stddev = 3.467 ms
> > > > initial connection time = 4.989 ms
> > > > tps = 785.665963 (without initial connection time)
> > > >
> > > > Again I think it's possible to improve the synchronous numbers if we
> > > > cache but I'll try that out in a bit.
> > > >
> > >
> > > Okay, so the tests done till now conclude that we won't get the
> > > benefit by using 'standby_slot_names_from_syncrep'. Now, if we
> > > increase the number of standby's in both lists and still keep ANY 3 in
> > > synchronous_standby_names then the results may vary. We should try to
> > > find out if there is a performance benefit with the use of
> > > synchronous_standby_names in the normal configurations like the one
> > > you used in the above tests to prove the value of this patch.
> > >
> >
> > I didn't fully understand the parameters mentioned above, specifically
> > what 'latency stddev' and 'latency average' represent.. But shouldn't
> > we see the benefit/value of this patch by having a setup where a
> > particular standby is slow in sending the response back to primary
> > (could be due to network lag or other reasons) and then measuring the
> > latency in receiving changes on failover-enabled logical subscribers?
> > We can perform this test with both of the below settings and say make
> > D and E slow in sending responses:
> > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
> >
>
> Yes, I also expect the patch should perform better in such a scenario
> but it is better to test it. Also, irrespective of that, we should
> investigate why the reported case is slower for
> synchronous_standby_names and see if we can improve it.

+1

> BTW, you for 2), I think you wanted to say synchronized_standby_slots,
> not standby_slot_names. We have recently changed the GUC name.

yes, sorry, synchronized_standby_slots it is.

thanks
Shveta




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> I've worked a bit more on the comments and commit message, and I plan
> to push the attached soon, barring any objections or comments.

Pushed.

Thanks
Richard




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Ashutosh Bapat
Thanks a lot Richard.

On Mon, Jul 29, 2024 at 8:56 AM Richard Guo  wrote:
>
> On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> > I've worked a bit more on the comments and commit message, and I plan
> > to push the attached soon, barring any objections or comments.
>
> Pushed.
>
> Thanks
> Richard



-- 
Best Wishes,
Ashutosh Bapat




RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for pushing and analyzing the failure!

> The regression.diffs shows that pgfdw_conn_check returned 0 even though
> pgfdw_conn_checkable() returned true. This can happen if the "revents" from
> poll() indicates something other than POLLRDHUP. I think that "revents" could
> indicate POLLHUP, POLLERR, or POLLNVAL. Therefore, IMO pgfdw_conn_check()
> should be updated as follows. I will test this change.
> 
> -   return (input_fd.revents & POLLRDHUP) ? 1 : 0;
> +   return (input_fd.revents &
> +           (POLLRDHUP | POLLHUP | POLLERR | POLLNVAL)) ? 1 : 0;

I think you are right.
According to the man page, the socket is invalid or disconnected if revents
has any of these bits set, so such cases should also be regarded as *failure*.
After the fix, the animal reported OK. Great work!
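
For reference, here is a minimal, self-contained sketch of such a liveness
probe using poll(); the flag set matches the quoted fix, but the function
shape, name, and return convention are assumptions, not the committed code:

    #define _GNU_SOURCE             /* POLLRDHUP is a Linux/GNU extension */
    #include <poll.h>

    /* Sketch: 1 = socket looks closed/invalid, 0 = looks alive, -1 = poll error */
    static int
    conn_check_sketch(int sock)
    {
        struct pollfd input_fd;

        input_fd.fd = sock;
        input_fd.events = POLLRDHUP;    /* ask to be told if the peer hung up */
        input_fd.revents = 0;

        if (poll(&input_fd, 1, 0) < 0)  /* timeout 0: just probe, don't wait */
            return -1;

        /* POLLHUP, POLLERR, and POLLNVAL are reported even if not requested */
        return (input_fd.revents &
                (POLLRDHUP | POLLHUP | POLLERR | POLLNVAL)) ? 1 : 0;
    }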

Best regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> > IIUC, the patch which adds the user_name attribute to get_connection() can
> > be discussed at a later stage, is that right?
> 
> No, let's work on the patch at this stage :)

OK, here is a rebased patch.

- Changed the name of the new API from `GetUserMappingFromOid` to
  `GetUserMappingByOid` to keep the name consistent with the others
  (a rough sketch of such a function is below).
- Comments and docs were updated.
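
For context, a rough sketch of what such a lookup could look like, modeled on
the existing GetUserMapping() in src/backend/foreign/foreign.c; the real patch
may differ in details:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/reloptions.h"
    #include "catalog/pg_user_mapping.h"
    #include "foreign/foreign.h"
    #include "utils/syscache.h"

    /* Sketch only: look up a user mapping by its pg_user_mapping OID. */
    UserMapping *
    GetUserMappingByOid(Oid umid)
    {
        HeapTuple    tp;
        Datum        datum;
        bool         isnull;
        UserMapping *um;

        tp = SearchSysCache1(USERMAPPINGOID, ObjectIdGetDatum(umid));
        if (!HeapTupleIsValid(tp))
            elog(ERROR, "cache lookup failed for user mapping %u", umid);

        um = (UserMapping *) palloc(sizeof(UserMapping));
        um->umid = umid;
        um->userid = ((Form_pg_user_mapping) GETSTRUCT(tp))->umuser;
        um->serverid = ((Form_pg_user_mapping) GETSTRUCT(tp))->umserver;

        /* Extract the umoptions field, if any */
        datum = SysCacheGetAttr(USERMAPPINGOID, tp,
                                Anum_pg_user_mapping_umoptions,
                                &isnull);
        um->options = isnull ? NIL : untransformRelOptions(datum);

        ReleaseSysCache(tp);

        return um;
    }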

Best regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch
Description:  0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch


dikkop failed the pg_combinebackupCheck/006_db_file_copy.pl test

2024-07-28 Thread Alexander Lakhin

Hello Tomas,

Please take a look at a recent failure on dikkop [1]. The
regress_log_006_db_file_copy file from that run shows:
[02:08:57.929](0.014s) # initializing database system by copying initdb template
...
[02:09:22.511](24.583s) ok 1 - full backup
...
[02:10:35.758](73.247s) not ok 2 - incremental backup

006_db_file_copy_primary.log contains:
2024-07-28 02:09:29.441 UTC [67785:12] 006_db_file_copy.pl LOG: received replication command: START_REPLICATION SLOT 
"pg_basebackup_67785" 0/400 TIMELINE 1
2024-07-28 02:09:29.441 UTC [67785:13] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1

2024-07-28 02:09:29.441 UTC [67785:14] 006_db_file_copy.pl LOG: acquired physical 
replication slot "pg_basebackup_67785"
2024-07-28 02:09:29.441 UTC [67785:15] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1

2024-07-28 02:10:29.487 UTC [67785:16] 006_db_file_copy.pl LOG: terminating 
walsender process due to replication timeout
2024-07-28 02:10:29.487 UTC [67785:17] 006_db_file_copy.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_67785" 
0/400 TIMELINE 1


It looks like this incremental backup operation ran more slowly than
usual (it took more than 60 seconds and apparently was interrupted due to
wal_sender_timeout). But looking at regress_log_006_db_file_copy from the
6 previous (successful) test runs, we can see:
[14:22:16.841](43.215s) ok 2 - incremental backup
[02:14:42.888](34.595s) ok 2 - incremental backup
[17:51:16.152](43.708s) ok 2 - incremental backup
[04:07:16.757](31.087s) ok 2 - incremental backup
[12:15:01.256](49.432s) ok 2 - incremental backup
[01:06:02.482](52.364s) ok 2 - incremental backup

Thus reaching 60s (e.g., due to some background activity) on this animal
seems pretty possible. So maybe it would make sense to increase
wal_sender_timeout for it, say, to 120s?

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-07-27%2023%3A22%3A57

Best regards,
Alexander




Re: Conflict detection and logging in logical replication

2024-07-28 Thread Amit Kapila
On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
>
> On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> >
>
> > One more thing we need to consider is whether we should LOG or ERROR
> > for update/delete_differ conflicts. If we LOG as the patch is doing
> > then we are intentionally overwriting the row when the user may not
> > expect it. OTOH, without a patch anyway we are overwriting, so there
> > is an argument that logging by default is what the user will expect.
> > What do you think?
>
> I was under the impression that in this patch we do not intend to
> change behaviour of HEAD and thus only LOG the conflict wherever
> possible.
>

Earlier, I thought it was fine to keep LOGging the conflict where
there is no chance of a wrong data update, but for cases where we can do
something wrong, it is better to ERROR out. For example, for an
update_differ case where the apply worker can overwrite the data from
a different origin, it is better to ERROR out. I thought this case was
comparable to an existing ERROR case like a unique constraint
violation. But I see your point as well that one might expect the
existing behavior, where we silently overwrite the different-origin
data. One idea to address this concern is to suggest that users set
the detect_conflict subscription option to off, but I guess that would
make this feature unusable for users who don't want to ERROR out for
different-origin update cases.

> And in the next patch, the resolver, we take the action based on the
> user's input of error/skip/resolve. I still think it is better to
> stick to the said behaviour. Only if we commit the resolver patch in
> the same version as the detection patch can we take the risk of
> changing this default behaviour to 'always error'. Otherwise users
> will be left with conflicts arising but no automatic way to resolve
> them. But for users who really want their application to error out,
> we can provide an additional GUC in this patch itself which changes
> the behaviour to 'always ERROR on conflict'.
>

I don't see a need for a GUC here; even if we want one, we can have a
subscription option such as conflict_log_level. But users may want to
either LOG or ERROR based on the conflict type. For example, there won't
be any data inconsistency in two-node replication for the delete_missing
case, as one node is trying to delete already-deleted data, so LOGging
such a case should be sufficient, whereas update_differ could lead to
different data on the two nodes, so the user may want to ERROR out in
such a case.

We can keep the current behavior as the default for the purpose of
conflict detection, but have a separate patch to decide whether to
LOG/ERROR based on conflict_type.

-- 
With Regards,
Amit Kapila.




Re: Flush pgstats file during checkpoints

2024-07-28 Thread Bertrand Drouvot
Hi,

On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:
> On Mon, Jul 22, 2024 at 07:01:41AM +, Bertrand Drouvot wrote:
> > 3 ===
> > 
> > +   /*
> > +    * Read the redo LSN stored in the file.
> > +    */
> > +   if (!read_chunk_s(fpin, &file_redo) ||
> > +       file_redo != redo)
> > +       goto error;
> > 
> > I wonder if it would make sense to have dedicated error messages for
> > "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
> > ease to diagnose as to why the stat file is discarded.
> 
> Yep.  This has been itching me quite a bit, and that's a bit more than
> just the format ID or the redo LSN: it relates to all the read_chunk()
> callers.  I've taken a shot at this with patch 0001, implemented on
> top of the rest.

Thanks! The attached 0001 is
v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch,
so I guess you did not attach the right one.

> Attaching a new v4 series, with all these comments addressed.

Thanks!

Looking at 0002:

1 ===

   if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+      elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+           kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?
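
As a sketch of that nit, reusing the variables from the quoted hunk (the cast
to size_t is only there to keep the format string portable):

    if (!read_chunk(fpin, ptr, info->shared_data_len))
    {
        elog(WARNING, "could not read data of stats kind %d for entry of type %c with size %zu",
             kind, t, (size_t) info->shared_data_len);
        goto error;
    }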

2 ===

   if (!read_chunk_s(fpin, &name))
+  {
+      elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+           kind, t);
       goto error;
+  }
   if (!pgstat_is_kind_valid(kind))
+  {
+      elog(WARNING, "invalid stats kind %d for entry of type %c",
+           kind, t);
       goto error;
+  }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?

  if (!read_chunk_s(fpin, &kind))
+ {
+  elog(WARNING, "could not read stats kind for entry of type %c", t);
   goto error;
+ }
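
In other words, something like this ordering, sketched with the same messages
as above, so a bogus kind is rejected before we try to read the name:

    if (!read_chunk_s(fpin, &kind))
    {
        elog(WARNING, "could not read stats kind for entry of type %c", t);
        goto error;
    }
    if (!pgstat_is_kind_valid(kind))
    {
        elog(WARNING, "invalid stats kind %d for entry of type %c",
             kind, t);
        goto error;
    }
    if (!read_chunk_s(fpin, &name))
    {
        elog(WARNING, "could not read name of stats kind %d for entry of type %c",
             kind, t);
        goto error;
    }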

Looking at 0003: LGTM

Looking at 0004: LGTM

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread Bertrand Drouvot
Hi John,

On Thu, Jul 18, 2024 at 02:22:08PM -0700, John H wrote:
> Hi Bertrand,
> 
> > 1 ===
> > ...
> > That's worth additional comments in the code.
> 
> There's already this comment about caching the value; not sure
> if you prefer something more?
> 
> /* Cache values to reduce contention on lock */

Yeah, at the same place as the static lsn[] declaration, something like:

static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE]; /* cached LSNs */

but that may just be a matter of taste.

> > 3 ===
> > ...
> > NUM_SYNC_REP_WAIT_MODE is small but as the goal is to keep the lock time as
> > short as possible I wonder if it wouldn't be better to use memcpy() here
> > instead of this for loop.
> >
> 
> It results in a "-Wdiscarded-qualifiers" warning, which is safe given we take
> the lock, but adds noise.
> What do you think?
> 
> "slot.c:2756:46: warning: passing argument 2 of ‘memcpy’ discards
> ‘volatile’ qualifier from pointer target type
> [-Wdiscarded-qualifiers]"

Right, we may want to cast it then, but given that the for loop is "small", I
think it's also fine to keep the for loop.
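
For the record, here are the two variants being discussed, as a fragment
sketched under the patch's assumptions (walsndctl points to WalSndCtl through
a volatile-qualified pointer, and SyncRepLock is held while copying):

    static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE];    /* cached LSNs */

    /* Variant 1: copy element by element under the lock */
    for (int i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
        lsn[i] = walsndctl->lsn[i];

    /*
     * Variant 2: memcpy(); the cast drops volatile, which is what
     * -Wdiscarded-qualifiers complains about, and is safe while holding the lock.
     */
    memcpy(lsn, (XLogRecPtr *) walsndctl->lsn, sizeof(lsn));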

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add mention of execution time memory for enable_partitionwise_* GUCs

2024-07-28 Thread David Rowley
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
 wrote:
> I am fine if we want to mention that the executor may consume a large
> amount of memory when these GUCs are turned ON. Users may decide to
> turn those OFF if they cannot afford to spend that much memory during
> execution. But I don't like tying execution-time consumption to the
> default being OFF.

Would the attached address your concern about the reasons for defaulting to off?

David


partitionwise_docs.patch
Description: Binary data


Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 08:04, Peter Smith  wrote:
>
> Here are some review comments for latest patch v20240725-0002
>
> ==
> doc/src/sgml/ref/create_publication.sgml
>
> nitpick - tweak to the description of the example.
>
> ==
> src/backend/parser/gram.y
>
> preprocess_pub_all_objtype_list:
> nitpick - typo "allbjects_list"
> nitpick - reword function header
> nitpick - /alltables/all_tables/
> nitpick - /allsequences/all_sequences/
> nitpick - I think code is safe as-is because makeNode internally does
> palloc0, but OTOH adding Assert would be nicer just to remove any
> doubts.
>
> ==
> src/bin/psql/describe.c
>
> 1.
> + /* Print any publications */
> + if (pset.sversion >= 18)
> + {
> + int tuples = 0;
>
> No need to assign value 0 here, because this will be unconditionally
> assigned before use anyway.
>
> 
>
> 2. describePublications
>
>   has_pubviaroot = (pset.sversion >= 13);
> + has_pubsequence = (pset.sversion >= 18000);
>
> That's a bug! Should be 18, not 18000.
>
> ==
>
> And, please see the attached diffs patch, which implements the
> nitpicks mentioned above.

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh




Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 11:46, Peter Smith  wrote:
>
> Hi Vignesh,
>
> There are still pending changes from my previous review of the
> 0720-0003 patch [1], but here are some new review comments for your
> latest patch v20240525-0003.
> 2b.
> Is it better to name these returned by-ref ptrs like 'ret_log_cnt',
> and 'ret_lsn' to emphasise they are output variables? YMMV.

I felt this is OK as we have mentioned it in the function header too.

> ==
> src/test/subscription/t/034_sequences.pl
>
> 4.
> Q. Should we be suspicious that log_cnt changes from '32' to '31', or
> is there a valid explanation? It smells like some calculation is
> off-by-one, but without debugging I can't tell if it is right or
> wrong.

It works like this: on every 33rd nextval call, log_cnt becomes 0. So
after 33 * 6 (= 198) calls log_cnt will be 0, then for call 199 log_cnt
will be 32 and for call 200 it will be 31. This pattern repeats, so this
is OK.

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh