Re: improve ssl error code, 2147483650
On 25.07.24 11:36, Daniel Gustafsson wrote:

On 24 Jul 2024, at 15:32, Peter Eisentraut wrote:

On 25.06.24 16:21, Tom Lane wrote:

Peter Eisentraut writes:

On 21.06.24 16:53, Tom Lane wrote:
Most of libpq gets at strerror_r via SOCK_STRERROR for Windows portability. Is that relevant here?

Looking inside the OpenSSL code, it makes no effort to translate between winsock error codes and standard error codes, so I don't think our workaround/replacement code needs to do that either.

Fair enough. Btw., our source code comments say something like "ERR_reason_error_string randomly refuses to map system errno values."

The reason it doesn't is exactly that it can't do it while maintaining thread-safety.

Ah. Do you want to improve that comment while you're at it?

Here is a patch that fixes the strerror() call and updates the comments a bit.

LGTM. This ought to be backpatched like the original fix; ideally for the next minor releases in about two weeks.

Agreed. done
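For readers following along, the fallback pattern being discussed looks roughly like the sketch below. This is not the committed patch; it assumes OpenSSL 3.x (for the ERR_SYSTEM_ERROR() macro) and the POSIX int-returning strerror_r(). The point is that when ERR_reason_error_string() has no string for a system-errno-style code, we format the errno ourselves into a caller-supplied buffer, which keeps the whole path thread-safe.

/*
 * Minimal sketch of the fallback discussed above -- not PostgreSQL's actual
 * SSLerrmessage().  Assumes OpenSSL 3.x and POSIX strerror_r().
 */
#include <stdio.h>
#include <string.h>
#include <openssl/err.h>

static const char *
ssl_error_string(unsigned long ecode, char *buf, size_t buflen)
{
    const char *reason = ERR_reason_error_string(ecode);

    if (reason != NULL)
        return reason;              /* OpenSSL has a string for this code */

    /*
     * ERR_reason_error_string() deliberately does not map system errno
     * values, since it has no thread-safe buffer to format them into.
     * Do it ourselves with strerror_r() and the caller's buffer.
     */
    if (ERR_SYSTEM_ERROR(ecode) &&
        strerror_r((int) ERR_GET_REASON(ecode), buf, buflen) == 0)
        return buf;

    snprintf(buf, buflen, "SSL error code %lu", ecode);
    return buf;
}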
Re: POC, WIP: OR-clause support for indexes
On 27.07.2024 13:56, Alexander Korotkov wrote:

On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina wrote:

To be honest, I have found a big problem in this patch - we try to perform the transformation every time we examine a column:

for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
    ...
}

I have fixed it and moved the transformation before going through the loop.

What makes you think there is a problem?

To be honest, I was bothered by the fact that we need to go through expressions several times that obviously will not fit under other conditions. Just yesterday I thought that it would be worthwhile to create a list of candidates - expressions that did not fit because the column did not match the index (!match_index_to_operand(nconst_expr, indexcol, index)).

Another problem, related to the first one, is that the boolexpr could contain expressions referring to different operands, for example, both x and y. That is, we may have the problem that the optimal "ANY" expression may not be used, because the expression with x may come earlier and the loop may end earlier.

alena@postgres=# create table b (x int, y int);
CREATE TABLE
alena@postgres=# insert into b select id, id from generate_series(1,1000) as id;
INSERT 0 1000
alena@postgres=# create index x_idx on b(x);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                    QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)

alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9 or y=1;
                                    QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9) OR (y = 1))
(2 rows)

alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
   Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
(2 rows)

Furthermore, expressions can be stored in a different order. For example, first comes an "AND" expr, and then a group of "OR" exprs which we could convert to an "ANY" expr, but we won't do this due to the fact that we will exit the loop early, according to this condition:

if (!IsA(sub_rinfo->clause, OpExpr))
    return NULL;

or it may occur due to other conditions.

alena@postgres=# create index x_y_idx on b(x,y);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
   Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)

Because of these reasons, I tried to save both transformations together for each column and to analyze for each expr separately which method would be optimal.

Do you have a test case illustrating a slow planning time?

No, I didn't have time to measure it, sorry for that. I'll do it.

When v27 performs the transformation for a particular column, it just stops at the first unmatched OR entry. So, match_orclause_to_indexcol() examines just the first OR entry for all the columns except at most one. So, the check match_orclause_to_indexcol() does is not much slower than what the other match_*_to_indexcol() functions do. I actually think this could help performance in many cases, not hurt it.
At least we get rid of O(n^2) complexity over the number of OR entries, which could be very many.

I agree with you that there is an overhead and your patch fixes this problem, but the optimizer needs to have a good ordering of expressions for application. I think we can try to move the transformation to another place where there is already a loop pass, and also save both options, the "OR" expr and the "ANY" expr, in one place (through BoolExpr) (like the find_duplicate_ors function does) and teach the optimizer to determine which option is better, for example, like match_orclause_to_indexcol() does now. What do you think about it?

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Protocol question regarding Portal vs Cursor
On Sat, 27 Jul 2024 at 19:06, Tatsuo Ishii wrote:
>> Yes, sorry, I should have said one can not create a with hold portal
>> using the BIND command
>
> Ok.
>
> It would be possible to add a new parameter to the BIND command to
> create such a portal. But it needs some changes to the existing
> protocol definition and requires a protocol version bump, which is a
> major pain.

I'm trying to add WITH HOLD to the JDBC driver, and currently I would have to:

1) rewrite the query
2) issue a new query ... declare ... and bind variables to that statement
3) execute fetch

vs

1) bind variables to the statement
2) execute fetch

The second can be done much lower in the code. However, as you mentioned, this would require a new protocol version, which is unlikely to happen.

Dave
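To make option (1) concrete, here is a rough libpq sketch (not JDBC, and not the driver's actual code; the table, cursor name, and query are made up) of the rewrite-based approach: wrap the user's statement in DECLARE ... WITH HOLD and then FETCH from the resulting cursor, since the Bind message itself has no way to ask for a holdable portal.

/* Illustrative sketch only; error checking omitted for brevity. */
#include <stdio.h>
#include <libpq-fe.h>

static void
holdable_fetch(PGconn *conn)
{
    PGresult *res;

    PQclear(PQexec(conn, "BEGIN"));

    /* Option (1): rewrite the user's query into a holdable cursor. */
    PQclear(PQexec(conn,
                   "DECLARE held_cur CURSOR WITH HOLD FOR "
                   "SELECT id, payload FROM big_table WHERE id > 42"));

    /* The cursor survives the commit because it was declared WITH HOLD. */
    PQclear(PQexec(conn, "COMMIT"));

    /* Fetch rows in batches, as the driver would on behalf of the user. */
    res = PQexec(conn, "FETCH 100 FROM held_cur");
    printf("fetched %d rows\n", PQntuples(res));
    PQclear(res);

    PQclear(PQexec(conn, "CLOSE held_cur"));
}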
Re: Fix overflow in pg_size_pretty
On Sun, 28 Jul 2024 at 16:30, Joseph Koshakow wrote: > Attached is an updated patch with your approach. I removed the 0 from > the negative case because I think it was unnecessary, but happy to add > it back in if I missed something. I made a few adjustments and pushed this. I did keep the 0 - part as some compilers don't seem to like not having the 0. e.g MSVC gives: ../src/backend/utils/adt/dbsize.c(578): warning C4146: unary minus operator applied to unsigned type, result still unsigned I thought a bit about if it was really worth keeping the regression test or not and in the end decided it was likely worthwhile keeping it, so I expanded it slightly to cover both PG_INT64_MIN and PG_INT64_MAX values. It looks slightly less like we're earmarking the fact that there was a bug that way, and also seems to be of some additional value. PG15 did see quite a significant rewrite of the pg_size_pretty code. The bug does still exist in PG14 and earlier, but on looking at what it would take to fix it there I got a bit unexcited at the risk to reward ratio of adjusting that code and just left it alone. I've backpatched only as far as PG15. I'm sure someone else will feel I should have done something else there, but that's the judgement call I made. Thanks for the patch. David
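As a side note for readers, the construct being discussed looks roughly like the sketch below (a sketch, not the committed dbsize.c code): taking the absolute value of an int64 without overflowing at PG_INT64_MIN by doing the negation in unsigned arithmetic, written as "0 - value" so MSVC does not emit warning C4146.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Sketch: absolute value of an int64 as a uint64, safe at INT64_MIN. */
static uint64_t
abs_uint64(int64_t value)
{
    if (value >= 0)
        return (uint64_t) value;

    /*
     * Negating INT64_MIN as a signed value is undefined behaviour, so do
     * the negation on the unsigned representation.  Writing it as
     * "0 - ..." rather than "-..." avoids MSVC warning C4146.
     */
    return 0 - (uint64_t) value;
}

int
main(void)
{
    printf("%" PRIu64 "\n", abs_uint64(INT64_MIN));  /* 9223372036854775808 */
    return 0;
}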
Re: Pluggable cumulative statistics
On Sat, Jul 27, 2024 at 03:49:42PM +0200, Dmitry Dolgov wrote:
> Agree, looks good. I've tried to quickly sketch out such a fixed
> statistic for another extension, everything was fine and pretty
> straightforward.

That's my hope. Thanks a lot for the feedback.

> One question, why don't you use pgstat_get_custom_shmem_data &
> pgstat_get_custom_snapshot_data outside of the injection points code?
> There seems to be a couple of possible places in pgstats itself.

Because these two helper routines are only able to fetch the fixed data area in the snapshot and the control shmem structures for the custom kinds, not the in-core ones. We could, but the current code is OK as well. My point was just to ease the pluggability effort.

I would like to apply this new infrastructure stuff and move on to the problems related to the scalability of pg_stat_statements. So, are there any objections to all that?
--
Michael
Re: pg_attribute.atttypmod for interval type
On 07/27/24 00:32, Tom Lane wrote: > Interval typmods include a fractional-seconds-precision field as well > as a bitmask indicating the allowed interval fields (per the SQL > standard's weird syntax such as INTERVAL DAY TO SECOND). Looking at > the source code for intervaltypmodout() might be helpful: > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/timestamp.c;h=69fe7860ede062fc8be42e7545b35e69c3e068c4;hb=HEAD#l1136 Also, for this kind of question, an overview of a type modifier's contents can be found in the javadoc for the WIP PL/Java 1.7, which is intended to model such things accurately.[0] The model is aimed at the logical level, that is, to represent what information is present in the typmod, the precise semantics, what combinations are allowed/disallowed, and so on, but not the way PostgreSQL physically packs the bits. So, for this case, what you would find there is essentially what Tom already said, about what's logically present; it doesn't save you the effort of looking in the PostgreSQL source if you want to independently implement unpacking the bits. For possible future typmod questions, it may serve as a quick way to get that kind of logical-level description at moments when Tom is away from the keyboard. Regards, -Chap [0] https://tada.github.io/pljava/preview1.7/pljava-api/apidocs/org.postgresql.pljava/org/postgresql/pljava/adt/Timespan.Interval.Modifier.html I just noticed a nit in that javadoc: it says the field combination must be "one of the named constants in this interface" but you don't find them in the Interval.Modifier interface; they're in the containing interface Interval itself.
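If you do want to unpack the bits yourself, the decoding looks roughly like the sketch below. The masks and shifts follow the INTERVAL_PRECISION/INTERVAL_RANGE macros in PostgreSQL's utils/timestamp.h as I understand them (precision in the low 16 bits, field mask in the next 15 bits, -1 meaning unconstrained); verify against the headers of your server version before relying on it.

/* Rough sketch of decoding pg_attribute.atttypmod for interval columns. */
#include <stdio.h>
#include <stdint.h>

#define INTERVAL_PRECISION_MASK 0xFFFF   /* 0xFFFF means "unspecified" */
#define INTERVAL_RANGE_MASK     0x7FFF   /* 0x7FFF means "all fields" */

static void
decode_interval_typmod(int32_t typmod)
{
    if (typmod < 0)
    {
        printf("unconstrained interval\n");
        return;
    }

    /* low 16 bits: fractional-seconds precision */
    int precision = typmod & INTERVAL_PRECISION_MASK;
    /* next 15 bits: bitmask of allowed fields (YEAR, MONTH, DAY, ...) */
    int range = (typmod >> 16) & INTERVAL_RANGE_MASK;

    printf("precision=%d field-mask=0x%04x\n", precision, range);
}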
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman wrote: > > On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada wrote: > > > > + CREATE TABLE ${table1}(col1 int) > > + WITH (autovacuum_enabled=false, fillfactor=10); > > + INSERT INTO $table1 VALUES(7); > > + INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3; > > + CREATE INDEX on ${table1}(col1); > > + UPDATE $table1 SET col1 = 3 WHERE col1 = 0; > > + INSERT INTO $table1 VALUES(7); > > > > These queries make sense to me; these make the radix tree wide and use > > more nodes, instead of fattening lead nodes (i.e. the offset bitmap). > > The $table1 has 18182 blocks and the statistics of radix tree shows: > > > > max_val = 65535 > > num_keys = 18182 > > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182 > > > > Which means that the height of the tree is 2 and we use the maximum > > size node for all nodes except for 1 node. > > Do you have some kind of tool that prints this out for you? That would > be really handy. You can add '#define RT_DEBUG' for radix tree used in TidStore and then call RT_STATS (e.g., local_ts_stats()). > > > I don't have any great idea to substantially reduce the total number > > of tuples in the $table1. Probably we can use DELETE instead of UPDATE > > to make garbage tuples (although I'm not sure it's okay for this > > test). Which reduces the amount of WAL records from 11MB to 4MB and > > would reduce the time to catch up. But I'm not sure how much it would > > help. There might be ideas to trigger a two-round index vacuum with > > fewer tuples but if the tests are too optimized for the current > > TidStore, we will have to re-adjust them if the TidStore changes in > > the future. So I think it's better and reliable to allow > > maintenance_work_mem to be a lower value or use injection points > > somehow. > > I think we can make improvements in overall time on master and 17 with > the examples John provided later in the thread. However, I realized > you are right about using a DELETE instead of an UPDATE. At some point > in my development, I needed the UPDATE to satisfy some other aspect of > the test. But that is no longer true. A DELETE works just as well as > an UPDATE WRT the dead items and, as you point out, much less WAL is > created and replay is much faster. > > I also realized I forgot to add 043_vacuum_horizon_floor.pl to > src/test/recovery/meson.build in 16. I will post a patch here this > weekend which changes the UPDATE to a DELETE in 14-16 (sped up the > test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl > to src/test/recovery/meson.build in 16. I'll plan to push it on Monday > to save myself any weekend buildfarm embarrassment. > > As for 17 and master, I'm going to try out John's examples and see if > it seems like it will be fast enough to commit to 17/master without > lowering the maintenance_work_mem lower bound. +1. Thanks. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
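To spell out Sawada-san's suggestion a little (a sketch only, not actual tidstore.c contents; the template parameter list is abbreviated and assumes the file already uses the "local_ts" prefix): defining RT_DEBUG where the radix tree template is instantiated makes the generated local_ts_stats() function available, which prints the node breakdown quoted above.

/* Debugging-only sketch, based on the advice above. */
#define RT_DEBUG                 /* enables the generated *_stats() function */
#define RT_PREFIX local_ts
/* ... the remaining RT_* parameters exactly as tidstore.c already defines them ... */
#include "lib/radixtree.h"

local_ts_stats() can then be called on the store's tree at the point of interest during vacuum to get the height/node-count report shown earlier.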
Re: Make COPY format extendable: Extract COPY TO format implementations
Hi Sutou, On Wed, Jul 24, 2024 at 4:31 PM Sutou Kouhei wrote: > > Hi, > > In <9172d4eb-6de0-4c6d-beab-8210b7a22...@enterprisedb.com> > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Mon, 22 Jul 2024 14:36:40 +0200, > Tomas Vondra wrote: > > > Thanks for the summary/responses. I still think it'd be better to post a > > summary as a separate message, not as yet another post responding to > > someone else. If I was reading the thread, I would not have noticed this > > is meant to be a summary. I'd even consider putting a "THREAD SUMMARY" > > title on the first line, or something like that. Up to you, of course. > > It makes sense. I'll do it as a separated e-mail. > > > My suggestions would be to maintain this as a series of patches, making > > incremental changes, with the "more complex" or "more experimental" > > parts larger in the series. For example, I can imagine doing this: > > > > 0001 - minimal version of the patch (e.g. current v17) > > 0002 - switch existing formats to the new interface > > 0003 - extend the interface to add bits needed for columnar formats > > 0004 - add DML to create/alter/drop custom implementations > > 0005 - minimal patch with extension adding support for Arrow > > > > Or something like that. The idea is that we still have a coherent story > > of what we're trying to do, and can discuss the incremental changes > > (easier than looking at a large patch). It's even possible to commit > > earlier parts before the later parts are quite cleanup up for commit. > > And some changes changes may not be even meant for commit (e.g. the > > extension) but as guidance / validation for the earlier parts. > > OK. I attach the v18 patch set: > > 0001: add a basic feature (Copy{From,To}Routine) > (same as the v17 but it's based on the current master) > 0002: use Copy{From,To}Rountine for the existing formats > (this may not be committed because there is a > profiling related concern) > 0003: add support for specifying custom format by "COPY > ... WITH (format 'my-format')" > (this also has a test) > 0004: export Copy{From,To}StateData > (but this isn't enough to implement custom COPY > FROM/TO handlers as an extension) > 0005: add opaque member to Copy{From,To}StateData and export > some functions to read the next data and flush the buffer > (we can implement a PoC Apache Arrow COPY FROM/TO > handler as an extension with this) > > https://github.com/kou/pg-copy-arrow is a PoC Apache Arrow > COPY FROM/TO handler as an extension. > > > Notes: > > * 0002: We use "static inline" and "constant argument" for > optimization. > * 0002: This hides NextCopyFromRawFields() in a public > header because it's not used in PostgreSQL and we want to > use "static inline" for it. If it's a problem, we can keep > it and create an internal function for "static inline". > * 0003: We use "CREATE FUNCTION" to register a custom COPY > FROM/TO handler. It's the same approach as tablesample. > * 0004 and 0005: We can mix them but this patch set split > them for easy to review. 0004 just moves the existing > codes. It doesn't change the existing codes. > * PoC: I provide it as a separated repository instead of a > patch because an extension exists as a separated project > in general. If it's a problem, I can provide it as a patch > for contrib/. > * This patch set still has minimal Copy{From,To}Routine. For > example, custom COPY FROM/TO handlers can't process their > own options with this patch set. 
We may add more callbacks > to Copy{From,To}Routine later based on real world use-cases. > > > Unfortunately, there's not much information about what exactly the tests > > did, context (hardware, ...). So I don't know, really. But if you share > > enough information on how to reproduce this, I'm willing to take a look > > and investigate. > > Thanks. Here is related information based on the past > e-mails from Michael: > > * Use -O2 for optimization build flag > ("meson setup --buildtype=release" may be used) > * Use tmpfs for PGDATA > * Disable fsync > * Run on scissors (what is "scissors" in this context...?) > > https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c > * Unlogged table may be used > * Use a table that has 30 integer columns (*1) > * Use 5M rows (*2) > * Use '/dev/null' for COPY TO (*3) > * Use blackhole_am for COPY FROM (*4) > https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am > * perf is used but used options are unknown (sorry) > > (*1) This SQL may be used to create the table: > > CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int) > RETURNS VOID AS > $func$ > DECLARE > query text; > BEGIN > query := 'CREATE UNLOGGED TABLE ' || tabname || ' ('; > FOR i IN 1..num_cols LOOP > query := query || 'a_' || i::text || ' int default 1'; > IF i != num_cols THEN > query := query
Re: [PATCH] TODO “Allow LISTEN on patterns”
Hi Emanuel, I did a review on the new patch version and I observed that the identifier > passed to the LISTEN command is handled differently between outer and > inner > levels. > We have the following grammar: notify_channel: ColId { $$ = $1; } | notify_channel '.' ColId { $$ = psprintf("%s.%s", $1, $3); } And ColId is truncated in core scanner: ident = downcase_truncate_identifier(yytext, yyleng, true); So each level is truncated independently. For this reason we observe the behaviour which you described above. Another observation, probably not strictly related to this patch itself but > the async-notify tests, is that there is no test for > "payload too long". Probably there is a reason on why isn't in the specs? > I believe that simply because not all functionality is covered with tests. But I have noticed a very interesting test "channel name too long": SELECT pg_notify('notify_async_channel_name_too_long__','sample_message1'); ERROR: channel name too long But the behaviour is inconsistent with NOTIFY command: NOTIFY notify_async_channel_name_too_long__ NOTICE: identifier "notify_async_channel_name_too_long__" will be truncated to ... I guess that the expected behavior would be that if the outer level is > truncated, the rest of the > channel name should be ignored, as there won't be possible to notify it > anyway. > > In the case of the inner levels creating a channel name too long, it may > probably sane to just > check the length of the entire identifier, and truncate -- ensuring that > channel name doesn't > end with the level separator. > > Well, I believe that we can forbid too long channel names at all. So it provides consistent behaviour among different ways we can send notifications, and I agree with you that "there won't be possible to notify it anyway". I created a patch for that and attached it to the email. In the patch I relocated truncation from core scanner to parser. And as the same core scanner is also used in plsql I added three lines of code to its scanner to basically truncate too long identifiers in there. Here is an example of the new behaviour: -- Should fail. Too long channel names NOTIFY notify_async_channel_name_too_long_._; ERROR: channel name too long LISTEN notify_async_channel_name_too_long_%._; ERROR: channel name too long UNLISTEN notify_async_channel_name_too_long_%._; ERROR: channel name too long Regards, Alexander Cheshev On Sun, 21 Jul 2024 at 21:36, Emanuel Calvo <3man...@gmail.com> wrote: > > Hi Alexander, > > I did a review on the new patch version and I observed that the identifier > passed to the LISTEN command is handled differently between outer and > inner > levels. > > When the outer level exceeds the 64 characters limitation, the outer level > of the > channel name is truncated, but leaves the inner levels in the channel name > due > that isn't parsed in the same way. > > Also, even if the outer level isn't truncated, it is allowed to add > channels names > that exceeds the allowed identifier size. > > It can be reproduced just by: > > # LISTEN a.a.a.a.a.lot.of.levels..; -- this doesn't fail at LISTEN, > but fails in NOTIFY due to channel name too long > > In the following, the outer level is truncated, but it doesn't cut out the > inner levels. This leaves > listening channels that cannot receive any notifications in the queue: > > # LISTEN > notify_async_channel_name_too_long.a.a. > ... 
> NOTICE: identifier will be truncated > > # select substring(c.channel,0,66), length(c.channel) from > pg_listening_channels() c(channel) where length(c.channel) > 64; > substring | > notify_async_channel_name_too_long_.a > length| 1393 > > > I guess that the expected behavior would be that if the outer level is > truncated, the rest of the > channel name should be ignored, as there won't be possible to notify it > anyway. > > In the case of the inner levels creating a channel name too long, it may > probably sane to just > check the length of the entire identifier, and truncate -- ensuring that > channel name doesn't > end with the level separator. > > Another observation, probably not strictly related to this patch itself > but the async-notify tests, is that there is no test for > "payload too long". Probably there is a reason on why isn't in the specs? > > > Regards, > > > El lun, 15 jul 2024 a las 12:59, Alexander Cheshev (< > alex.ches...@gmail.com>) escribió: > >> Hi Emanuel, >> >> Changed implementation of the function Exec_UnlistenCommit . v2 of the >> path contained a bug in the function Exec_UnlistenCommit (added a test case >> for that) an
Re: race condition in pg_class
Noah Misch writes: > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: >> A recent buildfarm test failure [1] showed that the >> intra-grant-inplace-db.spec test added with 0844b3968 may fail >> on a slow machine >> But as the test going to be modified by the inplace110-successors-v8.patch >> and the modified test (with all three latest patches applied) passes >> reliably in the same conditions, maybe this failure doesn't deserve a >> deeper exploration. > Agreed. Let's just wait for code review of the actual bug fix, not develop a > separate change to stabilize the test. One flake in three weeks is low enough > to make that okay. It's now up to three similar failures in the past ten days: in addition to https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08 I see https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37 Is it time to worry yet? If this were HEAD only, I'd not be too concerned; but two of these three are on allegedly-stable branches. And we have releases coming up fast. (BTW, I don't think taipan qualifies as a slow machine.) regards, tom lane
Re: Parent/child context relation in pg_get_backend_memory_contexts()
David Rowley writes: > I ended up fixing that another way as the above seems to be casting > away the const for those variables. Instead, I changed the signature > of the function to: > static void get_memory_context_name_and_ident(MemoryContext context, > const char **const name, const char **const ident); > which I think takes into account for the call site variables being > defined as "const char *". I did not check the history to see quite what happened here, but Coverity thinks the end result is rather confused, and I agree: *** CID 1615190: Null pointer dereferences (REVERSE_INULL) /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/mcxtfuncs.c: 58 in get_memory_context_name_and_ident() 52 *ident = context->ident; 53 54 /* 55 * To be consistent with logging output, we label dynahash contexts with 56 * just the hash table name as with MemoryContextStatsPrint(). 57 */ >>> CID 1615190: Null pointer dereferences (REVERSE_INULL) >>> Null-checking "ident" suggests that it may be null, but it has already >>> been dereferenced on all paths leading to the check. 58 if (ident && strcmp(*name, "dynahash") == 0) 59 { 60 *name = *ident; 61 *ident = NULL; 62 } 63 } It is not clear to me exactly which of these pointers should be presumed to be possibly-null, but certainly testing ident after storing through it is pretty pointless. Maybe what was intended was - if (ident && strcmp(*name, "dynahash") == 0) + if (*name && strcmp(*name, "dynahash") == 0) ? regards, tom lane
Re: race condition in pg_class
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote: > Noah Misch writes: > > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote: > >> A recent buildfarm test failure [1] showed that the > >> intra-grant-inplace-db.spec test added with 0844b3968 may fail > > >> But as the test going to be modified by the inplace110-successors-v8.patch > >> and the modified test (with all three latest patches applied) passes > >> reliably in the same conditions, maybe this failure doesn't deserve a > >> deeper exploration. > > > Agreed. Let's just wait for code review of the actual bug fix, not develop > > a > > separate change to stabilize the test. One flake in three weeks is low > > enough > > to make that okay. > > It's now up to three similar failures in the past ten days > Is it time to worry yet? If this were HEAD only, I'd not be too > concerned; but two of these three are on allegedly-stable branches. > And we have releases coming up fast. I don't know; neither decision feels terrible to me. A bug fix that would address both the data corruption causes and those buildfarm failures has been awaiting review on this thread for 77 days. The data corruption causes are more problematic than 0.03% of buildfarm runs getting noise failures. Two wrongs don't make a right, but a commit masking that level of buildfarm noise also feels like sending the wrong message.
Re: race condition in pg_class
Noah Misch writes: > On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote: >> Is it time to worry yet? If this were HEAD only, I'd not be too >> concerned; but two of these three are on allegedly-stable branches. >> And we have releases coming up fast. > I don't know; neither decision feels terrible to me. Yeah, same here. Obviously, it'd be better to spend effort on getting the bug fix committed than to spend effort on some cosmetic workaround. The fact that the failure is in the isolation tests not the core regression tests reduces my level of concern somewhat about shipping it this way. I think that packagers typically run the core tests not check-world during package verification, so they won't hit this. regards, tom lane
Re: UUID v7
> On 24 Jul 2024, at 04:09, Sergey Prokhorenko wrote:
>
> Implementations MAY alter the actual timestamp.

Hmm… looks like we slightly misinterpreted the words about the clock source. Well, that's great, let's get offset back. PFA version accepting an offset interval. It works like this:

postgres=# select uuidv7(interval '-2 months');
018fc02f-0996-7136-aeb4-8936b5a516a1

postgres=# select uuid_extract_timestamp(uuidv7(interval '-2 months'));
2024-05-28 22:11:15.71+05

What do you think?

Best regards, Andrey Borodin.

v26-0001-Implement-UUID-v7.patch
Description: Binary data
Re: Optimize mul_var() for var1ndigits >= 8
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06 -- HEAD
> call rate=3.883459e+06 -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06 -- HEAD
> call rate=3.7965685e+06 -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06 -- HEAD
> call rate=3.39948e+06 -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs.
>

I spent some more time hacking on this, to try to improve the situation for 32-bit builds. One of the biggest factors was the 64-bit division that is necessary during carry propagation, which is very slow -- every compiler/platform I looked at on godbolt.org turns this into a call to a slow function like __udivdi3(). However, since we are dividing by a constant (NBASE^2), it can be done using the same multiply-and-shift trick that compilers use in 64-bit builds, and that significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower for small inputs (below 20-30 NBASE digits), so in the end I figured that it's best to retain the old 32-bit base-NBASE multiplication code for small inputs, and only use 64-bit base-NBASE^2 multiplication when the inputs are above a certain size threshold. Furthermore, since this threshold is quite low, it's possible to make an additional simplification: as long as the threshold is less than or equal to 42, it can be shown that there is no chance of 32-bit integer overflow, and so the "maxdig" tracking and renormalisation code is not needed. Getting rid of that makes the outer multiplication loop very simple, and makes quite a noticeable difference to the performance for inputs below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster for all inputs over 4 or 5 NBASE digits, so the threshold can be much lower. However, for numbers near that threshold, it's a close thing, so it makes sense to also extend mul_var_short() to cover 1-6 digit inputs, since that gives a much larger gain for numbers of that size. That's quite nice because it equates to inputs with up to 21-24 decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is that it opens up the possibility that the results from the reduced-rscale computations used by inexact functions like ln() and exp() might be different (actually, this may already be a possibility, due to div_var_fast()'s use of floating point arithmetic, but that seems considerably less likely). In practice though, this should be extremely unlikely, due to the fact that any difference would have to propagate through MUL_GUARD_DIGITS NBASE digits (8 decimal digits), and it doesn't show up in any of the existing tests. IMO a very small chance of different results on different platforms is probably acceptable, but it wouldn't be acceptable to make the threshold a runtime configurable parameter that could lead to different results on the same platform.

Overall, this has turned out to be more code than I would have liked, but I think it's worth it, because the performance gains are pretty substantial across the board.
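As an aside for readers unfamiliar with the multiply-and-shift trick: replacing division by a constant with multiplication by a precomputed reciprocal looks like the sketch below. The constants shown are for dividing a 32-bit value by NBASE (10000), chosen because they are easy to verify; the patch applies the same idea to 64-bit dividends and NBASE^2, with different constants.

#include <stdint.h>
#include <assert.h>

/*
 * Sketch of multiply-and-shift division by a constant.  For any uint32 x,
 * x / 10000 == (uint32) ((x * 3518437209) >> 45), because
 * 3518437209 = ceil(2^45 / 10000) and the usual Granlund-Montgomery error
 * bound holds for all 32-bit dividends.
 */
static inline uint32_t
div10000(uint32_t x)
{
    return (uint32_t) (((uint64_t) x * UINT64_C(3518437209)) >> 45);
}

int
main(void)
{
    assert(div10000(0) == 0);
    assert(div10000(9999) == 0);
    assert(div10000(10000) == 1);
    assert(div10000(4294967295u) == 429496);
    return 0;
}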
(Here, I'm comparing with REL_17_STABLE, so that the effect of mul_var_short() for ndigits <= 6 can be seen.)

32-bit build

Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
        1 |        1 | 5.973068e+06  | 6.873059e+06  | +15.07%
        2 |        2 | 5.646598e+06  | 6.6456665e+06 | +17.69%
        3 |        3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
        4 |        4 | 5.545298e+06  | 6.9701605e+06 | +25.69%
        5 |        5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
        6 |        6 | 4.9276335e+06 | 6.478698e+06  | +31.48%
        7 |        7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
        8 |        8 | 4.898536e+06  | 5.1599975e+06 | +5.34%
        9 |        9 | 4.537171e+06  | 4.830987e+06  | +6.48%
       10 |       10 | 4.308139e+06  | 4.6029265e+06 | +6.84%
       11 |       11 | 4.092612e+06  | 4.37966e+06   | +7.01%
       12 |       12 | 3.9345035e+06 | 4.213998e+06  | +7.10%
       13 |       13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
       14 |       14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
       15 |       15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
       16 |       16 | 3.657139e+06  | 3.9067975e+06 | +6.83%
       17 |       17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
       18 |       18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
       19 |       19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
       20 |       20 | 2.9012185e+06 | 3.0496352e+06
Re: why is pg_upgrade's regression run so slow?
On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:

Andrew Dunstan writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:
Just to add some more fuel to the fire: I do *not* observe such an effect on my own animals.

The culprit appears to be meson. When I tested running crake with "using_meson => 0" I got results in line with yours.

Interesting. Maybe meson is over-aggressively trying to run these test suites in parallel?

Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has just gone green, and I expect drongo will when it reports in a few hours. I'm at a loss for an explanation.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Speed up collation cache
On Sun, 2024-07-28 at 00:14 +0200, Andreas Karlsson wrote: > But even without that extra optimization I think this patch is worth > merging and the patch is small, simple and clean and easy to > understand > and a just a clear speed up. Feels like a no brainer. I think that it > is > ready for committer. Committed, thank you. > And then we can discuss after committing if an additional cache of > the > last locale is worth it or not. Yeah, I'm holding off on that until refactoring in the area settles, and we'll see if it's still worth it. Regards, Jeff Davis
Re: Pluggable cumulative statistics
> On Sun, Jul 28, 2024 at 10:20:45PM GMT, Michael Paquier wrote: > I would like to apply this new infrastructure stuff and move on to the > problems related to the scability of pg_stat_statements. So, are > there any objections with all that? So far I've got nothing against :)
Re: Parent/child context relation in pg_get_backend_memory_contexts()
On Mon, 29 Jul 2024 at 04:31, Tom Lane wrote: > It is not clear to me exactly which of these pointers should be > presumed to be possibly-null, but certainly testing ident after > storing through it is pretty pointless. Maybe what was intended > was > > - if (ident && strcmp(*name, "dynahash") == 0) > + if (*name && strcmp(*name, "dynahash") == 0) It should be *ident. I just missed adding the pointer dereference when moving that code to a function. Thanks for the report. I'll fix shortly. David
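Spelled out against the code excerpt in Tom's report, the corrected block would read as follows (nothing here beyond the quoted lines, with the missing dereference added):

/* label dynahash contexts with just the hash table name, as before */
if (*ident && strcmp(*name, "dynahash") == 0)
{
    *name = *ident;
    *ident = NULL;
}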
Re: pg_upgrade failing for 200+ million Large Objects
I wrote:
> Alexander Korotkov writes:
>> J4F, I have an idea to count the number of ';' signs and use it for
>> the transaction size counter, since it is an upper bound estimate of
>> the number of SQL commands :-)
> Hmm ... that's not a completely silly idea. Let's keep it in
> the back pocket in case we can't easily reduce the number of
> SQL commands in some cases.

After poking at this for awhile, we can fix Justin's example case by avoiding repeated UPDATEs on pg_attribute, so I think we should do that. It seems clearly a win, with no downside other than a small increment of complexity in pg_dump.

However, that's probably not sufficient to mark this issue as closed. It seems likely that there are other patterns that would cause backend memory bloat. One case that I found is tables with a lot of inherited constraints (not partitions, but old-style inheritance). For example, load the output of this Perl script into a database:

-----
for (my $i = 0; $i < 100; $i++)
{
    print "CREATE TABLE test_inh_check$i (\n";
    for (my $j = 0; $j < 1000; $j++)
    {
        print "a$j float check (a$j > 10.2),\n";
    }
    print "b float);\n";
    print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
}
-----

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in ruleutils.c, and pg_upgrade is worse --- and leaks memory too in HEAD/v17. The slowness was there before, so I think the lack of field complaints indicates that this isn't a real-world use case. Still, it's bad if pg_upgrade fails when it would not have before, and there may be other similar issues. So I'm forced to the conclusion that we'd better make the transaction size adaptive as per Alexander's suggestion.

In addition to the patches attached, I experimented with making dumpTableSchema fold all the ALTER TABLE commands for a single table into one command. That's do-able without too much effort, but I'm now convinced that we shouldn't. It would break the semicolon-counting hack for detecting that tables like these involve extra work. I'm also not very confident that the backend won't have trouble with ALTER TABLE commands containing hundreds of subcommands. That's something we ought to work on probably, but it's not a project that I want to condition v17 pg_upgrade's stability on.

Anyway, proposed patches attached. 0001 is some trivial cleanup that I noticed while working on the failed single-ALTER-TABLE idea. 0002 merges the catalog-UPDATE commands that dumpTableSchema issues, and 0003 is Alexander's suggestion.

regards, tom lane

From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Sun, 28 Jul 2024 13:02:27 -0400
Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command.

The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema all use ONLY, except for this one. I think it's not a live bug because we don't permit foreign tables to have children, but it's still inconsistent. I happened across this while refactoring dumpTableSchema to merge all its ALTERs into one; while I'm not certain we should actually do that, this seems like good cleanup.
--- src/bin/pg_dump/pg_dump.c| 2 +- src/bin/pg_dump/t/002_pg_dump.pl | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index b8b1888bd3..7cd6a7fb97 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) if (tbinfo->relkind == RELKIND_FOREIGN_TABLE && tbinfo->attfdwoptions[j][0] != '\0') appendPQExpBuffer(q, - "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n" + "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n" "%s\n" ");\n", qualrelname, diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index d3dd8784d6..5bcc2244d5 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -1154,7 +1154,7 @@ my %tests = ( 'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => { regexp => qr/^ - \QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n + \QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n \s+\Qcolumn_name 'col1'\E\n \Q);\E\n /xm, -- 2.43.5 From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001 From: Tom Lane Date: Sun, 28 Jul 2024 16:19:48 -0400 Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for binary upgrade. Avoid issuing a separate SQL UPDATE command for each column when directly manipulating pg_attribute contents in binary upgrade mode. With the separate updates, we triggered a relcache invalidation with each update. For a table with N columns, that causes O(N^2) relcache bloat in txn_size mode because the ta
Re: Optimize mul_var() for var1ndigits >= 8
On Sun, Jul 28, 2024, at 21:18, Dean Rasheed wrote:
> Attachments:
> * v3-0002-Optimise-numeric-multiplication-using-base-NBASE-.patch
> * v3-0001-Extend-mul_var_short-to-5-and-6-digit-inputs.patch

Very nice. I've done some initial benchmarks on my Intel Core i9-14900K machine. To reduce noise, I've isolated a single CPU core, specifically CPU core id 31, so that no other work gets scheduled on it by the kernel:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-116-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash isolcpus=31 intel_pstate=disable vt.handoff=7

Then, I've used sched_setaffinity() from <sched.h> to ensure the benchmark runs on CPU core id 31. I've also fixed the CPU frequency to 3.20 GHz:

$ sudo cpufreq-info -c 31
... current CPU frequency is 3.20 GHz (asserted by call to hardware).

I've benchmarked each (var1ndigits, var2ndigits) 10 times per commit, in random order. I've benchmarked all commits after "SQL/JSON: Various improvements to SQL/JSON query function docs", which is the parent commit to "Optimise numeric multiplication for short inputs.", including the two patches. I've benchmarked each commit affecting numeric.c, and each such commit's parent commit, for comparison.

The accum_change column shows the accumulated percentage change since the baseline commit (SQL/JSON: Various improvements). There is at least single-digit percentage noise in the measurements, which is apparent since the rate fluctuates even between commits for cases we know can't be affected by the change. Still, even with this noise level, the improvements are very visible and consistent.

 ndigits | rate      | accum_change | summary
---------+-----------+--------------+--------------------------------
 (1,1)   | 1.702e+07 |              | SQL/JSON: Various improvements
 (1,1)   | 2.201e+07 | +29.32 %     | Optimise numeric multiplicatio
 (1,1)   | 2.268e+07 | +33.30 %     | Use diff's --strip-trailing-cr
 (1,1)   | 2.228e+07 | +30.92 %     | Improve the numeric width_buck
 (1,1)   | 2.195e+07 | +29.01 %     | Add missing pointer dereferenc
 (1,1)   | 2.241e+07 | +31.68 %     | Extend mul_var_short() to 5 an
 (1,1)   | 2.130e+07 | +25.17 %     | Optimise numeric multiplicatio
 (1,2)   | 1.585e+07 |              | SQL/JSON: Various improvements
 (1,2)   | 2.227e+07 | +40.49 %     | Optimise numeric multiplicatio
 (1,2)   | 2.140e+07 | +35.00 %     | Use diff's --strip-trailing-cr
 (1,2)   | 2.227e+07 | +40.51 %     | Improve the numeric width_buck
 (1,2)   | 2.183e+07 | +37.75 %     | Add missing pointer dereferenc
 (1,2)   | 2.241e+07 | +41.41 %     | Extend mul_var_short() to 5 an
 (1,2)   | 2.223e+07 | +40.26 %     | Optimise numeric multiplicatio
 (1,3)   | 1.554e+07 |              | SQL/JSON: Various improvements
 (1,3)   | 2.155e+07 | +38.70 %     | Optimise numeric multiplicatio
 (1,3)   | 2.140e+07 | +37.74 %     | Use diff's --strip-trailing-cr
 (1,3)   | 2.139e+07 | +37.66 %     | Improve the numeric width_buck
 (1,3)   | 2.234e+07 | +43.76 %     | Add missing pointer dereferenc
 (1,3)   | 2.142e+07 | +37.83 %     | Extend mul_var_short() to 5 an
 (1,3)   | 2.066e+07 | +32.97 %     | Optimise numeric multiplicatio
 (1,4)   | 1.450e+07 |              | SQL/JSON: Various improvements
 (1,4)   | 2.113e+07 | +45.70 %     | Optimise numeric multiplicatio
 (1,4)   | 2.121e+07 | +46.30 %     | Use diff's --strip-trailing-cr
 (1,4)   | 2.115e+07 | +45.85 %     | Improve the numeric width_buck
 (1,4)   | 2.166e+07 | +49.37 %     | Add missing pointer dereferenc
 (1,4)   | 2.053e+07 | +41.56 %     | Extend mul_var_short() to 5 an
 (1,4)   | 2.085e+07 | +43.82 %     | Optimise numeric multiplicatio
 (1,8)   | 1.440e+07 |              | SQL/JSON: Various improvements
 (1,8)   | 1.963e+07 | +36.38 %     | Optimise numeric multiplicatio
 (1,8)   | 2.018e+07 | +40.19 %     | Use diff's --strip-trailing-cr
 (1,8)   | 2.045e+07 | +42.05 %     | Improve the numeric width_buck
 (1,8)   | 1.998e+07 | +38.79 %     | Add missing pointer dereferenc
 (1,8)   | 1.953e+07 | +35.68 %     | Extend mul_var_short() to 5 an
 (1,8)   | 1.992e+07 | +38.36 %     | Optimise numeric multiplicatio
 (1,16)  | 9.444e+06 |              | SQL/JSON: Various improvements
 (1,16)  | 1.235e+07 | +30.75 %     | Optimise numeric multiplicatio
 (1,16)  | 1.232e+07 | +30.47 %     | Use diff's --strip-trailing-cr
 (1,16)  | 1.239e+07 | +31.18 %     | Improve the numeric width_buck
 (1,16)  | 1.222e+07 | +29.35 %     | Add missing pointer dereferenc
 (1,16)  | 1.220e+07 | +29.14 %     | Extend mul_var_short() to 5 an
 (1,16)  | 1.271e+07 | +34.54 %     | Optimise numeric multiplicatio
 (1,32)  | 5.790e+06 |              | SQL/JS
Re: POC, WIP: OR-clause support for indexes
On Sun, Jul 28, 2024 at 12:59 PM Alena Rybakina wrote: > On 27.07.2024 13:56, Alexander Korotkov wrote: > > On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina > > wrote: > >> To be honest, I have found a big problem in this patch - we try to perform > >> the transformation every time we examime a column: > >> > >> for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ... > >> > >> } > >> > >> I have fixed it and moved the transformation before going through the loop. > > What makes you think there is a problem? > > To be honest, I was bothered by the fact that we need to go through > expressions several times that obviously will not fit under other > conditions. > Just yesterday I thought that it would be worthwhile to create a list of > candidates - expressions that did not fit because the column did not > match the index (!match_index_to_operand(nconst_expr, indexcol, index)). I admit that this area probably could use some optimization, especially for case of many clauses and many indexes. But in the scope of this patch, I think this is enough to not make things worse in this area. > Another problem that is related to the first one that the boolexpr could > contain expressions referring to different operands, for example, both x > and y. that is, we may have the problem that the optimal "ANY" > expression may not be used, because the expression with x may come > earlier and the loop may end earlier. > > alena@postgres=# create table b (x int, y int); > CREATE TABLE > alena@postgres=# insert into b select id, id from > generate_series(1,1000) as id; > INSERT 0 1000 > alena@postgres=# create index x_idx on b(x); > CREATE INDEX > alena@postgres=# analyze; > ANALYZE > alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or > x=6 or x = 7 or x=8 or x=9; >QUERY PLAN > --- > Seq Scan on b (cost=0.00..32.50 rows=7 width=8) > Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = > 8) OR (x = 9)) > (2 rows) > alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = > 7 or x=8 or x=9 or y=1; >QUERY PLAN > --- > Seq Scan on b (cost=0.00..32.50 rows=7 width=8) > Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = > 9) OR (y = 1)) > (2 rows) > alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = > 7 or x=8 or x=9; > QUERY PLAN > > Index Scan using x_idx on b (cost=0.28..12.75 rows=6 width=8) > Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[])) > (2 rows) > > Furthermore expressions can be stored in a different order. > For example, first comes "AND" expr, and then group of "OR" expr, which > we can convert to "ANY" expr, but we won't do this due to the fact that > we will exit the loop early, according to this condition: > > if (!IsA(sub_rinfo->clause, OpExpr)) > return NULL; > > or it may occur due to other conditions. > > alena@postgres=# create index x_y_idx on b(x,y); > CREATE INDEX > alena@postgres=# analyze; > ANALYZE > > alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4 > or x=5 or x=6 or x = 7 or x=8 or x=9; > QUERY PLAN > - > Seq Scan on b (cost=0.00..35.00 rows=6 width=8) > Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR > (x = 7) OR (x = 8) OR (x = 9)) > (2 rows) > > Because of these reasons, I tried to save this and that transformation > together for each column and try to analyze for each expr separately > which method would be optimal. Yes, with v27 of the patch, optimization wouldn't work in these cases. However, you are using quite small table. 
If you will use larger table or disable sequential scans, there would be bitmap plans to handle these queries. So, v27 doesn't make the situation worse. It just doesn't optimize all that it could potentially optimize and that's OK. I've written a separate 0002 patch to address this. Now, before generation of paths for bitmap OR, similar OR entries are grouped together. When considering a group of similar entries, they are considered both together and one-by-one. Ideally we could consider more sophisticated grouping, but that seems fine for now. You can check how this patch handles the cases of above. Also, 0002 address issue of duplicated bitmap scan conditions in different forms. During generate_bitmap_or_paths() we need to exclude considered condition for other clauses. It couldn't be as normal filtered out in the latter stage, because could reach the index in another form. > > Do you have a test ca
Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?
On 19.06.2024 21:06, Peter Geoghegan wrote:

On Wed, Jun 19, 2024 at 1:39 PM Alvaro Herrera wrote:
FWIW I don't think of HEAP_XMAX_INVALID as purely a hint. HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on its own; but as far as I recall, the INVALID flags must persist once set.

Seems we disagree on some pretty fundamental things in this area, then.

To resolve this situation, it seems necessary to agree on what a "hint bit" exactly means and how to use it. For example, in this way:

1) Definition. A "hint bit", if it is set, indicates the presence of some property of some object (e.g. a tuple). The value of a hint bit can be derived again at any time, so it is acceptable for a hint bit to be lost during some operations.

2) Purpose. (It has already been defined by Yura Sokolov in one of the previous letters.) Some work (e.g. CPU + memory usage) must be done to check the property of some object. Checking the hint bit, if it is set, saves this work. So the purpose of the hint bit is optimization.

3) Use. By default, code that needs to check some property of the object must first check the corresponding hint bit. If the hint is set, conclude that the property is present. If the hint is not set, do the work to check the property of the object, and set the hint bit if the property is present. Non-standard behavior is also allowed, where the hint bit is ignored and the property check is performed unconditionally for some reason; in that case the code must contain a comment explaining the reason. (A sketch of the default pattern appears after this message.)

And maybe, for clarity, explicitly say that some bit is a hint right in its definition? For instance, use HEAP_XMIN_COMMITTED_HINT instead of HEAP_XMIN_COMMITTED.

Remarks and concerns are gratefully welcome.

With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
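As a concrete illustration of point 3, here is a simplified sketch in the spirit of heapam_visibility.c; it is not a verbatim excerpt, and the surrounding buffer/snapshot handling is elided.

/*
 * Simplified sketch of the default hint-bit pattern: trust the hint if
 * set; otherwise do the real work once and set the hint so the next
 * reader can skip it.
 */
static bool
xmin_committed(HeapTupleHeader tuple, Buffer buffer, TransactionId xmin)
{
    if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
        return true;            /* hint present: skip the clog lookup */

    if (TransactionIdDidCommit(xmin))
    {
        /* do the work once, then cache the result as a hint */
        SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, xmin);
        return true;
    }

    return false;
}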
Re: Simplify create_merge_append_path a bit for clarity
On Fri, Jul 26, 2024 at 1:28 PM Paul A Jungwirth wrote: > Is there a reason you don't want to remove the required_outer > parameter altogether? I guess because it is such a common pattern to > pass it? I think it's best to keep this parameter unchanged to maintain consistency with other functions that create path nodes in pathnode.c. > Do you think it is worth keeping this assertion?: > > - > -/* All child paths must have same parameterization */ > -Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer)); > > I understand any failure would trigger one of the prior asserts > instead, but it does communicate an extra requirement, and there is no > cost. I don't think it's a good idea to keep this Assert: with this change it becomes redundant. > But I'd be fine with committing this patch as-is. I've pushed this patch. Thanks for review. Thanks Richard
Re: Allow logical failover slots to wait on synchronous replication
On Fri, Jul 26, 2024 at 5:11 PM Amit Kapila wrote: > > On Fri, Jul 26, 2024 at 3:28 PM shveta malik wrote: > > > > On Tue, Jul 23, 2024 at 10:35 AM Amit Kapila > > wrote: > > > > > > On Tue, Jul 9, 2024 at 12:39 AM John H wrote: > > > > > > > > > Out of curiosity, did you compare with > > > > > standby_slot_names_from_syncrep set to off > > > > > and standby_slot_names not empty? > > > > > > > > I didn't think 'standby_slot_names' would impact TPS as much since > > > > it's not grabbing the SyncRepLock but here's a quick test. > > > > Writer with 5 synchronous replicas, 10 pg_recvlogical clients and > > > > pgbench all running from the same server. > > > > > > > > Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5 > > > > > > > > Result with: standby_slot_names = > > > > 'replica_1,replica_2,replica_3,replica_4,replica_5' > > > > > > > > latency average = 5.600 ms > > > > latency stddev = 2.854 ms > > > > initial connection time = 5.503 ms > > > > tps = 714.148263 (without initial connection time) > > > > > > > > Result with: standby_slot_names_from_syncrep = 'true', > > > > synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' > > > > > > > > latency average = 5.740 ms > > > > latency stddev = 2.543 ms > > > > initial connection time = 4.093 ms > > > > tps = 696.776249 (without initial connection time) > > > > > > > > Result with nothing set: > > > > > > > > latency average = 5.090 ms > > > > latency stddev = 3.467 ms > > > > initial connection time = 4.989 ms > > > > tps = 785.665963 (without initial connection time) > > > > > > > > Again I think it's possible to improve the synchronous numbers if we > > > > cache but I'll try that out in a bit. > > > > > > > > > > Okay, so the tests done till now conclude that we won't get the > > > benefit by using 'standby_slot_names_from_syncrep'. Now, if we > > > increase the number of standby's in both lists and still keep ANY 3 in > > > synchronous_standby_names then the results may vary. We should try to > > > find out if there is a performance benefit with the use of > > > synchronous_standby_names in the normal configurations like the one > > > you used in the above tests to prove the value of this patch. > > > > > > > I didn't fully understand the parameters mentioned above, specifically > > what 'latency stddev' and 'latency average' represent.. But shouldn't > > we see the benefit/value of this patch by having a setup where a > > particular standby is slow in sending the response back to primary > > (could be due to network lag or other reasons) and then measuring the > > latency in receiving changes on failover-enabled logical subscribers? > > We can perform this test with both of the below settings and say make > > D and E slow in sending responses: > > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' > > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot. > > > > Yes, I also expect the patch should perform better in such a scenario > but it is better to test it. Also, irrespective of that, we should > investigate why the reported case is slower for > synchronous_standby_names and see if we can improve it. +1 > BTW, you for 2), I think you wanted to say synchronized_standby_slots, > not standby_slot_names. We have recently changed the GUC name. yes, sorry, synchronized_standby_slots it is. thanks Shveta
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
On Fri, Jul 26, 2024 at 5:44 PM Richard Guo wrote: > I've worked a bit more on the comments and commit message, and I plan > to push the attached soon, barring any objections or comments. Pushed. Thanks Richard
Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
Thanks a lot Richard. On Mon, Jul 29, 2024 at 8:56 AM Richard Guo wrote: > > On Fri, Jul 26, 2024 at 5:44 PM Richard Guo wrote: > > I've worked a bit more on the comments and commit message, and I plan > > to push the attached soon, barring any objections or comments. > > Pushed. > > Thanks > Richard -- Best Wishes, Ashutosh Bapat
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Fujii-san, Thanks for pushing and analyzing the failure! > The regression.diffs shows that pgfdw_conn_check returned 0 even though > pgfdw_conn_checkable() > returned true. This can happen if the "revents" from poll() indicates > something > other than > POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or > POLLNVAL. Therefore, > IMO pgfdw_conn_check() should be updated as follows. I will test this change. > > - return (input_fd.revents & POLLRDHUP) ? 1 : 0; > + return (input_fd.revents & > + (POLLRDHUP | POLLHUP | POLLERR | > POLLNVAL)) ? 1 : 0; I think you are right. According to the man page, the input socket is invalid or disconnected if revents has such bits. So such cases should be also regarded as *failure*. After the fix, the animal said OK. Great works! Best regards, Hayato Kuroda FUJITSU LIMITED
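For anyone following along, the overall shape of the check being discussed is roughly the standalone sketch below; postgres_fdw's pgfdw_conn_check() is the real thing. POLLRDHUP is Linux-specific (hence _GNU_SOURCE); the other bits are plain POSIX.

#define _GNU_SOURCE
#include <poll.h>

/* Returns 1 if the socket looks dead, 0 if alive, -1 on poll() failure. */
static int
socket_looks_dead(int sock)
{
    struct pollfd input_fd;

    input_fd.fd = sock;
    input_fd.events = POLLRDHUP;
    input_fd.revents = 0;

    if (poll(&input_fd, 1, 0) < 0)
        return -1;

    /* Peer closed, hung up, error, or invalid descriptor: all count as dead. */
    return (input_fd.revents &
            (POLLRDHUP | POLLHUP | POLLERR | POLLNVAL)) ? 1 : 0;
}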
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Fujii-san, > > IIUC, the patch which adds user_name attribute to get_connection() can be > discussed > > in later stage, is it right? > > No, let's work on the patch at this stage :) OK, here is a rebased patch. - Changed the name of new API from `GetUserMappingFromOid` to `GetUserMappingByOid` to keep the name consistent with others. - Comments and docs were updated. Best regards, Hayato Kuroda FUJITSU LIMITED 0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch Description: 0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch
dikkop failed the pg_combinebackupCheck/006_db_file_copy.pl test
Hello Tomas, Please take a look at a recent dikkop's failure [1]. The regress_log_006_db_file_copy file from that run shows: [02:08:57.929](0.014s) # initializing database system by copying initdb template ... [02:09:22.511](24.583s) ok 1 - full backup ... [02:10:35.758](73.247s) not ok 2 - incremental backup 006_db_file_copy_primary.log contains: 2024-07-28 02:09:29.441 UTC [67785:12] 006_db_file_copy.pl LOG: received replication command: START_REPLICATION SLOT "pg_basebackup_67785" 0/400 TIMELINE 1 2024-07-28 02:09:29.441 UTC [67785:13] 006_db_file_copy.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_67785" 0/400 TIMELINE 1 2024-07-28 02:09:29.441 UTC [67785:14] 006_db_file_copy.pl LOG: acquired physical replication slot "pg_basebackup_67785" 2024-07-28 02:09:29.441 UTC [67785:15] 006_db_file_copy.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_67785" 0/400 TIMELINE 1 2024-07-28 02:10:29.487 UTC [67785:16] 006_db_file_copy.pl LOG: terminating walsender process due to replication timeout 2024-07-28 02:10:29.487 UTC [67785:17] 006_db_file_copy.pl STATEMENT: START_REPLICATION SLOT "pg_basebackup_67785" 0/400 TIMELINE 1 It looks like this incremental backup operation was performed slower than usual (it took more than 60 seconds and apparently was interrupted due to wal_sender_timeout). But looking at regress_log_006_db_file_copy from the 6 previous (successful) test runs, we can see: [14:22:16.841](43.215s) ok 2 - incremental backup [02:14:42.888](34.595s) ok 2 - incremental backup [17:51:16.152](43.708s) ok 2 - incremental backup [04:07:16.757](31.087s) ok 2 - incremental backup [12:15:01.256](49.432s) ok 2 - incremental backup [01:06:02.482](52.364s) ok 2 - incremental backup Thus reaching 60s (e.g., due to some background activity) on this animal seems pretty possible. So maybe it would make sense to increase wal_sender_timeout for it, say, to 120s? [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-07-27%2023%3A22%3A57 Best regards, Alexander
Re: Conflict detection and logging in logical replication
On Fri, Jul 26, 2024 at 4:28 PM shveta malik wrote: > > On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila wrote: > > > > > One more thing we need to consider is whether we should LOG or ERROR > > for update/delete_differ conflicts. If we LOG as the patch is doing > > then we are intentionally overwriting the row when the user may not > > expect it. OTOH, without a patch anyway we are overwriting, so there > > is an argument that logging by default is what the user will expect. > > What do you think? > > I was under the impression that in this patch we do not intend to > change behaviour of HEAD and thus only LOG the conflict wherever > possible. > Earlier, I thought it was good to keep LOGGING the conflict where there is no chance of wrong data update but for cases where we can do something wrong, it is better to ERROR out. For example, for an update_differ case where the apply worker can overwrite the data from a different origin, it is better to ERROR out. I thought this case was comparable to an existing ERROR case like a unique constraint violation. But I see your point as well that one might expect the existing behavior where we are silently overwriting the different origin data. One idea to address this concern is to suggest users set the detect_conflict subscription option as off but I guess that would make this feature unusable for users who don't want to ERROR out for different origin update cases. > And in the next patch of resolver, based on the user's input > of error/skip/or resolve, we take the action. I still think it is > better to stick to the said behaviour. Only if we commit the resolver > patch in the same version where we commit the detection patch, then we > can take the risk of changing this default behaviour to 'always > error'. Otherwise users will be left with conflicts arising but no > automatic way to resolve those. But for users who really want their > application to error out, we can provide an additional GUC in this > patch itself which changes the behaviour to 'always ERROR on > conflict'. > I don't see a need for a GUC here; even if we want one, we can have a subscription option such as conflict_log_level. But users may want to either LOG or ERROR based on conflict type. For example, there won't be any data inconsistency in two-node replication for the delete_missing case as one is trying to delete already deleted data, so LOGGING such a case should be sufficient, whereas update_differ could lead to different data on two nodes, so the user may want to ERROR out in such a case. We can keep the current behavior as default for the purpose of conflict detection but can have a separate patch to decide whether to LOG/ERROR based on conflict_type. -- With Regards, Amit Kapila.
Re: Flush pgstats file during checkpoints
Hi, On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote: > On Mon, Jul 22, 2024 at 07:01:41AM +, Bertrand Drouvot wrote: > > 3 === > > > > + /* > > +* Read the redo LSN stored in the file. > > +*/ > > + if (!read_chunk_s(fpin, &file_redo) || > > + file_redo != redo) > > + goto error; > > > > I wonder if it would make sense to have dedicated error messages for > > "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would > > make it easier to diagnose why the stat file is discarded. > > Yep. This has been itching me quite a bit, and that's a bit more than > just the format ID or the redo LSN: it relates to all the read_chunk() > callers. I've taken a shot at this with patch 0001, implemented on > top of the rest. Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch so I guess you did not attach the right one. > Attaching a new v4 series, with all these comments addressed. Thanks! Looking at 0002: 1 === if (!read_chunk(fpin, ptr, info->shared_data_len)) + { + elog(WARNING, "could not read data of stats kind %d for entry of type %c", + kind, t); Nit: what about including the "info->shared_data_len" value in the WARNING? 2 === if (!read_chunk_s(fpin, &name)) + { + elog(WARNING, "could not read name of stats kind %d for entry of type %c", +kind, t); goto error; + } if (!pgstat_is_kind_valid(kind)) + { + elog(WARNING, "invalid stats kind %d for entry of type %c", +kind, t); goto error; + } Shouldn't we swap those 2 tests so that we check that the kind is valid right after this one? if (!read_chunk_s(fpin, &kind)) + { + elog(WARNING, "could not read stats kind for entry of type %c", t); goto error; + } Looking at 0003: LGTM Looking at 0004: LGTM Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
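To make the suggested ordering concrete, here is a self-contained sketch using plain stdio rather than the pgstats read_chunk()/elog() machinery; the entry layout, helper names, and the MAX_KIND bound are purely illustrative:

#include <stdio.h>
#include <stdbool.h>

#define MAX_KIND 32             /* illustrative upper bound */

static bool
kind_is_valid(int kind)
{
    return kind > 0 && kind <= MAX_KIND;
}

static bool
read_entry(FILE *fpin, char t, void *data, size_t data_len)
{
    int     kind;

    if (fread(&kind, sizeof(kind), 1, fpin) != 1)
    {
        fprintf(stderr, "WARNING: could not read stats kind for entry of type %c\n", t);
        return false;
    }

    /* validate the kind right after reading it, before anything depends on it */
    if (!kind_is_valid(kind))
    {
        fprintf(stderr, "WARNING: invalid stats kind %d for entry of type %c\n",
                kind, t);
        return false;
    }

    if (fread(data, 1, data_len, fpin) != data_len)
    {
        /* include the expected length to ease diagnosis */
        fprintf(stderr, "WARNING: could not read %zu bytes of data of stats kind %d for entry of type %c\n",
                data_len, kind, t);
        return false;
    }

    return true;
}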
Re: Allow logical failover slots to wait on synchronous replication
Hi John, On Thu, Jul 18, 2024 at 02:22:08PM -0700, John H wrote: > Hi Bertrand, > > > 1 === > > ... > > That's worth additional comments in the code. > > There's already this comment about caching the value, not sure > if you prefer something more? > > /* Cache values to reduce contention on lock */ Yeah, at the same place as the static lsn[] declaration, something like: static XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE]; /* cached LSNs */ but that may just be a matter of taste. > > 3 === > > ... > > NUM_SYNC_REP_WAIT_MODE is small but as the goal is to keep the lock time as > > short as possible I wonder if it wouldn't be better to use memcpy() here > > instead > > of this for loop. > > > > It results in a "Wdiscarded-qualifiers" warning which is safe given we take > the lock, but adds noise? > What do you think? > > "slot.c:2756:46: warning: passing argument 2 of ‘memcpy’ discards > ‘volatile’ qualifier from pointer target type > [-Wdiscarded-qualifiers]" Right, we may want to cast it then, but given that the for loop is "small" I think it's also fine to keep the for loop. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
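As a standalone sketch of the pattern under discussion (the names and lock calls are illustrative, not the actual slot.c code): copying the volatile shared array element by element keeps the volatile qualifier intact, which is what a plain memcpy() of the volatile pointer would warn about:

#include <stdint.h>

#define NUM_WAIT_MODES 3        /* stand-in for NUM_SYNC_REP_WAIT_MODE */

typedef uint64_t LsnSketch;     /* stand-in for XLogRecPtr */

/* pretend this lives in shared memory and is protected by a lock */
static volatile LsnSketch shared_lsn[NUM_WAIT_MODES];

/* cached values, refreshed while holding the lock to keep hold time short */
static LsnSketch cached_lsn[NUM_WAIT_MODES];

static void
refresh_cached_lsns(void)
{
    /* acquire_lock(); -- spinlock/LWLock in the real code */
    for (int i = 0; i < NUM_WAIT_MODES; i++)
        cached_lsn[i] = shared_lsn[i];  /* element-wise copy keeps volatile semantics */
    /* release_lock(); */
}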
Re: Add mention of execution time memory for enable_partitionwise_* GUCs
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat wrote: > I am fine if we want to mention that the executor may consume a large > amount of memory when these GUCs are turned ON. Users may decide to > turn those OFF if they cannot afford to spend that much memory during > execution. But I don't like tying execution time consumption with > default OFF. Would the attached address your concern about the reasons for defaulting to off? David partitionwise_docs.patch Description: Binary data
Re: Logical Replication of sequences
On Fri, 26 Jul 2024 at 08:04, Peter Smith wrote: > > Here are some review comments for latest patch v20240725-0002 > > == > doc/src/sgml/ref/create_publication.sgml > > nitpick - tweak to the description of the example. > > == > src/backend/parser/gram.y > > preprocess_pub_all_objtype_list: > nitpick - typo "allbjects_list" > nitpick - reword function header > nitpick - /alltables/all_tables/ > nitpick - /allsequences/all_sequences/ > nitpick - I think code is safe as-is because makeNode internally does > palloc0, but OTOH adding Assert would be nicer just to remove any > doubts. > > == > src/bin/psql/describe.c > > 1. > + /* Print any publications */ > + if (pset.sversion >= 18) > + { > + int tuples = 0; > > No need to assign value 0 here, because this will be unconditionally > assigned before use anyway. > > > > 2. describePublications > > has_pubviaroot = (pset.sversion >= 13); > + has_pubsequence = (pset.sversion >= 18000); > > That's a bug! Should be 18, not 18000. > > == > > And, please see the attached diffs patch, which implements the > nitpicks mentioned above. These are handled in the v20240729 version attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com Regards, Vignesh
Re: Logical Replication of sequences
On Fri, 26 Jul 2024 at 11:46, Peter Smith wrote: > > Hi Vignesh, > > There are still pending changes from my previous review of the > 0720-0003 patch [1], but here are some new review comments for your > latest patch v20240525-0003. > 2b. > Is it better to name these returned by-ref ptrs like 'ret_log_cnt', > and 'ret_lsn' to emphasise they are output variables? YMMV. I felt this is ok as it is already mentioned in the function header. > == > src/test/subscription/t/034_sequences.pl > > 4. > Q. Should we be suspicious that log_cnt changes from '32' to '31', or > is there a valid explanation? It smells like some calculation is > off-by-one, but without debugging I can't tell if it is right or > wrong. It works like this: after every 33rd nextval call, log_cnt becomes 0. So after 33 * 6 = 198 calls log_cnt will be 0, then after call 199 it will be 32 and after call 200 it will be 31. This pattern repeats, so this is ok. These are handled in the v20240729 version attached at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com Regards, Vignesh
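A toy model of the behaviour described above (this is not sequence.c; it just assumes 32 values are WAL-logged ahead each time the prefetched values run out) reproduces the 198 -> 0, 199 -> 32, 200 -> 31 pattern:

#include <stdio.h>

#define SEQ_LOG_VALS 32         /* values logged ahead per WAL record */

int
main(void)
{
    int log_cnt = 0;
    int is_called = 0;

    for (int n = 1; n <= 200; n++)
    {
        /* nextval: WAL-log ahead when the prefetched values are used up */
        if (!is_called || log_cnt == 0)
        {
            log_cnt = SEQ_LOG_VALS + 1;
            is_called = 1;
        }
        log_cnt--;              /* consume one value */

        if (n >= 198)
            printf("nextval #%d -> log_cnt = %d\n", n, log_cnt);
    }
    return 0;
}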