Re: Problem, partition pruning for prepared statement with IS NULL clause.
On Wed, 11 Oct 2023 at 22:59, Sergei Glukhov wrote:
> > Unfortunately, I'd not long sent the last email and realised that the
> > step_lastkeyno parameter is now unused and can just be removed from
> > both get_steps_using_prefix() and get_steps_using_prefix_recurse().
> > This requires some comment rewriting so I've attempted to do that too
> > in the attached updated version.
>
> Thanks, verified again and everything is fine!

Thanks for looking.

I spent quite a bit more time on this again today and fiddled lots more
with the comments and tests.

I also did more testing after finding a way to easily duplicate the
quals to cause multiple quals per partition key. The equivalence class
code will only make ECs for mergejoin-able clauses, so if we just find a
type that's not mergejoin-able but hashable, we can duplicate the quals
with a hash partitioned table

-- find a suitable non-mergejoin-able type.
select oprleft::regtype from pg_operator where oprcanmerge=false and oprcanhash=true;
 oprleft
---------
 xid
 cid
 aclitem

create table hash_xid(a xid, b xid, c xid) partition by hash(a,b,c);
create table hash_xid1 partition of hash_xid for values with (modulus 2, remainder 0);
create table hash_xid2 partition of hash_xid for values with (modulus 2, remainder 1);

I tried out various combinations of the following query. Each equality
clause is duplicated 6 times. When I enable all 6 for each of the 3
columns, I see 216 pruning steps. That's 6*6*6, just what I expected.
The IS NULL quals are not duplicated since we can only set a bit once in
the nullkeys.

explain select * from hash_xid
where
a = '123'::xid and a = '123'::xid and a = '123'::xid and
a = '123'::xid and a = '123'::xid and a = '123'::xid and
--a is null and a is null and a is null and a is null and a is null and a is null and
b = '123'::xid and b = '123'::xid and b = '123'::xid and
b = '123'::xid and b = '123'::xid and b = '123'::xid and
--b is null and b is null and b is null and b is null and b is null and b is null and
c = '123'::xid and c = '123'::xid and c = '123'::xid and
c = '123'::xid and c = '123'::xid and c = '123'::xid;
--c is null and c is null and c is null and c is null and c is null and c is null;

putting a breakpoint at the final line of gen_prune_steps_from_opexps()
yields 216 steps.

I didn't include anything of the above as part of the additional tests.
Perhaps something like that is worthwhile in a reduced form. However,
someone might make xid mergejoinable some time, which would break the
test.

Thanks for reviewing the previous version of this patch.

Onto the other run-time one now...

David
Re: New WAL record to detect the checkpoint redo location
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote: > - I combined what were previously 0002 and 0003 into a single patch, > since that's how this would get committed. > > - I fixed up some comments. > > - I updated commit messages. > > Hopefully this is getting close to good enough. I have looked at 0001, for now.. And it looks OK to me. +* Nonetheless, this case is simpler than the normal cases handled +* above, which must check for changes in doPageWrites and RedoRecPtr. +* Those checks are only needed for records that can contain +* full-pages images, and an XLOG_SWITCH record never does. +Assert(fpw_lsn == InvalidXLogRecPtr); Right, that's the core reason behind the refactoring. The assertion is a good idea. -- Michael signature.asc Description: PGP signature
Re: A new strategy for pull-up correlated ANY_SUBLINK
Hi Alena, On Thu, Oct 12, 2023 at 5:01 AM Alena Rybakina wrote: > Hi! > > I reviewed your patch and it was interesting for me! > > Thank you for the explanation. It was really informative for me! > Thanks for your interest in this, and I am glad to know it is informative. > Unfortunately, I found a request when sublink did not pull-up, as in the > examples above. I couldn't quite figure out why. > I'm not sure what you mean with the "above", I guess it should be the "below"? > explain (analyze, costs off, buffers) > select b.x, b.x, a.y > from b > left join a > on b.x=a.x and > > *b.t in (select max(a0.t) * > from a a0 > where a0.x = b.x and >a0.t = b.t); > ... >SubPlan 2 > Here the sublink can't be pulled up because of its reference to the LHS of left join, the original logic is that no matter the 'b.t in ..' returns the true or false, the rows in LHS will be returned. If we pull it up to LHS, some rows in LHS will be filtered out, which breaks its original semantics. I thought it would be: > > explain (analyze, costs off, buffers) > select b.x, b.x, a.y > from b > left join a on > b.x=a.x and > > *b.t = (select max(a0.t) * > from a a0 > where a0.x = b.x and >a0.t <= b.t); > QUERY > PLAN > > - > Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1) >Hash Cond: (a.x = b.x) >*Join Filter: (b.t = (SubPlan 2))* >Buffers: shared hit=3546 >-> Seq Scan on a (actual time=0.022..17.109 rows=10 loops=1) > Buffers: shared hit=541 >-> Hash (actual time=1.065..1.068 rows=1000 loops=1) > Buckets: 4096 Batches: 1 Memory Usage: 72kB > Buffers: shared hit=5 > -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1) >Buffers: shared hit=5 >SubPlan 2 > -> Result (actual time=0.025..0.025 rows=1 loops=1000) >Buffers: shared hit=3000 >InitPlan 1 (returns $2) > -> Limit (actual time=0.024..0.024 rows=1 loops=1000) >Buffers: shared hit=3000 >-> Index Only Scan Backward using a_t_x_idx on a a0 > (actual time=0.023..0.023 rows=1 loops=1000) > Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND > (x = b.x)) > Heap Fetches: 1000 > Buffers: shared hit=3000 > Planning Time: 0.689 ms > Execution Time: 68.220 ms > (23 rows) > > If you noticed, it became possible after replacing the "in" operator with > "=". > I didn't notice much difference between the 'in' and '=', maybe I missed something? > I took the liberty of adding this to your patch and added myself as > reviewer, if you don't mind. > Sure, the patch after your modification looks better than the original. I'm not sure how the test case around "because of got one row" is relevant to the current changes. After we reach to some agreement on the above discussion, I think v4 is good for committer to review! -- Best Regards Andy Fan
Some performance degradation in REL_16 vs REL_15
Greetings!

Found that the simple test pgbench -c20 -T20 -j8 gives approximately:
for REL_15_STABLE at 5143f76: 336+-1 TPS
and
for REL_16_STABLE at 4ac7635f: 324+-1 TPS

The performance drop is approximately 3.5%, while the corrected standard deviation is only 0.3%. See the raw_data.txt attached.

What do you think: is there any cause for concern here? And is it worth spending time bisecting for the commit where this degradation may have occurred?

Would be glad for any comments and concerns.

With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

raw_data.txt:
REL_15_STABLE at 5143f76, TPS
336.639765
334.376801
334.963121
336.23666
335.698673

REL_16_STABLE at 4ac7635f, TPS
324.373695
323.168622
323.728652
325.799901
324.81759
Re: Doc: Minor update for enable_partitionwise_aggregate
On Wed, 11 Oct 2023 at 19:38, David Rowley wrote: > > On Wed, 11 Oct 2023 at 16:26, Andrew Atkinson wrote: > > "which allows grouping or aggregation on partitioned tables to be performed > > separately for each partition." > > This looks good to me. I can take care of this. Pushed and backpatched to v11. David
Re: Some performance degradation in REL_16 vs REL_15
On Thu, 12 Oct 2023 at 21:01, Anton A. Melnikov wrote: > > Greetengs! > > Found that simple test pgbench -c20 -T20 -j8 gives approximately > for REL_15_STABLE at 5143f76: 336+-1 TPS > and > for REL_16_STABLE at 4ac7635f: 324+-1 TPS > > And is it worth spending time bisecting for the commit where this degradation > may have occurred? It would be interesting to know what's to blame here and if you can attribute it to a certain commit. David
Re: Tab completion for AT TIME ZONE
Michael Paquier writes: > On Fri, Apr 14, 2023 at 12:05:25PM +0200, Jim Jones wrote: >> The patch applies cleanly and it does what it is proposing. - and it's IMHO >> a very nice addition. >> >> I've marked the CF entry as "Ready for Committer". > > +/* ... AT TIME ZONE ... */ > + else if (TailMatches("AT")) > + COMPLETE_WITH("TIME ZONE"); > + else if (TailMatches("AT", "TIME")) > + COMPLETE_WITH("ZONE"); > + else if (TailMatches("AT", "TIME", "ZONE")) > + COMPLETE_WITH_TIMEZONE_NAME(); > > This style will for the completion of timezone values even if "AT" is > the first word of a query. Shouldn't this be more selective by making > sure that we are at least in the context of a SELECT query? It's valid anywhere an expression is, which is a lot more places than just SELECT queries. Off the top of my head I can think of WITH, INSERT, UPDATE, VALUES, CALL, CREATE TABLE, CREATE INDEX. As I mentioned upthread, the only place in the grammar where the word AT occurs is in AT TIME ZONE, so there's no ambiguity. Also, it doesn't complete time zone names after AT, it completes the literal words TIME ZONE, and you have to then hit tab again to get a list of time zones. If we (or the SQL committee) were to invent more operators that start with the word AT, we can add those to the first if clause above and complete with the appropriate values after each one separately. - ilmari
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Wed, Oct 11, 2023 at 4:27 PM Hayato Kuroda (Fujitsu) wrote:
>
> Thank you for reviewing! PSA new version.
>

Some more comments:

1. Let's restructure binary_upgrade_validate_wal_logical_end() a bit.
First, let's change its name to binary_upgrade_slot_has_pending_wal()
or something like that. Then move the context creation and free
related code into DecodingContextHasDecodedItems(). We can rename
DecodingContextHasDecodedItems() as
pg_logical_replication_slot_has_pending_wal() and place it in
slotfuncs.c. This will make the code structure similar to other slot
functions like pg_replication_slot_advance().

2.
+ * Returns true if there are no changes after the confirmed_flush_lsn.

How about something like: "Returns true if there are no decodable WAL
records after the confirmed_flush_lsn."?

3. Shouldn't we need to call CheckSlotPermissions() in
binary_upgrade_validate_wal_logical_end?

4.
+ /*
+  * Also, set processing_required flag if the message is not
+  * transactional. It is needed to notify the message's existence to
+  * the caller side. Usually, the flag is set when either the COMMIT or
+  * ABORT records are decoded, but this must be turned on here because
+  * the non-transactional logical message is decoded without waiting
+  * for these records.
+  */

The first sentence of the comments doesn't seem to be required as that
just says what the code does. So, let's slightly change it to: "We
need to set processing_required flag to notify the message's existence
to the caller side. Usually, the flag is set when either the COMMIT or
ABORT records are decoded, but this must be turned on here because the
non-transactional logical message is decoded without waiting for these
records."

--
With Regards,
Amit Kapila.
Special-case executor expression steps for common combinations
The attached patch adds special-case expression steps for common sets of steps in the executor to shave a few cycles off during execution, and make the JIT generated code simpler. * Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls of strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains used for > 2 arguments). * Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the common case of one arg aggs. * Replace EEOP_DONE with EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN to be able to skip extra setup for steps which are only interested in the side effects. Stressing the EEOP_FUNCEXPR_STRICT_* steps specifically shows a 1.5% improvement and pgbench over the branch shows a ~1% improvement in TPS (both measured over 6 runs with outliers removed). EEOP_FUNCEXPR_STRICT_* (10M iterations): master : (7503.317, 7553.691, 7634.524) patched : (7422.756, 7455.120, 7492.393) pgbench: master : (3653.83, 3792.97, 3863.70) patched : (3743.04, 3830.02, 3869.80) This patch was extracted from a larger body of work from Andres [0] aiming at providing the necessary executor infrastructure for making JIT expression caching possible. This patch, and more which are to be submitted, is however separate in the sense that it is not part of the infrastructure, it's an improvements on its own. Thoughts? -- Daniel Gustafsson [0]: https://postgr.es/m/20191023163849.sosqbfs5yenoc...@alap3.anarazel.de v1-0001-Add-fast-path-expression-steps-for-common-combina.patch Description: Binary data
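To make the idea above a bit more concrete, here is a rough sketch (not taken from the attached patch, so the step and field names may differ) of what a dedicated interpreter case for a one-argument strict function could look like in execExprInterp.c. With the argument count known at expression-compile time, the per-argument NULL-check loop of the generic EEOP_FUNCEXPR_STRICT case collapses into a single test, and the code the JIT has to emit for the step is correspondingly simpler:

EEO_CASE(EEOP_FUNCEXPR_STRICT_1)
{
    /*
     * Sketch only: strict function known to take exactly one argument,
     * so no loop over the arguments is needed to check for NULLs.
     */
    FunctionCallInfo fcinfo = op->d.func.fcinfo_data;

    if (fcinfo->args[0].isnull)
        *op->resnull = true;        /* strict function, NULL input */
    else
    {
        fcinfo->isnull = false;
        *op->resvalue = op->d.func.fn_addr(fcinfo);
        *op->resnull = fcinfo->isnull;
    }

    EEO_NEXT();
}

The two-argument variant and EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 would presumably follow the same pattern, unrolling the loop that the generic steps need.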
Re: cataloguing NOT NULL constraints
Hi Alvaro,

25.08.2023 14:38, Alvaro Herrera wrote:
I have now pushed this again. Hopefully it'll stick this time.

I've discovered that that commit added several recursive functions, and some of them are not protected from stack overflow.
Namely, with "max_locks_per_transaction = 600" and default ulimit -s (8192), I observe server crashes with the following scripts:

# ATExecSetNotNull()
(n=4;
printf "create table t0 (a int, b int);";
for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done;
printf "alter table t0 alter b set not null;"
) | psql >psql.log

# dropconstraint_internal()
(n=2;
printf "create table t0 (a int, b int not null);";
for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done;
printf "alter table t0 alter b drop not null;"
) | psql >psql.log

# set_attnotnull()
(n=11;
printf "create table tp (a int, b int, primary key(a, b)) partition by range (a); create table tp0 (a int primary key, b int) partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table tp$i partition of tp$(( $i - 1 )) for values from ($i) to (100) partition by range (a);"; done;
printf "alter table tp attach partition tp0 for values from (0) to (100);"
) | psql >psql.log
# this takes half an hour on my machine

Maybe you would find it appropriate to add check_stack_depth() to these functions.

(ATAddCheckNNConstraint() is protected because it calls AddRelationNewConstraints(), which in turn calls StoreRelCheck() -> CreateConstraintEntry() -> recordDependencyOnSingleRelExpr() -> find_expr_references_walker() -> expression_tree_walker() -> expression_tree_walker() -> check_stack_depth().)

(There were patches prepared for similar cases [1], but they don't cover new functions, of course, and I'm not sure how to handle all such instances.)

[1] https://commitfest.postgresql.org/45/4239/

Best regards,
Alexander
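For reference, the guard being suggested is the usual one-liner at the top of each function that can recurse per child relation; a minimal sketch of the pattern follows (the placement in, e.g., set_attnotnull() is hypothetical and its actual parameters are elided):

#include "miscadmin.h"          /* check_stack_depth() */

static void
set_attnotnull(/* ... existing parameters ... */)
{
    /*
     * Guard against runaway recursion through deep inheritance or
     * partition trees: raise the normal "stack depth limit exceeded"
     * ERROR instead of overflowing the C stack and crashing.
     */
    check_stack_depth();

    /* ... existing per-relation work, recursing into each child ... */
}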
Re: Logging parallel worker draught
On 10/11/23 17:26, Imseih (AWS), Sami wrote: Thank you for resurrecting this thread. Well, if you read Benoit's earlier proposal at [1] you'll see that he does propose to have some cumulative stats; this LOG line he proposes here is not a substitute for stats, but rather a complement. I don't see any reason to reject this patch even if we do get stats. I believe both cumulative statistics and logs are needed. Logs excel in pinpointing specific queries at precise times, while statistics provide a broader overview of the situation. Additionally, I often encounter situations where clients lack pg_stat_statements and can't restart their production promptly. Regarding the current patch, the latest version removes the separate GUC, but the user should be able to control this behavior. I created this patch in response to Amit Kapila's proposal to keep the discussion ongoing. However, I still favor the initial version with the GUCs. Query text is logged when log_min_error_statement > default level of "error". This could be especially problematic when there is a query running more than 1 Parallel Gather node that is in draught. In those cases each node will end up generating a log with the statement text. So, a single query execution could end up having multiple log lines with the statement text. ... I wonder if it will be better to accumulate the total # of workers planned and # of workers launched and logging this information at the end of execution? log_temp_files exhibits similar behavior when a query involves multiple on-disk sorts. I'm uncertain whether this is something we should or need to address. I'll explore whether the error message can be made more informative. [local]:5437 postgres@postgres=# SET work_mem to '125kB'; [local]:5437 postgres@postgres=# SET log_temp_files TO 0; [local]:5437 postgres@postgres=# SET client_min_messages TO log; [local]:5437 postgres@postgres=# WITH a AS ( SELECT x FROM generate_series(1,1) AS F(x) ORDER BY 1 ) , b AS (SELECT x FROM generate_series(1,1) AS F(x) ORDER BY 1 ) SELECT * FROM a,b; LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp138850.20", size 122880 => First sort LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp138850.19", size 14 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp138850.23", size 14 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp138850.22", size 122880 => Second sort LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp138850.21", size 14 -- Benoit Lobréau Consultant http://dalibo.com
Re: Use virtual tuple slot for Unique node
On Tue, Oct 10, 2023 at 2:23 PM David Rowley wrote: > > On Wed, 27 Sept 2023 at 20:01, David Rowley wrote: > > > > On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas wrote: > > > So not a win in this case. Could you peek at the outer slot type, and > > > use the same kind of slot for the Unique's result? Or some more > > > complicated logic, like use a virtual slot if all the values are > > > pass-by-val? I'd also like to keep this simple, though... > > > > > > Would this kind of optimization make sense elsewhere? > > > > There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin(). > > > > If I adjust the patch to: > > > > - ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple); > > + ExecInitResultTupleSlotTL(&uniquestate->ps, > > + > > ExecGetResultSlotOps(outerPlanState(uniquestate), > > + > > NULL)); > > Just to keep this from going cold, here's that in patch form for > anyone who wants to test. Thanks. I don't recollect why we chose MinimalTupleSlot here - may be because we expected the underlying node to always produce a minimal tupe. But Unique node copies the tuple returned by the underlying node. This copy is carried out by the TupleTableSlot specific copy function copyslot. Every implementation of this function first converts the source slot tuple into the required form and then copies it. Having both the TupleTableSlots, ouput slot from the underlying node and the output slot of Unique node, of the same type avoids the first step and just copies the slot. It makes sense that it performs better. The code looks fine to me. > > I spent a bit more time running some more benchmarks and I don't see > any workload where it slows things down. I'd be happy if someone else > had a go at finding a regression. I built on your experiments and I might have found a minor regression. Setup = drop table if exists t_int; create table t_int(a int, b int); insert into t_int select x, x from generate_series(1,100)x; create index on t_int (a,b); vacuum analyze t_int; drop table if exists t_text; create table t_text(a text, b text); insert into t_text select lpad(x::text, 1000, '0'), x::text from generate_series(1,100)x; create index on t_text (a,b); vacuum analyze t_text; drop table if exists t_mixed; -- this one is new but it doesn't matter much create table t_mixed(a text, b int); insert into t_mixed select lpad(x::text, 1000, '0'), x from generate_series(1,100)x; create index on t_mixed (a,b); vacuum analyze t_mixed; Queries and measurements (average execution time from 3 runs - on my Thinkpad T490) == Q1 select distinct a,b from t_int'; HEAD: 544.45 ms patched: 381.55 ms Q2 select distinct a,b from t_text HEAD: 609.90 ms patched: 513.42 ms Q3 select distinct a,b from t_mixed HEAD: 626.80 ms patched: 468.22 ms The more the pass by ref data, more memory is allocated which seems to reduce the gain by this patch. Above nodes use Buffer or HeapTupleTableSlot. Try some different nodes which output minimal or virtual TTS. 
set enable_hashagg to off; Q4 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int) q HEAD: 2529.58 ms patched: 2332.23 Q5 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int order by a, b) q HEAD: 2633.69 ms patched: 2255.99 ms Q6 select distinct a,b from (select string_agg(a, ', ') over (order by a rows 2 preceding) a, b from t_text) q HEAD: 108589.85 ms patched: 107226.82 ms Q7 select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q HEAD: 16070.62 ms patched: 16182.16 ms This one is surprising though. May be the advantage of using the same tuple table slot is so narrow when large data needs to be copied that the execution times almost match. The patched and unpatched execution times differ by the margin of error either way. -- Best Wishes, Ashutosh Bapat
Re: Special-case executor expression steps for common combinations
On 12/10/2023 12:48, Daniel Gustafsson wrote: The attached patch adds special-case expression steps for common sets of steps in the executor to shave a few cycles off during execution, and make the JIT generated code simpler. * Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls of strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains used for > 2 arguments). * Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the common case of one arg aggs. Are these relevant when JITting? I'm a little sad if the JIT compiler cannot unroll these on its own. Is there something we could do to hint it, so that it could treat the number of arguments as a constant? I understand that this can give a small boost in interpreter mode, so maybe we should do it in any case. But I'd like to know if we're missing a trick with the JITter, before we mask it with this. * Replace EEOP_DONE with EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN to be able to skip extra setup for steps which are only interested in the side effects. I'm a little surprised if this makes a measurable performance difference, but sure, why not. It seems nice to be more explicit when you don't expect a return value. -- Heikki Linnakangas Neon (https://neon.tech)
Re: [PATCH] Compression dictionaries for JSONB
Hi hackers,

I would like to continue discussing compression dictionaries.

> So I summarized the requirements we agreed on so far and ended up with
> the following list:
[...]

Again, here is the summary of our current agreements, at least how I understand them. Please feel free to correct me where I'm wrong.

We are going to focus on supporting the:

```
SET COMPRESSION lz4 [WITH|WITHOUT] DICTIONARY
```

... syntax for now. From the UI perspective the rest of the agreements didn't change compared to the previous summary.

In the [1] discussion (cc: Robert) we agreed to use va_tag != 18 for the on-disk TOAST pointer representation to make TOAST pointers extendable. If va_tag has a different value (currently it's always 18), the TOAST pointer is followed by a utf8-like varint bitmask. This bitmask determines the rest of the content of the TOAST pointer and its overall size. This will allow us to extend TOAST pointers to include dictionary_id and also to extend them in the future, e.g. to support ZSTD and other compression algorithms, use 64-bit TOAST pointers, etc.

Several things occurred to me:

- Does anyone believe that va_tag should be part of the utf8-like bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not going to be used when data is compressed in-place (i.e. within a tuple), since no TOAST pointer is involved in this case. Also we will be unable to add additional compression algorithms here. Does anyone have problems with this? Should we use the reserved compression algorithm id instead as a marker of an extended TOAST?

- It would be nice to decompose the feature into several independent patches, e.g. modify TOAST first, then add compression dictionaries without automatic update of the dictionaries, then add the automatic update. I find it difficult to imagine, however, how to modify TOAST pointers and test the code properly without a dependency on a larger feature. Could anyone think of a trivial test case for extendable TOAST? Maybe something we could add to src/test/modules similarly to how we test SLRU, background workers, etc.

[1]: https://www.postgresql.org/message-id/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com

--
Best regards,
Aleksander Alekseev
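To illustrate the "utf8-like varint bitmask" part of the proposal (a conceptual sketch only, not the proposed on-disk format): the number of leading 1-bits in the first byte can encode how many continuation bytes follow, so even a reader that does not understand newer flag bits can still compute the total size of the extended header, while the remaining bits mark which optional fields (for example a dictionary_id) are present:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Conceptual sketch only -- not the actual TOAST pointer layout.
 * The count of leading 1-bits in the first bitmask byte gives the number
 * of continuation bytes, so the overall length of the extended header is
 * computable without knowing what every flag bit means.
 */
static size_t
extended_hdr_size(const uint8_t *bitmask)
{
    uint8_t b = bitmask[0];
    size_t  nfollow = 0;

    while (b & 0x80)
    {
        nfollow++;
        b <<= 1;
    }
    return 1 + nfollow;         /* first byte plus continuation bytes */
}

/* Hypothetical flag bit: the pointer carries a dictionary_id field. */
#define XTP_HAS_DICTIONARY_ID   0x01

static bool
has_dictionary_id(const uint8_t *bitmask)
{
    return (bitmask[0] & XTP_HAS_DICTIONARY_ID) != 0;
}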
Re: Server crash on RHEL 9/s390x platform against PG16
Here is clang version: [edb@9428da9d2137]$ clang --version clang version 15.0.7 (Red Hat 15.0.7-2.el9) Target: s390x-ibm-linux-gnu Thread model: posix InstalledDir: /usr/bin Let me know if any further information is needed. On Mon, Oct 9, 2023 at 8:21 AM Suraj Kharage wrote: > It looks like an issue with JIT. If I disable the JIT then the above query > runs successfully. > > postgres=# set jit to off; > > SET > > postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON > rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON > rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden; > > pkey | val | pkey | label | hidden | pkey | val | pkey > > --+--+--+-++--+-+-- > > 1 | row1 |1 | hidden | t |1 | 1 | > > 1 | row1 |1 | hidden | t |2 | 1 | > > 2 | row2 |2 | visible | f |1 | 1 | > > 2 | row2 |2 | visible | f |2 | 1 | > > (4 rows) > > Any idea on this? > > On Mon, Sep 18, 2023 at 11:20 AM Suraj Kharage < > suraj.khar...@enterprisedb.com> wrote: > >> Few more details on this: >> >> (gdb) p val >> $1 = 0 >> (gdb) p i >> $2 = 3 >> (gdb) f 3 >> #3 0x01a1ef70 in ExecCopySlotMinimalTuple (slot=0x202e4f8) at >> ../../../../src/include/executor/tuptable.h:472 >> 472 return slot->tts_ops->copy_minimal_tuple(slot); >> (gdb) p *slot >> $3 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 8, tts_ops = >> 0x1b6dcc8 , tts_tupleDescriptor = 0x202e0e8, tts_values = >> 0x202e540, tts_isnull = 0x202e580, tts_mcxt = 0x1f54550, tts_tid = >> {ip_blkid = {bi_hi = 65535, >> bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0} >> (gdb) p *slot->tts_tupleDescriptor >> $2 = {natts = 8, tdtypeid = 2249, tdtypmod = -1, tdrefcount = -1, constr >> = 0x0, attrs = 0x202cd28} >> >> (gdb) p slot.tts_values[3] >> $4 = 0 >> (gdb) p slot.tts_values[2] >> $5 = 1 >> (gdb) p slot.tts_values[1] >> $6 = 34027556 >> >> >> As per the resultslot, it has 0 value for the third attribute (column >> lable). >> Im testing this on the docker container and facing some issues with gdb >> hence could not able to debug it further. 
>> >> Here is a explain plan: >> >> postgres=# explain (verbose, costs off) SELECT * FROM rm32044_t1 LEFT >> JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN >> rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by >> rm32044_t1.pkey,label,hidden; >> >> QUERY PLAN >> >> >> - >> Incremental Sort >>Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, >> rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, >> rm32044_t4.pkey >>Sort Key: rm32044_t1.pkey, rm32044_t2.label, rm32044_t2.hidden >>Presorted Key: rm32044_t1.pkey >>-> Merge Left Join >> Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, >> rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, >> rm32044_t4.pkey >> Merge Cond: (rm32044_t1.pkey = rm32044_t2.pkey) >> -> Sort >>Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey, >> rm32044_t1.pkey, rm32044_t1.val >>Sort Key: rm32044_t1.pkey >>-> Nested Loop >> Output: rm32044_t3.pkey, rm32044_t3.val, >> rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val >> -> Merge Left Join >>Output: rm32044_t3.pkey, rm32044_t3.val, >> rm32044_t4.pkey >>Merge Cond: (rm32044_t3.pkey = rm32044_t4.pkey) >>-> Sort >> Output: rm32044_t3.pkey, rm32044_t3.val >> Sort Key: rm32044_t3.pkey >> -> Seq Scan on public.rm32044_t3 >>Output: rm32044_t3.pkey, >> rm32044_t3.val >>-> Sort >> Output: rm32044_t4.pkey >> Sort Key: rm32044_t4.pkey >> -> Seq Scan on public.rm32044_t4 >>Output: rm32044_t4.pkey >> -> Materialize >>Output: rm32044_t1.pkey, rm32044_t1.val >>-> Seq Scan on public.rm32044_t1 >> Output: rm32044_t1.pkey, rm32044_t1.val >> -> Sort >>Output: rm32044_t2.pkey, rm32044_t2.label, >> rm32044_t2.hidden >>Sort Key: rm32044_t2.pkey >>-> Seq Scan on public.rm32044_t2 >> Output: rm32044_t2.pkey, rm32044_t2.label, >> rm32044_t2.hidden >> (34 rows) >> >> >> It seems like while building the innerslot for merge join, the value for >> attnum 1 is no
Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)
Hello hackers, While investigating the recent skink failure [1], I've reproduced this failure under Valgrind on a slow machine and found that this happens due to the last checkpoint recorded in the segment 2, that is removed in the test: The failure log contains: 2023-10-10 19:10:08.212 UTC [2144251][startup][:0] LOG: invalid checkpoint record 2023-10-10 19:10:08.214 UTC [2144251][startup][:0] PANIC: could not locate a valid checkpoint record The line above: [19:10:02.701](318.076s) ok 1 - 00010001 differs from 00010002 tells us about the duration of previous operations (> 5 mins). src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log: 2023-10-10 19:04:50.149 UTC [1845798][postmaster][:0] LOG: database system is ready to accept connections ... 2023-10-10 19:09:49.131 UTC [1847585][checkpointer][:0] LOG: checkpoint starting: time ... 2023-10-10 19:10:02.058 UTC [1847585][checkpointer][:0] LOG: checkpoint complete: ... lsn=0/*2093980*, redo lsn=0/1F62760 And here is one more instance of this failure [2]: 2022-11-08 02:35:25.826 UTC [1614205][][:0] PANIC: could not locate a valid checkpoint record 2022-11-08 02:35:26.164 UTC [1612967][][:0] LOG: startup process (PID 1614205) was terminated by signal 6: Aborted src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log: 2022-11-08 02:29:57.961 UTC [1546469][][:0] LOG: database system is ready to accept connections ... 2022-11-08 02:35:10.764 UTC [1611737][][2/10:0] LOG: statement: SELECT pg_walfile_name(pg_current_wal_insert_lsn()) 2022-11-08 02:35:11.598 UTC [1546469][][:0] LOG: received immediate shutdown request The next successful run after the failure [1] shows the following duration: [21:34:48.556](180.150s) ok 1 - 00010001 differs from 00010002 And the last successful run: [03:03:53.892](126.206s) ok 1 - 00010001 differs from 00010002 So to fail on the test, skink should perform at least twice slower than usual, and may be it's an extraordinary condition indeed, but on the other hand, may be increase checkpoint_timeout as already done in several tests (015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...). [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11 Best regards, Alexander
pg_upgrade's interaction with pg_resetwal seems confusing
In pg_upgrade, we reset WAL archives (remove WAL), transaction id, etc. in copy_xact_xlog_xid() for the new cluster. Then, we create new objects in the new cluster, and again towards the end of the upgrade we invoke pg_resetwal with the -o option to reset the next OID. Now, along with resetting the OID, pg_resetwal will again reset the WAL. I am not sure if that is intentional and it may not have any problem today except that it seems redundant to reset WAL again. However, this can be problematic for the ongoing work to upgrade the logical replication slots [1]. We want to create/migrate logical slots in the new cluster before calling the final pg_resetwal (which resets the next OID) to ensure that there is no new WAL inserted by background processes or otherwise between resetwal location and creation of slots. So, we thought that we would compute the next WAL location by doing the computation similar to what pg_resetwal does to reset a new WAL location, create slots using that location, and pass the same to pg_resetwal using the -l option. However, that doesn't work because pg_resetwal uses the passed -l option only as a hint but can reset the later WAL if present which can remove the WAL position we have decided as restart_lsn (point to start reading WAL) for slots. So, we came up with another idea that we will reset the WAL just before creating slots and use that location to create slots and then invent a new option in pg_resetwal where it won't reset the WAL. Now, as mentioned in the first paragraph, it seems we anyway don't need to reset the WAL at the end when setting the next OID for the new cluster with the -o option. If that is true, then I think even without slots work it will be helpful to have such an option in pg_resetwal. Thoughts? [1] - https://commitfest.postgresql.org/45/4273/ -- With Regards, Amit Kapila.
Re: Removing unneeded self joins
On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote: > On 4/10/2023 14:34, Alexander Korotkov wrote: > > > Relid replacement machinery is the most contradictory code here. We used > > > a utilitarian approach and implemented a simplistic variant. > > > > > > 2) It would be nice to skip the insertion of IS NOT NULL checks when > > > > they are not necessary. [1] points that infrastructure from [2] might > > > > be useful. The patchset from [2] seems committed mow. However, I > > > > can't see it is directly helpful in this matter. Could we just skip > > > > adding IS NOT NULL clause for the columns, that have > > > > pg_attribute.attnotnull set? > > > Thanks for the links, I will look into that case. > To be more precise, in the attachment, you can find a diff to the main > patch, which shows the volume of changes to achieve the desired behaviour. > Some explains in regression tests shifted. So, I've made additional tests: > > DROP TABLE test CASCADE; > CREATE TABLE test (a int, b int not null); > CREATE UNIQUE INDEX abc ON test(b); > explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) > WHERE t1.b=t2.b; > CREATE UNIQUE INDEX abc1 ON test(a,b); > explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) > WHERE t1.b=t2.b; > explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) > WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); > DROP INDEX abc1; > explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) > WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); > > We have almost the results we wanted to have. But in the last explain > you can see that nothing happened with the OR clause. We should use the > expression mutator instead of walker to handle such clauses. But It > doesn't process the RestrictInfo node ... I'm inclined to put a solution > of this issue off for a while. OK. I think it doesn't worth to eliminate IS NULL quals with this complexity (at least at this stage of work). I made improvements over the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Also, I found that the suggestion from David Rowley [1] to qsort array of relations to faster find duplicates is still unaddressed. I've implemented it. That helps to evade quadratic complexity with large number of relations. Also I've incorporated improvements from Alena Rybakina except one for skipping SJ removal when no SJ quals is found. It's not yet clear for me if this check fix some cases. But at least optimization got skipped in some useful cases (as you can see in regression tests). Links 1. https://www.postgresql.org/message-id/CAKJS1f8ySSsBfooH3bJK7OD3LBEbDb99d8J_FtqDd6w50p-eAQ%40mail.gmail.com 2. https://www.postgresql.org/message-id/96f66ae3-df10-4060-9844-4c9633062cd3%40yandex.ru -- Regards, Alexander Korotkov 0001-Remove-useless-self-joins-v44.patch Description: Binary data
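As a side note, the sort-then-scan approach mentioned above is the standard way to avoid pairwise comparison; a generic sketch of the idea (not the patch code), assuming the candidates can be represented as plain integer indexes:

#include <stdbool.h>
#include <stdlib.h>

/* qsort comparator for an array of relation indexes */
static int
cmp_relid(const void *a, const void *b)
{
    int     ra = *(const int *) a;
    int     rb = *(const int *) b;

    return (ra > rb) - (ra < rb);
}

/*
 * Sort once (O(n log n)), then only adjacent entries need to be compared
 * to detect duplicates, instead of checking every pair (O(n^2)).
 */
static bool
contains_duplicate(int *relids, int n)
{
    qsort(relids, n, sizeof(int), cmp_relid);

    for (int i = 1; i < n; i++)
    {
        if (relids[i] == relids[i - 1])
            return true;
    }
    return false;
}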
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Thanks for your suggestion! PSA new version. > The other problem is that pg_resetwal removes all pre-existing WAL > files which in this case could lead to the removal of the WAL file > corresponding to restart_lsn. This is because at least the shutdown > checkpoint record will be written after the creation of slots which > could be in the new file used for restart_lsn. Then when we invoke > pg_resetwal, it can remove that file. > > One idea to deal with this could be to do the reset WAL stuff > (FindEndOfXLOG(), KillExistingXLOG(), KillExistingArchiveStatus(), > WriteEmptyXLOG()) in a separate function (say in pg_upgrade) and then > create slots. If we do this, then we additionally need an option in > pg_resetwal which skips resetting the WAL as that would have been done > before creating the slots. Based on above idea, I made new version patch which some functionalities were exported from pg_resetwal. In this approach, pg_upgrade itself removed WALs and then create logical slots, then pg_resetwal would be called with new option --no-switch, which avoid to switch a WAL segment file. The option is only used for the upgrading purpose so it is not written in doc and usage(). This option is not required if pg_resetwal -o does not discard WAL records. Please see the fork thread [1]. We do not have to reserve future restart_lsn while creating a slot, so the binary function binary_upgrade_create_logical_replication_slot() was removed. Another advantage of this approach is to avoid calling pg_log_standby_snapshot() after the pg_resetwal. This was needed because of two reasons, but they were resolved automatically. 1) pg_resetwal removes all WAL files. 2) Logical slots requires a RUNNING_XACTS record for building a snapshot. [1]: https://www.postgresql.org/message-id/CAA4eK1KRyPMiY4fW98qFofsYrPd87Oc83zDNxSeHfTYh_asdBg%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED v49-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch Description: v49-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Thanks for reviewing! New patch is available at [1]. > > Some more comments: > 1. Let's restruture binary_upgrade_validate_wal_logical_end() a bit. > First, let's change its name to binary_upgrade_slot_has_pending_wal() > or something like that. Then move the context creation and free > related code into DecodingContextHasDecodedItems(). We can rename > DecodingContextHasDecodedItems() as > pg_logical_replication_slot_has_pending_wal() and place it in > slotfuncs.c. This will make the code structure similar to other slot > functions like pg_replication_slot_advance(). Seems clearer than mine. Fixed. > 2. + * Returns true if there are no changes after the confirmed_flush_lsn. > > How about something like: "Returns true if there are no decodable WAL > records after the confirmed_flush_lsn."? Fixed. > 3. Shouldn't we need to call CheckSlotPermissions() in > binary_upgrade_validate_wal_logical_end? Added, but actually it is not needed. This is because only superusers can connect to the server while upgrading. Please see below codes in InitPostgres(). ``` if (IsBinaryUpgrade && !am_superuser) { ereport(FATAL, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to connect in binary upgrade mode"))); } ``` > 4. > + /* > + * Also, set processing_required flag if the message is not > + * transactional. It is needed to notify the message's existence to > + * the caller side. Usually, the flag is set when either the COMMIT or > + * ABORT records are decoded, but this must be turned on here because > + * the non-transactional logical message is decoded without waiting > + * for these records. > + */ > > The first sentence of the comments doesn't seem to be required as that > just says what the code does. So, let's slightly change it to: "We > need to set processing_required flag to notify the message's existence > to the caller side. Usually, the flag is set when either the COMMIT or > ABORT records are decoded, but this must be turned on here because the > non-transactional logical message is decoded without waiting for these > records." Fixed. [1]: https://www.postgresql.org/message-id/TYAPR01MB5866B0614F80CE9F5EF051BDF5D3A%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: Special-case executor expression steps for common combinations
On Thu, 12 Oct 2023 at 22:54, Daniel Gustafsson wrote: > EEOP_FUNCEXPR_STRICT_* (10M iterations): > master : (7503.317, 7553.691, 7634.524) > patched : (7422.756, 7455.120, 7492.393) > > pgbench: > master : (3653.83, 3792.97, 3863.70) > patched : (3743.04, 3830.02, 3869.80) > > Thoughts? Did any of these tests compile the expression with JIT? If not, how does the performance compare for a query that JITs the expression? David
Re: Problem, partition pruning for prepared statement with IS NULL clause.
On Mon, 9 Oct 2023 at 12:26, David Rowley wrote: > > On Sat, 7 Oct 2023 at 03:11, Sergei Glukhov wrote: > > I noticed that combination of prepared statement with generic plan and > > 'IS NULL' clause could lead partition pruning to crash. > > > Test case: > > -- > > set plan_cache_mode to force_generic_plan; > > prepare stmt AS select * from hp where a is null and b = $1; > > explain execute stmt('xxx'); > > Thanks for the detailed report and proposed patch. > > I think your proposed fix isn't quite correct. I think the problem > lies in InitPartitionPruneContext() where we assume that the list > positions of step->exprs are in sync with the keyno. If you look at > perform_pruning_base_step() the code there makes a special effort to > skip over any keyno when a bit is set in opstep->nullkeys. I've now also pushed the fix for the incorrect logic for nullkeys in ExecInitPruningContext(). I didn't quite find a test to make this work for v11. I tried calling execute 5 times as we used to have to before the plan_cache_mode GUC was added in v12, but the test case kept picking the custom plan. So I ended up pushing v11 without any test. This goes out of support in ~1 month, so I'm not too concerned about the lack of test. I did do a manual test to ensure it works with: create table hp (a int, b text, c int) partition by hash (a, b); create table hp0 partition of hp for values with (modulus 4, remainder 0); create table hp3 partition of hp for values with (modulus 4, remainder 3); create table hp1 partition of hp for values with (modulus 4, remainder 1); create table hp2 partition of hp for values with (modulus 4, remainder 2); prepare hp_q1 (text) as select * from hp where a is null and b = $1; (set breakpoint in choose_custom_plan() and have it return false when we hit it.) explain (costs off) execute hp_q1('xxx'); David
Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar wrote: > > On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar wrote: > In my last email, I forgot to give the link from where I have taken > the base path for dividing the buffer pool in banks so giving the same > here[1]. And looking at this again it seems that the idea of that > patch was from > Andrey M. Borodin and the idea of the SLRU scale factor were > introduced by Yura Sokolov and Ivan Lazarev. Apologies for missing > that in the first email. > > [1] https://commitfest.postgresql.org/43/2627/ In my last email I have just rebased the base patch, so now while reading through that patch I realized that there was some refactoring needed and some unused functions were there so I have removed that and also added some comments. Also did some refactoring to my patches. So reposting the patch series. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v2-0002-bank-wise-slru-locks.patch Description: Binary data v2-0003-Introduce-bank-wise-LRU-counter.patch Description: Binary data v2-0001-Divide-SLRU-buffers-into-banks.patch Description: Binary data
Re: RFC: Logging plan of the running query
On 2023-10-11 16:22, Ashutosh Bapat wrote: Like many others I think this feature is useful to debug a long running query. Sorry for jumping late into this. I have a few of high level comments Thanks for your comments! There is a lot of similarity between what this feature does and what auto explain does. I see the code is also duplicated. There is some merit in avoiding this duplication 1. we will get all the features of auto_explain automatically like choosing a format (this was expressed somebody earlier in this thread), setings etc. 2. avoid bugs. E.g your code switches context after ExplainState has been allocated. These states may leak depending upon when this function gets called. 3. Building features on top as James envisions will be easier. Considering the similarity with auto_explain I wondered whether this function should be part of auto_explain contrib module itself? If we do that users will need to load auto_explain extension and thus install executor hooks when this function doesn't need those. So may not be such a good idea. I didn't see any discussion on this. I once thought about adding this to auto_explain, but I left it asis for below reasons: - One of the typical use case of pg_log_query_plan() would be analyzing slow query on customer environments. On such environments, We cannot always control what extensions to install. Of course auto_explain is a major extension and it is quite possible that they installed auto_explain, but but it is also possible they do not. - It seems a bit counter-intuitive that pg_log_query_plan() is in an extension called auto_explain, since it `manually`` logs plans I tried following query to pass PID of a non-client backend to this function. #select pg_log_query_plan(pid), application_name, backend_type from pg_stat_activity where backend_type = 'autovacuum launcher'; pg_log_query_plan | application_name |backend_type ---+--+- t | | autovacuum launcher (1 row) I see "LOG: backend with PID 2733631 is not running a query or a subtransaction is aborted" in server logs. That's ok. But may be we should not send signal to these kinds of backends at all, thus avoiding some system calls. Agreed, it seems better. Attached patch checks if the backendType of target process is 'client backend'. =# select pg_log_query_plan(pid), application_name, backend_type from pg_stat_activity where backend_type = 'autovacuum launcher'; WARNING: PID 63323 is not a PostgreSQL client backend process pg_log_query_plan | application_name |backend_type ---+--+- f | | autovacuum launcher I am also wondering whether it's better to report the WARNING as status column in the output. E.g. instead of #select pg_log_query_plan(100); WARNING: PID 100 is not a PostgreSQL backend process pg_log_query_plan --- f (1 row) we output #select pg_log_query_plan(100); pg_log_query_plan | status ---+- f | PID 100 is not a PostgreSQL backend process (1 row) That looks neater and can easily be handled by scripts, applications and such. But it will be inconsistent with other functions like pg_terminate_backend() and pg_log_backend_memory_contexts(). It seems neater, but it might be inconvenient because we can no longer use it in select list like the following query as you wrote: #select pg_log_query_plan(pid), application_name, backend_type from pg_stat_activity where backend_type = 'autovacuum launcher'; I do share a concern that was discussed earlier. If a query is running longer, there's something problematic with it. A diagnostic intervention breaking it further would be unwelcome. 
James has run experiments to shake this code for any loose breakages. He has not found any. So may be we are good. And we wouldn't know about very rare corner cases so easily without using it in the field. So fine with it. If we could add some safety net that will be great but may not be necessary for the first cut. If there are candidates for the safety net, I'm willing to add them. -- Regards, -- Atsushi Torikoshi NTT DATA Group CorporationFrom b7902cf43254450cc7831c235982438ea1e5e8b7 Mon Sep 17 00:00:00 2001 From: Atsushi Torikoshi Date: Thu, 12 Oct 2023 22:03:48 +0900 Subject: [PATCH v31] Add function to log the plan of the query Currently, we have to wait for the query execution to finish to check its plan. This is not so convenient when investigating long-running queries on production environments where we cannot use debuggers. To improve this situation, this patch adds pg_log_query_plan() function that requests to log the plan of the specified backend process. By default, only superusers are allowed to request to log the plans because allowing any users to issue
PostgreSQL domains and NOT NULL constraint
Hello PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way how one can add NULL's to a column that has a domain with the NOT NULL constraint. https://www.postgresql.org/docs/current/sql-createdomain.html To me it seems very strange and amounts to a bug because it defeats the purpose of domains (to be a reusable assets) and constraints (to avoid any bypassing of these). Oracle 23c added the support of domains (https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/create-domain.html). I tested the same scenario both in PostgreSQL and Oracle (https://www.oracle.com/database/free/) and found out that in these situations Oracle does not allow NULL's to be added to the column. I do not know as to whether the behaviour that is implemented in PostgreSQL is specified by the standard. However, if it is not the case, then how it could be that Oracle can but PostgreSQL cannot. Best regards Erki Eessaar The scenario that I tested both in PostgreSQL (16) and Oracle (23c). *** /*PostgreSQL 16*/ CREATE DOMAIN d_name VARCHAR(50) NOT NULL; CREATE TABLE Product_state_type (product_state_type_code SMALLINT NOT NULL, name d_name, CONSTRAINT pk_product_state_type PRIMARY KEY (product_state_type_code), CONSTRAINT ak_product_state_type_name UNIQUE (name)); CREATE TABLE Product (product_code INTEGER NOT NULL, name d_name, product_state_type_code SMALLINT NOT NULL, CONSTRAINT pk_product PRIMARY KEY (product_code), CONSTRAINT fk_product_product_state_type FOREIGN KEY (product_state_type_code) REFERENCES Product_state_type(product_state_type_code) ON UPDATE CASCADE); INSERT INTO Product_state_type (product_state_type_code, name) VALUES (1, (SELECT name FROM Product_state_type WHERE FALSE)); /*Insertion succeeds, name is NULL!*/ INSERT INTO Product (product_code, name, product_state_type_code) SELECT 1 AS product_code, Product.name, 1 AS product_state_type_code FROM Product_state_type LEFT JOIN Product USING (product_state_type_code); /*Insertion succeeds, name is NULL!*/ /*Oracle 23c*/ CREATE DOMAIN d_name AS VARCHAR2(50) NOT NULL; CREATE TABLE Product_state_type (product_state_type_code NUMBER(4) NOT NULL, name d_name, CONSTRAINT pk_product_state_type PRIMARY KEY (product_state_type_code), CONSTRAINT ak_product_state_type_name UNIQUE (name)); CREATE TABLE Product (product_code NUMBER(8) NOT NULL, name d_name, product_state_type_code NUMBER(4) NOT NULL, CONSTRAINT pk_product PRIMARY KEY (product_code), CONSTRAINT fk_product_product_state_type FOREIGN KEY (product_state_type_code) REFERENCES Product_state_type(product_state_type_code)); INSERT INTO Product_state_type (product_state_type_code, name) VALUES (1, (SELECT name FROM Product_state_type WHERE FALSE)); /*Fails. Error report - SQL Error: ORA-01400: cannot insert NULL into ("SYSTEM"."PRODUCT_STATE_TYPE"."NAME") Help: https://docs.oracle.com/error-help/db/ora-01400/ 01400. 0 - "cannot insert NULL into (%s)" *Cause:An attempt was made to insert NULL into previously listed objects. *Action: These objects cannot accept NULL values.*/ INSERT INTO Product_state_type (product_state_type_code, name) VALUES (1, 'Active'); INSERT INTO Product (product_code, name, product_state_type_code) SELECT 1 AS product_code, Product.name, 1 AS product_state_type_code FROM Product_state_type LEFT JOIN Product USING (product_state_type_code); /*Fails. SQL Error: ORA-01400: cannot insert NULL into ("SYSTEM"."PRODUCT"."NAME") Help: https://docs.oracle.com/error-help/db/ora-01400/ 01400. 
0 - "cannot insert NULL into (%s)" *Cause:An attempt was made to insert NULL into previously listed objects. *Action: These objects cannot accept NULL values.*/
Re: Lowering the default wal_blocksize to 4K
On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro wrote: > That leaves only the segments where a record starts exactly on the > first usable byte of a segment, which is why I was trying to think of > a way to cover that case too. I suggested we could notice and insert > a new record at that place. But Andres suggests it would be too > expensive and not worth worrying about. Hmm. Even in that case, xl_prev has to match. It's not like it's the wild west. Sure, it's not nearly as good of a cross-check, but it's something. It seems to me that it's not worth worrying very much about xlp_seg_size or xlp_blcksz changing undetected in that scenario - if you're doing that kind of advanced magic, you need to be careful enough to not mess it up, and if we still cross-check once per checkpoint cycle that's pretty good. I do worry a bit about the sysid changing under us, though. It's not that hard to get your WAL archives mixed up, and it'd be nice to catch that right away. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [RFC] Add jit deform_counter
Hi, On Fri, 8 Sept 2023 at 20:22, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Fri, Sep 08, 2023 at 03:34:42PM +0200, Daniel Gustafsson wrote: > > > On 5 Sep 2023, at 16:37, Daniel Gustafsson wrote: > > > > > I've gone over this version of the patch and I think it's ready to go in. > > > I'm > > > marking this Ready for Committer and will go ahead with it shortly > > > barring any > > > objections. > > > > Pushed, after another round of review with some minor fixes. I realized that pg_stat_statements is bumped to 1.11 with this patch but oldextversions test is not updated. So, I attached a patch for updating oldextversions. Regards, Nazir Bilal Yavuz Microsoft From d3c63a5d68ed76257d110db6377bd3ec859d65a4 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 12 Oct 2023 14:44:38 +0300 Subject: [PATCH v1] Update oldextversions test for pg_stat_statements 1.11 pg_stat_statements 1.11 is introduced at 5a3423ad8e but oldextversions test is not updated. Add missing pg_stat_statements 1.11 test case to oldextversions test. --- .../expected/oldextversions.out | 58 +++ .../pg_stat_statements/sql/oldextversions.sql | 5 ++ 2 files changed, 63 insertions(+) diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out index efb2049ecff..64982aad601 100644 --- a/contrib/pg_stat_statements/expected/oldextversions.out +++ b/contrib/pg_stat_statements/expected/oldextversions.out @@ -250,4 +250,62 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements; t (1 row) +-- New views for pg_stat_statements in 1.11 +AlTER EXTENSION pg_stat_statements UPDATE TO '1.11'; +\d pg_stat_statements + View "public.pg_stat_statements" + Column | Type | Collation | Nullable | Default ++--+---+--+- + userid | oid | | | + dbid | oid | | | + toplevel | boolean | | | + queryid| bigint | | | + query | text | | | + plans | bigint | | | + total_plan_time| double precision | | | + min_plan_time | double precision | | | + max_plan_time | double precision | | | + mean_plan_time | double precision | | | + stddev_plan_time | double precision | | | + calls | bigint | | | + total_exec_time| double precision | | | + min_exec_time | double precision | | | + max_exec_time | double precision | | | + mean_exec_time | double precision | | | + stddev_exec_time | double precision | | | + rows | bigint | | | + shared_blks_hit| bigint | | | + shared_blks_read | bigint | | | + shared_blks_dirtied| bigint | | | + shared_blks_written| bigint | | | + local_blks_hit | bigint | | | + local_blks_read| bigint | | | + local_blks_dirtied | bigint | | | + local_blks_written | bigint | | | + temp_blks_read | bigint | | | + temp_blks_written | bigint | | | + blk_read_time | double precision | | | + blk_write_time | double precision | | | + temp_blk_read_time | double precision | | | + temp_blk_write_time| double precision | | | + wal_records| bigint | | | + wal_fpi| bigint | | | + wal_bytes | numeric | | | + jit_functions | bigint | | | + jit_generation_time| double precision | | | + jit_inlining_count | bigint | | | + jit_inlining_time | double precision | | | + jit_optimization_count | bigint | | | + jit_optimization_time | double precision | | | + jit_emission_count | bigint | | | + jit_emission_time | double precision | | | + jit_deform_count | bigint |
Re: [RFC] Add jit deform_counter
> On 12 Oct 2023, at 15:37, Nazir Bilal Yavuz wrote: > > Hi, > > On Fri, 8 Sept 2023 at 20:22, Dmitry Dolgov <9erthali...@gmail.com> wrote: >> >>> On Fri, Sep 08, 2023 at 03:34:42PM +0200, Daniel Gustafsson wrote: On 5 Sep 2023, at 16:37, Daniel Gustafsson wrote: >>> I've gone over this version of the patch and I think it's ready to go in. I'm marking this Ready for Committer and will go ahead with it shortly barring any objections. >>> >>> Pushed, after another round of review with some minor fixes. > > I realized that pg_stat_statements is bumped to 1.11 with this patch > but oldextversions test is not updated. So, I attached a patch for > updating oldextversions. Thanks for the patch, that was an oversight in the original commit for this. From a quick look it seems correct, I'll have another look later today and will then apply it. -- Daniel Gustafsson
Re: Lowering the default wal_blocksize to 4K
On Wed, Oct 11, 2023 at 6:11 PM Andres Freund wrote: > I think the question is what the point of the crosschecks in long page headers > is. It's pretty easy to see what the point of the xlp_sysid check is - make it > less likely to accidentally replay WAL from a different system. It's much > less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all, > they are also in ControlFileData and the xlp_sysid check tied the control file > and WAL file together. Yeah, fair. > Let me rephrase my point: > > If somebody uses a modified pg_resetwal to change the xlog block size, then > tries to replay WAL from before that change, and is unlucky enough that the > LSN looked for in a segment is the start of a valid record both before/after > the pg_resetwal invocation, then yes, we might not catch that anymore if we > remove the block size check. But the much much more common case is that the > block size was *not* changed, in which case we *already* don't catch that > pg_resetwal was invoked. Hmm. Should we invent a mechanism just for that? > ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are > a belt and suspenders check that is very unlikely to ever catch a mistake that > wouldn't otherwise be caught. I think that's probably right. > I think that's what Thomas was proposing. Thinking about it a bit more I'm > not sure that having the data both in the checkpoint record itself and in > XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ... Yes. To me, having it in the redo record seems considerably more valuable. Because that's where we're going to begin replay, so we should catch most problems straight off. To escape detection at that point, you need to not just be pointed at the wrong WAL archive, but actually have files of diverse origin mixed together in the same WAL archive. That's a less-likely error, and we still have some ways of catching it if it happens. > Which would make the code more efficient... Right. > As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us > anything, but that the protection by xlp_sysid is a bit more meaningful. So a > compromise position could be to include xlp_sysid in the page header, possibly > in a "chopped up" manner, as Matthias suggested. I'm not that keen on the idea of storing the upper half and lower half in alternate pages. That seems to me to add code complexity and cognitive burden with little increased likelihood of catching real problems. I'm not completely opposed to the idea if somebody wants to make it happen, but I bet it would be better to either store the whole thing or just cut it in half and store, say, the low-order bits. -- Robert Haas EDB: http://www.enterprisedb.com
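For concreteness, the "cut it in half and store the low-order bits" variant could look roughly like the sketch below. The field name xlp_sysid_lo is purely hypothetical (today only the long page header carries the full 64-bit xlp_sysid), and the check just mirrors what XLogReaderValidatePageHeader already does for the full value:

    /*
     * Hypothetical sketch: a short page header carrying only the low 32
     * bits of the system identifier.  xlp_sysid_lo is not an existing
     * field; "hdr" stands for that hypothetical header.
     */
    if (hdr->xlp_sysid_lo != (uint32) state->system_identifier)
    {
        report_invalid_record(state,
                              "WAL file appears to be from a different database system: low-order system identifier bits %u do not match expected %u",
                              hdr->xlp_sysid_lo,
                              (uint32) state->system_identifier);
        return false;
    }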
Re: Problem, partition pruning for prepared statement with IS NULL clause.
On 10/12/23 16:27, David Rowley wrote: > I've now also pushed the fix for the incorrect logic for nullkeys in ExecInitPruningContext(). Thanks! Regards, Gluh

Re: Lowering the default wal_blocksize to 4K
On Thu, 12 Oct 2023 at 16:36, Robert Haas wrote: > On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro > wrote: > > That leaves only the segments where a record starts exactly on the > > first usable byte of a segment, which is why I was trying to think of > > a way to cover that case too. I suggested we could notice and insert > > a new record at that place. But Andres suggests it would be too > > expensive and not worth worrying about. > > Hmm. Even in that case, xl_prev has to match. It's not like it's the > wild west. Sure, it's not nearly as good of a cross-check, but it's > something. It seems to me that it's not worth worrying very much about > xlp_seg_size or xlp_blcksz changing undetected in that scenario - if > you're doing that kind of advanced magic, you need to be careful > enough to not mess it up, and if we still cross-check once per > checkpoint cycle that's pretty good. I do worry a bit about the sysid > changing under us, though. It's not that hard to get your WAL archives > mixed up, and it'd be nice to catch that right away. > This reminds me that xlp_tli is not being used to its full potential right now either. We only check that it's not going backwards, but there is at least one not very hard to hit way to get postgres to silently replay on the wrong timeline. [1] [1] https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com -- Ants Aasma Senior Database Engineerwww.cybertec-postgresql.com
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On 10/11/23 21:10, Michael Paquier wrote: On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote: I'm planning to push 0002 (retries in frontend programs, which is where this thread began) and 0004 (add missing locks to SQL functions), including back-patches as far as 12, in a day or so. I'll abandon the others for now, since we're now thinking bigger[1] for backups, side stepping the problem. FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be OK to use it at least for now on HEAD before seeing where the other discussions lead. 0004 would be OK if applied to v11, as well, but I also agree that it is not a big deal to let this branch be as it is now at this stage if you feel strongly this way. Agreed on 0002 and 0004, though I don't really think a back patch of 0004 to 11 is necessary. I'd feel differently if there was a single field report of this issue. I would prefer to hold off on applying 0003 to HEAD until we see how [1] pans out. Having said that, I have a hard time seeing [1] as being something we could back patch. The manipulation of backup_label is simple enough, but starting a cluster without pg_control is definitely going to change some things. Also, the requirement that backup software skip copying pg_control after a minor release is not OK. If we do back patch 0001 is 0003 really needed? Surely if 0001 works with other backup software it would work fine for pg_basebackup. Regards, -David [1] https://www.postgresql.org/message-id/flat/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net
Re: The danger of deleting backup_label
Hi Thomas, On 10/11/23 18:10, Thomas Munro wrote: Even though I spent a whole bunch of time trying to figure out how to make concurrent reads of the control file sufficiently atomic for backups (pg_basebackup and low level filesystem tools), and we explored multiple avenues with varying results, and finally came up with something that basically works pretty well... actually I just hate all of that stuff, and I'm hoping to be able to just withdraw https://commitfest.postgresql.org/45/4025/ and chalk it all up to discovery/education and call *this* thread the real outcome of that preliminary work. So I'm +1 on the idea of putting a control file image into the backup label and I'm happy that you're looking into it. Well, hopefully this thread will *at least* be the solution going forward. Not sure about a back patch yet, see below... We could just leave the control file out of the base backup completely, as you said, removing a whole foot-gun. That's the plan. People following the 'low level' instructions will still get a copy of the control file from the filesystem, and I don't see any reliable way to poison that file without also making it so that a crash wouldn't also be prevented from recovering. I have wondered about putting extra "fingerprint" information into the control file such as the file's path and inode number etc, so that you can try to distinguish between a control file written by PostgreSQL, and a control file copied somewhere else, but that all feels too fragile, and at the end of the day, people following the low level backup instructions had better follow the low level backup instructions (hopefully via the intermediary of an excellent external backup tool). Not sure about the inode idea, because it seems OK for people to move a cluster elsewhere under a variety of circumstances. I do have an idea about how to mark a cluster in "recovery to consistency" mode, but not quite sure how to atomically turn that off at the end of recovery to consistency. I have some ideas I'll work on though. As Stephen mentioned[1], we could perhaps also complain if both backup label and control file exist, and then hint that the user should remove the *control file* (not the backup label!). I had originally suggested we would just overwrite the control file, but by explicitly complaining about it we would also bring the matter to tool/script authors' attention, ie that they shouldn't be backing that file up, or should be removing it in a later step if they copy everything. He also mentions that there doesn't seem to be anything stopping us from back-patching changes to the backup label contents if we go this way. I don't have a strong opinion on that and we could leave the question for later. I'm worried about the possibility of back patching this unless the solution comes out to be simpler than I think and that rarely comes to pass. Surely throwing errors on something that is currently valid (i.e. backup_label and pg_control both present). But perhaps there is a simpler, acceptable solution we could back patch (transparent to all parties except Postgres) and then a more advanced solution we could go forward with. I guess I had better get busy on this. Regards, -David [1] https://www.postgresql.org/message-id/ZL69NXjCNG%2BWHCqG%40tamriel.snowman.net
Pro et contra of preserving pg_proc oids during pg_upgrade
Hi hackers! Please advise on the idea of preserving pg_proc oids during pg_upgrade, in the same way as relfilenodes, type ids and so on. What are the possible downsides of such a solution? Thanks! -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: The danger of deleting backup_label
On 10/11/23 18:22, Michael Paquier wrote: On Tue, Oct 10, 2023 at 05:06:45PM -0400, David Steele wrote: That fails because there is a check to make sure the checkpoint is valid when pg_control is loaded. Another possibility is to use a special LSN like we use for unlogged tables. Anything >= 24 and < WAL segment size will work fine. Do we have any reason to do that in the presence of a backup_label file anyway? We'll know the LSN of the checkpoint based on what the base backup wants us to use. Using a fake-still-rather-valid value for the LSN in the control file to bypass this check does not address the issue you are pointing at: it is just avoiding this check. A reasonable answer would be, IMO, to just not do this check at all based on the control file in this case. Yeah, that's fair. And it looks like we are leaning towards excluding pg_control from the backup entirely, so the point is probably moot. If the contents of the control file are tweaked before sending it through a BASE_BACKUP, it would cover more than just pg_basebackup. Switching the way the control file is sent with new contents in sendFileWithContent() rather than sendFile() would be one way, for instance.. Good point, and that makes this even more compelling. If we include pg_control into backup_label then there is no need to modify pg_control (as above) -- we can just exclude it from the backup entirely. That will certainly require some rejigging in recovery but seems worth it for backup solutions that can't easily modify pg_control. The C-based solutions can do this pretty easily but it is a pretty high bar for anyone else. I have little idea about that, but I guess that you are referring to backrest here. Sure, pgBackRest, but there are other backup solutions written in C. My point is really that we should not depend on backup solutions being able to manipulate C structs. It looks the the solution we are working towards would not require that. Regards, -David
Re: Eager page freeze criteria clarification
Thanks for these notes. On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote: > - We also discussed an idea by Robert to track the number of times we need to > dirty a page when unfreezing and to compare that to the number of pages > dirtied overall (IIRC), but I don't think we really came to a conclusion > around that - and I didn't write down anything so this is purely from > memory. See http://postgr.es/m/ca+tgmoycwisxlbl-pxu13oevthloxm20ojqjnrztkkhxsy9...@mail.gmail.com for my on-list discussion of this. > - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" I want to highlight the denominator issue here. I think we all have the intuition that if we count the number of times that a recently-frozen page gets unfrozen, and that's a big number, that's bad, and a sign that we need to freeze less aggressively. But a lot of the struggle has been around answering the question "big compared to what". A lot of the obvious candidates fail to behave nicely in corner cases, as discussed in the above email. I think this is one of the better candidates so far proposed, possibly the best. > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... This strikes me as another important point. Making the behavior understandable to users is going to be important, because sometimes whatever system we might craft will misbehave, and then people are going to need to be able to understand why it's misbehaving and how to tune/fix it so it works. > Around this point my laptop unfortunately ran out of battery. Possibly the > attendees of this mini summit also ran out of steam (and tea). When the tea is gone, there's little point in continuing. > I likely mangled this substantially, both when taking notes during the lively > discussion, and when revising them to make them a bit more readable. I think it's quite a good summary, actually. Thanks! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
Nikita Malakhov writes: > Please advise on the idea of preserving pg_proc oids during pg_upgrade, in > a way like relfilenodes, type id and so on. What are possible downsides of > such a solution? You have the burden of proof backwards. That would add a great deal of new mechanism, and you haven't provided even one reason why it'd be worth doing. regards, tom lane
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On 10/12/23 09:58, David Steele wrote: On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote: I'm planning to push 0002 (retries in frontend programs, which is where this thread began) and 0004 (add missing locks to SQL functions), including back-patches as far as 12, in a day or so. I'll abandon the others for now, since we're now thinking bigger[1] for backups, side stepping the problem. FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be OK to use it at least for now on HEAD before seeing where the other discussions lead. 0004 would be OK if applied to v11, as well, but I also agree that it is not a big deal to let this branch be as it is now at this stage if you feel strongly this way. Agreed on 0002 and 0004, though I don't really think a back patch of 0004 to 11 is necessary. I'd feel differently if there was a single field report of this issue. I would prefer to hold off on applying 0003 to HEAD until we see how [1] pans out. Having said that, I have a hard time seeing [1] as being something we could back patch. The manipulation of backup_label is simple enough, but starting a cluster without pg_control is definitely going to change some things. Also, the requirement that backup software skip copying pg_control after a minor release is not OK. After some more thought, I think we could massage the "pg_control in backup_label" method into something that could be back patched, with more advanced features (e.g. error on backup_label and pg_control both present on initial cluster start) saved for HEAD. Regards, -David
Re: PostgreSQL domains and NOT NULL constraint
Erki Eessaar writes: > PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way how > one can add NULL's to a column that has a domain with the NOT NULL constraint. > https://www.postgresql.org/docs/current/sql-createdomain.html > To me it seems very strange and amounts to a bug because it defeats the > purpose of domains (to be a reusable assets) and constraints (to avoid any > bypassing of these). I doubt we'd consider doing anything about that. The whole business of domains with NOT NULL constraints is arguably a defect of the SQL standard, because there are multiple ways to produce a value that is NULL and yet must be considered to be of the domain type. The subselect-with-no-output case that you show isn't even the most common one; I'd say that outer joins where there are domain columns on the nullable side are the biggest problem. There's been some discussion of treating the output of such a join, subselect, etc as being of the domain's base type not the domain proper. That'd solve this particular issue since then we'd decide we have to cast the base type back up to the domain type (and hence check its constraints) before inserting the row. But that choice just moves the surprise factor somewhere else, in that queries that used to produce one data type now produce another one. There are applications that this would break. Moreover, I do not think there's any justification for it in the SQL spec. Our general opinion about this is what is stated in the NOTES section of our CREATE DOMAIN reference page [1]: Best practice therefore is to design a domain's constraints so that a null value is allowed, and then to apply column NOT NULL constraints to columns of the domain type as needed, rather than directly to the domain type. regards, tom lane [1] https://www.postgresql.org/docs/current/sql-createdomain.html
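For readers following along, the best practice quoted above boils down to something like this (the domain and table names are made up for illustration):

    -- Leave the domain itself nullable...
    CREATE DOMAIN price AS numeric CHECK (VALUE >= 0);

    -- ...and attach NOT NULL to the columns that need it.
    CREATE TABLE item (
        id   integer PRIMARY KEY,
        cost price NOT NULL
    );

An outer join or empty-subselect result of type price can then still be NULL without violating any domain constraint, while the table column itself keeps rejecting NULLs.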
Re: Lowering the default wal_blocksize to 4K
On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma wrote: > This reminds me that xlp_tli is not being used to its full potential right > now either. We only check that it's not going backwards, but there is at > least one not very hard to hit way to get postgres to silently replay on the > wrong timeline. [1] > > [1] > https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com Maybe I'm missing something, but that seems mostly unrelated. What you're discussing there is the server's ability to figure out when it ought to perform a timeline switch. In other words, the server settles on the wrong TLI and therefore opens and reads from the wrong filename. But here, we're talking about the case where the server is correct about the TLI and LSN and hence opens exactly the right file on disk, but the contents of the file on disk aren't what they're supposed to be due to a procedural error. Said differently, I don't see how anything we could do with xlp_tli would actually fix the problem discussed in that thread. That can detect a situation where the TLI of the file doesn't match the TLI of the pages inside the file, but it doesn't help with the case where the server decided to read the wrong file in the first place. But this does make me wonder whether storing xlp_tli and xlp_pageaddr in every page is really worth the bit-space. That takes 12 bytes plus any padding it forces us to incur, but the actual entropy content of those 12 bytes must be quite low. In normal cases probably 7 or so of those bytes are going to consist entirely of zero bits (TLI < 256, LSN%8k == 0, LSN < 2^40). We could probably find a way of jumbling the LSN, TLI, and maybe some other stuff into an 8-byte quantity or even perhaps a 4-byte quantity that would do about as good a job catching problems as what we have now (e.g. LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI)). In the event of a mismatch, the value actually stored in the page header would be harder for humans to understand, but I'm not sure that really matters here. Users should mostly be concerned with whether a WAL file matches the cluster where they're trying to replay it; forensics on misplaced or corrupted WAL files should be comparatively rare. -- Robert Haas EDB: http://www.enterprisedb.com
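To make the LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI) idea concrete, here is a minimal sketch; the helper is hypothetical, not an existing macro, and the exact mixing function is of course up for debate:

    /*
     * Hypothetical sketch of a 4-byte "jumbled" page check value combining
     * the page address and timeline id, as floated above.
     */
    static inline uint32
    xlp_jumble(XLogRecPtr pageaddr, TimeLineID tli)
    {
        uint32      lsn_hi = (uint32) (pageaddr >> 32);
        uint32      lsn_lo = (uint32) pageaddr;
        uint32      tli_rev = 0;

        /* bit-reverse the timeline id so its entropy lands in the high bits */
        for (int i = 0; i < 32; i++)
            tli_rev |= ((tli >> i) & 1) << (31 - i);

        return lsn_hi ^ lsn_lo ^ tli_rev;
    }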
Re: logical decoding and replication of sequences, take 2
On 9/20/23 11:53, Dilip Kumar wrote: > On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra > wrote: >> > > I was reading through 0001, I noticed this comment in > ReorderBufferSequenceIsTransactional() function > > + * To decide if a sequence change should be handled as transactional or > applied > + * immediately, we track (sequence) relfilenodes created by each transaction. > + * We don't know if the current sub-transaction was already assigned to the > + * top-level transaction, so we need to check all transactions. > > It says "We don't know if the current sub-transaction was already > assigned to the top-level transaction, so we need to check all > transactions". But IIRC as part of the steaming of in-progress > transactions we have ensured that whenever we are logging the first > change by any subtransaction we include the top transaction ID in it. > Yeah, that's a stale comment - the actual code only searched through the top-level ones (and thus relying on the immediate assignment). As I wrote in the earlier response, I suspect this code originates from before I added the GetCurrentTransactionId() calls. That being said, I do wonder why with the immediate assignments we still need the bit in ReorderBufferAssignChild that says: /* * We already saw this transaction, but initially added it to the * list of top-level txns. Now that we know it's not top-level, * remove it from there. */ dlist_delete(&subtxn->node); I don't think that affects this patch, but it's a bit confusing. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 9/13/23 15:18, Ashutosh Bapat wrote: > On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila wrote: >> >> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila wrote: >>> >>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat >>> wrote: On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra wrote: > >> >> But whether or not that's the case, downstream should not request (and >> hence receive) any changes that have been already applied (and >> committed) downstream as a principle. I think a way to achieve this is >> to update the replorigin_session_origin_lsn so that a sequence change >> applied once is not requested (and hence sent) again. >> > > I guess we could update the origin, per attached 0004. We don't have > timestamp to set replorigin_session_origin_timestamp, but it seems we > don't need that. > > The attached patch merges the earlier improvements, except for the part > that experimented with adding a "fake" transaction (which turned out to > have a number of difficult issues). 0004 looks good to me. >>> >>> >>> + { >>> CommitTransactionCommand(); >>> + >>> + /* >>> + * Update origin state so we don't try applying this sequence >>> + * change in case of crash. >>> + * >>> + * XXX We don't have replorigin_session_origin_timestamp, but we >>> + * can just leave that set to 0. >>> + */ >>> + replorigin_session_origin_lsn = seq.lsn; >>> >>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so >>> that after restart, it doesn't use some prior origin LSN to start with >>> which can in turn lead the sequence to go backward. If so, it should >>> be updated before calling CommitTransactionCommand() as we are doing >>> in apply_handle_commit_internal(). If that is not the intention then >>> it is not clear to me how updating replorigin_session_origin_lsn after >>> commit is helpful. >>> >> >> typedef struct ReplicationState >> { >> ... >> /* >> * Location of the latest commit from the remote side. >> */ >> XLogRecPtrremote_lsn; >> >> This is the variable that will be updated with the value of >> replorigin_session_origin_lsn. This means we will now track some >> arbitrary LSN location of the remote side in this variable. The above >> comment makes me wonder if there is anything we are missing or if it >> is just a matter of updating this comment because before the patch we >> always adhere to what is written in the comment. > > I don't think we are missing anything. This value is used to track the > remote LSN upto which all the commits from upstream have been applied > locally. Since a non-transactional sequence change is like a single > WAL record transaction, it's LSN acts as the LSN of the mini-commit. > So it should be fine to update remote_lsn with sequence WAL record's > end LSN. That's what the patches do. I don't see any hazard. But you > are right, we need to update comments. Here and also at other places > like > replorigin_session_advance() which uses remote_commit as name of the > argument which gets assigned to ReplicationState::remote_lsn. > I agree - updating the replorigin_session_origin_lsn shouldn't break anything. As you write, it's essentially a "mini-commit" and the commit order remains the same. I'm not sure about resetting replorigin_session_origin_timestamp to 0 though. It's not something we rely on very much (it may not correlated with the commit order etc.). But why should we set it to 0? We don't do that for regular commits, right? And IMO it makes sense to just use the timestamp of the last commit before the sequence change. 
FWIW I've left this in a separate commit, but I'll merge that into 0002 in the next patch version. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: logical decoding and replication of sequences, take 2
On 7/25/23 12:20, Amit Kapila wrote: > ... > > I have used the debugger to reproduce this as it needs quite some > coordination. I just wanted to see if the sequence can go backward and > didn't catch up completely before the sequence state is marked > 'ready'. On the publisher side, I created a publication with a table > and a sequence. Then did the following steps: > SELECT nextval('s') FROM generate_series(1,50); > insert into t1 values(1); > SELECT nextval('s') FROM generate_series(51,150); > > Then on the subscriber side with some debugging aid, I could find the > values in the sequence shown in the previous email. Sorry, I haven't > recorded each and every step but, if you think it helps, I can again > try to reproduce it and share the steps. > Amit, can you try to reproduce this backwards movement with the latest version of the patch? I have tried triggering that (mis)behavior, but I haven't been successful so far. I'm hesitant to declare it resolved, as it's dependent on timing etc. and you mentioned it required quite some coordination. Thanks! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Andres Freund writes: >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote: >>> I just did a git bisect run to discover when the failure documented >>> in bug #18130 [1] started. And the answer is commit 82a4edabd. > Uh, huh. The problem is that COPY uses a single BulkInsertState for multiple > partitions. Which to me seems to run counter to the following comment: > *The caller can also provide a BulkInsertState object to optimize many > *insertions into the same relation. This keeps a pin on the current > *insertion target page (to save pin/unpin cycles) and also passes a > *BULKWRITE buffer selection strategy object to the buffer manager. > *Passing NULL for bistate selects the default behavior. > The reason this doesn't cause straight up corruption due to reusing a pin from > another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a > call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk > insertion state, which is what leads to the errors from the bug report. > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not > sure that's the right fix. ISTM that independent of whether we fix this via > ReleaseBulkInsertStatePin() resetting the fields or via not reusing > BulkInsertState, we should add assertions defending against future issues like > this (e.g. by adding a relation field to BulkInsertState in cassert builds, > and asserting that the relation is the same as in prior calls unless > ReleaseBulkInsertStatePin() has been called). Ping? We really ought to have a fix for this committed in time for 16.1. regards, tom lane
Re: Eager page freeze criteria clarification
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund wrote: > > Robert, Melanie and I spent an evening discussing this topic around > pgconf.nyc. Here are, mildly revised, notes from that: Thanks for taking notes! > The main thing we are worried about is repeated freezing / unfreezing of > pages within a relatively short time period. > > - Computing an average "modification distance" as I (Andres) proposed efor > each page is complicated / "fuzzy" > > The main problem is that it's not clear how to come up with a good number > for workloads that have many more inserts into new pages than modifications > of existing pages. > > It's also hard to use average for this kind of thing, e.g. in cases where > new pages are frequently updated, but also some old data is updated, it's > easy for the updates to the old data to completely skew the average, even > though that shouldn't prevent us from freezing. > > - We also discussed an idea by Robert to track the number of times we need to > dirty a page when unfreezing and to compare that to the number of pages > dirtied overall (IIRC), but I don't think we really came to a conclusion > around that - and I didn't write down anything so this is purely from > memory. I was under the impression that we decided we still had to consider the number of clean pages dirtied as well as the number of pages unfrozen. The number of pages frozen and unfrozen over a time period gives us some idea of if we are freezing the wrong pages -- but it doesn't tell us if we are freezing the right pages. A riff on an earlier example by Robert: While vacuuming a relation, we freeze 100 pages. During the same time period, we modify 1,000,000 previously clean pages. Of these 1,000,000 pages modified, 90 were frozen. So we unfroze 90% of the pages frozen during this time. Does this mean we should back off of trying to freeze any pages in the relation? > A rough sketch of a freezing heuristic: ... > - Attributing "unfreezes" to specific vacuums would be powerful: > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen that > were frozen during the same vacuum" provides numerator / denominator for > an "error rate" > > - We can perform this attribution by comparing the page LSN with recorded > start/end LSNs of recent vacuums While implementing a rough sketch of this, I realized I had a question about this. vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. vacuum 2 then starts at lsn 600. 5 frozen pages with page lsn > 10 and < 200 were updated. We count those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < 600 were updated. Do we count those somewhere? > - This approach could provide "goals" for opportunistic freezing in a > somewhat understandable way. E.g. aiming to rarely unfreeze data that has > been frozen within 1h/1d/... Similar to the above question, if we are tracking pages frozen and unfrozen during a time period, if there are many vacuums in quick succession, we might care if a page was frozen by one vacuum and then unfrozen during a subsequent vacuum if not too much time has passed. - Melanie
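To make the attribution question above a bit more concrete, here is a rough sketch of the bookkeeping under discussion. The struct and function are hypothetical, purely to illustrate the LSN-range lookup; pages whose LSN falls between two tracked vacuums, like the 3 pages in the example, would land in the fall-through branch:

    /* Hypothetical bookkeeping for "unfreezes" attributed to recent vacuums. */
    typedef struct RecentVacuumStats
    {
        XLogRecPtr  start_lsn;      /* LSN when the vacuum started */
        XLogRecPtr  end_lsn;        /* LSN when the vacuum ended */
        uint64      pages_frozen;   /* pages frozen by this vacuum */
        uint64      pages_unfrozen; /* of those, pages later unfrozen */
    } RecentVacuumStats;

    static void
    attribute_unfreeze(RecentVacuumStats *vacuums, int nvacuums,
                       XLogRecPtr page_lsn)
    {
        for (int i = 0; i < nvacuums; i++)
        {
            if (page_lsn >= vacuums[i].start_lsn &&
                page_lsn < vacuums[i].end_lsn)
            {
                vacuums[i].pages_unfrozen++;
                return;
            }
        }
        /* frozen outside any tracked vacuum: unattributed */
    }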
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Wed, Oct 4, 2023 at 8:07 AM Peter Geoghegan wrote: > If you're willing to take over as committer here, I'll let the issue > of backpatching go. > > I only ask that you note why you've not backpatched in the commit message. Will do, but see also the last point below. I have looked over these patches in some detail and here are my thoughts: - I find the use of the word "generate" in error messages slightly odd. I think it's reasonable given the existing precedent, but the word I would have picked is "assign", which I see is what Aleksander actually had in v1. How would people feel about changing the two existing messages that say "database is not accepting commands that generate new MultiXactIds to avoid wraparound data loss ..." to use "assign" instead, and then make the new messages match that? - I think that 0002 needs a bit of wordsmithing. I will work on that. In particular, I don't like this sentence: "It increases downtime, makes monitoring impossible, disables replication, bypasses safeguards against wraparound, etc." While there's nothing untrue there, it feels more like a sentence from a pgsql-hackers email where most people participating in the discussion understand the general contours of the problem already than like polished documentation that really lays things out methodically. - I'm somewhat inclined to have a go at restructuring these patches a bit so that some of the documentation changes can potentially be back-patched without back-patching the message changes. Even if we eventually decide to back-patch everything or nothing, there are wording adjustments spread across all 3 patches that seem somewhat independent of the changes to the server messages. I think it would be clearer to have one patch that is mostly about documentation wording changes, and a second one that is about changing the server messages and then making documentation changes that are directly dependent on those message changes. And I might also be inclined to back-patch the former patch as far as it makes sense to do so, while leaving the latter one master-only. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 8:54 AM Robert Haas wrote: > - I find the use of the word "generate" in error messages slightly > odd. I think it's reasonable given the existing precedent, but the > word I would have picked is "assign", which I see is what Aleksander > actually had in v1. How would people feel about changing the two > existing messages that say "database is not accepting commands that > generate new MultiXactIds to avoid wraparound data loss ..." to use > "assign" instead, and then make the new messages match that? WFM. > - I think that 0002 needs a bit of wordsmithing. I will work on that. > In particular, I don't like this sentence: "It increases downtime, > makes monitoring impossible, disables replication, bypasses safeguards > against wraparound, etc." While there's nothing untrue there, it feels > more like a sentence from a pgsql-hackers email where most people > participating in the discussion understand the general contours of the > problem already than like polished documentation that really lays > things out methodically. I agree. > - I'm somewhat inclined to have a go at restructuring these patches a > bit so that some of the documentation changes can potentially be > back-patched without back-patching the message changes. Even if we > eventually decide to back-patch everything or nothing, there are > wording adjustments spread across all 3 patches that seem somewhat > independent of the changes to the server messages. I think it would be > clearer to have one patch that is mostly about documentation wording > changes, and a second one that is about changing the server messages > and then making documentation changes that are directly dependent on > those message changes. And I might also be inclined to back-patch the > former patch as far as it makes sense to do so, while leaving the > latter one master-only. No objections from me. -- Peter Geoghegan
Re: Parent/child context relation in pg_get_backend_memory_contexts()
Hi, On 2023-08-04 21:16:49 +0300, Melih Mutlu wrote: > Melih Mutlu , 16 Haz 2023 Cum, 17:03 tarihinde şunu > yazdı: > > > With this change, here's a query to find how much space used by each > > context including its children: > > > > > WITH RECURSIVE cte AS ( > > > SELECT id, total_bytes, id as root, name as root_name > > > FROM memory_contexts > > > UNION ALL > > > SELECT r.id, r.total_bytes, cte.root, cte.root_name > > > FROM memory_contexts r > > > INNER JOIN cte ON r.parent_id = cte.id > > > ), > > > memory_contexts AS ( > > > SELECT * FROM pg_backend_memory_contexts > > > ) > > > SELECT root as id, root_name as name, sum(total_bytes) > > > FROM cte > > > GROUP BY root, root_name > > > ORDER BY sum DESC; > > > > Given that the above query to get total bytes including all children is > still a complex one, I decided to add an additional info in > pg_backend_memory_contexts. > The new "path" field displays an integer array that consists of ids of all > parents for the current context. This way it's easier to tell whether a > context is a child of another context, and we don't need to use recursive > queries to get this info. I think that does make it a good bit easier. Both to understand and to use. > Here how pg_backend_memory_contexts would look like with this patch: > > postgres=# SELECT name, id, parent, parent_id, path > FROM pg_backend_memory_contexts > ORDER BY total_bytes DESC LIMIT 10; > name | id | parent | parent_id | path > -+-+--+---+-- > CacheMemoryContext | 27 | TopMemoryContext | 0 | {0} > Timezones | 124 | TopMemoryContext | 0 | {0} > TopMemoryContext| 0 | | | > MessageContext | 8 | TopMemoryContext | 0 | {0} > WAL record construction | 118 | TopMemoryContext | 0 | {0} > ExecutorState | 18 | PortalContext|17 | {0,16,17} > TupleSort main | 19 | ExecutorState|18 | {0,16,17,18} > TransactionAbortContext | 14 | TopMemoryContext | 0 | {0} > smgr relation table | 10 | TopMemoryContext | 0 | {0} > GUC hash table | 123 | GUCMemoryContext | 122 | {0,122} > (10 rows) Would we still need the parent_id column? > + > + > + > + context_id int4 > + > + > + Current context id > + > + I think the docs here need to warn that the id is ephemeral and will likely differ in the next invocation. > + > + > + parent_id int4 > + > + > + Parent context id > + > + > + > + > + > + path int4 > + > + > + Path to reach the current context from TopMemoryContext > + > + Perhaps we should include some hint here how it could be used? > > > > diff --git a/src/backend/utils/adt/mcxtfuncs.c > b/src/backend/utils/adt/mcxtfuncs.c > index 92ca5b2f72..81cb35dd47 100644 > --- a/src/backend/utils/adt/mcxtfuncs.c > +++ b/src/backend/utils/adt/mcxtfuncs.c > @@ -20,6 +20,7 @@ > #include "mb/pg_wchar.h" > #include "storage/proc.h" > #include "storage/procarray.h" > +#include "utils/array.h" > #include "utils/builtins.h" > > /* -- > @@ -28,6 +29,8 @@ > */ > #define MEMORY_CONTEXT_IDENT_DISPLAY_SIZE1024 > > +static Datum convert_path_to_datum(List *path); > + > /* > * PutMemoryContextsStatsTupleStore > * One recursion level for pg_get_backend_memory_contexts. 
> @@ -35,9 +38,10 @@ > static void > PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore, >TupleDesc > tupdesc, MemoryContext context, > - const char > *parent, int level) > + const char > *parent, int level, int *context_id, > + int parent_id, > List *path) > { > -#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 9 > +#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS 12 > > Datum values[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS]; > boolnulls[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS]; > @@ -45,6 +49,7 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore, > MemoryContext child; > const char *name; > const char *ident; > + int current_context_id = (*context_id)++; > > Assert(MemoryContextIsValid(context)); > > @@ -103,13 +108,29 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate > *tupstore, > values[6] = Int64GetDatum(stat.freespace); > values[7] = Int64GetDatum(stat.freechunks); > values[8] = Int64GetDatum(stat.totalspace - stat.freespace); > + values[9] = Int32GetDatum(current_context_id); > + > + if(parent_id < 0) > +
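As a usage note on the proposed path column: with it, the total for a context including all of its children no longer needs a recursive CTE. A sketch against the patched view, using ExecutorState (id 18) from the example output earlier in the thread (bearing in mind the ids are ephemeral, as noted above):

    SELECT sum(total_bytes) AS total_bytes_with_children
    FROM pg_backend_memory_contexts
    WHERE id = 18
       OR 18 = ANY(path);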
Re: Performance degradation on concurrent COPY into a single relation in PG16.
Hi, On 2023-10-12 11:44:09 -0400, Tom Lane wrote: > Andres Freund writes: > >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote: > >>> I just did a git bisect run to discover when the failure documented > >>> in bug #18130 [1] started. And the answer is commit 82a4edabd. > > > Uh, huh. The problem is that COPY uses a single BulkInsertState for > > multiple > > partitions. Which to me seems to run counter to the following comment: > > * The caller can also provide a BulkInsertState object to optimize many > > * insertions into the same relation. This keeps a pin on the current > > * insertion target page (to save pin/unpin cycles) and also passes a > > * BULKWRITE buffer selection strategy object to the buffer manager. > > * Passing NULL for bistate selects the default behavior. > > > The reason this doesn't cause straight up corruption due to reusing a pin > > from > > another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and > > a > > call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk > > insertion state, which is what leads to the errors from the bug report. > > > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not > > sure that's the right fix. ISTM that independent of whether we fix this via > > ReleaseBulkInsertStatePin() resetting the fields or via not reusing > > BulkInsertState, we should add assertions defending against future issues > > like > > this (e.g. by adding a relation field to BulkInsertState in cassert builds, > > and asserting that the relation is the same as in prior calls unless > > ReleaseBulkInsertStatePin() has been called). > > Ping? We really ought to have a fix for this committed in time for > 16.1. I kind of had hoped somebody would comment on the approach. Given that nobody has, I'll push the minimal fix of resetting the state in ReleaseBulkInsertStatePin(), even though I think architecturally that's not great. Greetings, Andres Freund
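For reference, the minimal fix described above would look roughly like this. ReleaseBulkInsertStatePin() already exists in heapam.c, but the exact body below is a sketch, not the committed change:

    void
    ReleaseBulkInsertStatePin(BulkInsertState bistate)
    {
        if (bistate->current_buf != InvalidBuffer)
            ReleaseBuffer(bistate->current_buf);

        /*
         * Also forget the buffer, so that the next insertion (possibly into
         * a different partition) does not reuse a pin and target page
         * belonging to another relation.
         */
        bistate->current_buf = InvalidBuffer;
    }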
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
Hi! Say we have data processed by some user function and we want to keep a reference to this function in our data. In this case we have two options: the first is to store the string output of regprocedure, which is not very convenient, and the second is to store its OID, which requires a slight modification of pg_upgrade (pg_dump and the function/procedure creation path). I've read the previous threads about using regproc, and I agree that this is not a very good use case anyway, but I haven't found any serious obstacles that would forbid modifying pg_upgrade this way. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
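For illustration, the "string output" option mentioned above can lean on the regprocedure cast, which survives dump/restore and pg_upgrade because it resolves by signature rather than by OID (the function here is chosen arbitrarily):

    -- store the textual signature rather than the oid
    SELECT 'lower(text)'::regprocedure::text;   -- 'lower(text)'

    -- resolve it back to the (current) pg_proc row when needed
    SELECT oid, proname
    FROM pg_proc
    WHERE oid = 'lower(text)'::regprocedure;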
Re: Separate memory contexts for relcache and catcache
Hi, On 2023-08-09 15:02:31 +0300, Melih Mutlu wrote: > To quickly show how pg_backend_memory_contexts would look like, I did the > following: > > -Create some tables: > SELECT 'BEGIN;' UNION ALL SELECT format('CREATE TABLE %1$s(id serial > primary key, data text not null unique)', 'test_'||g.i) FROM > generate_series(0, 1000) g(i) UNION ALL SELECT 'COMMIT;';\gexec > > -Open a new connection and query pg_backend_memory_contexts [1]: > This is what you'll see before and after the patch. > -- HEAD: > name| used_bytes | free_bytes | total_bytes > +++- > CacheMemoryContext | 467656 | 56632 | 524288 > index info | 111760 | 46960 | 158720 > relation rules | 4416 | 3776 |8192 > (3 rows) > > -- Patch: > name | used_bytes | free_bytes | total_bytes > ---+++- > CatCacheMemoryContext | 217696 | 8 | 262144 > RelCacheMemoryContext | 248264 | 13880 | 262144 > index info| 111760 | 46960 | 158720 > CacheMemoryContext| 2336 | 5856 |8192 > relation rules| 4416 | 3776 |8192 > (5 rows) Have you checked what the source of the remaining allocations in CacheMemoryContext are? One thing that I had observed previously and reproduced with this patch, is that the first backend starting after a restart uses considerably more memory: first: ┌───┬┬┬─┐ │ name │ used_bytes │ free_bytes │ total_bytes │ ├───┼┼┼─┤ │ CatCacheMemoryContext │ 370112 │ 154176 │ 524288 │ │ RelCacheMemoryContext │ 244136 │ 18008 │ 262144 │ │ index info│ 104392 │ 45112 │ 149504 │ │ CacheMemoryContext│ 2304 │ 5888 │8192 │ │ relation rules│ 3856 │240 │4096 │ └───┴┴┴─┘ second: ┌───┬┬┬─┐ │ name │ used_bytes │ free_bytes │ total_bytes │ ├───┼┼┼─┤ │ CatCacheMemoryContext │ 215072 │ 47072 │ 262144 │ │ RelCacheMemoryContext │ 243856 │ 18288 │ 262144 │ │ index info│ 104944 │ 47632 │ 152576 │ │ CacheMemoryContext│ 2304 │ 5888 │8192 │ │ relation rules│ 3856 │240 │4096 │ └───┴┴┴─┘ This isn't caused by this patch, but it does make it easier to pinpoint than before. The reason is fairly simple: On the first start we start without being able to use relcache init files, in later starts we can. The reason the size increase is in CatCacheMemoryContext, rather than RelCacheMemoryContext, is simple: When using the init file the catcache isn't used, when not, we have to query the catcache a lot to build the initial relcache contents. Given the size of both CatCacheMemoryContext and RelCacheMemoryContext in a new backend, I think it might be worth using non-default aset parameters. A bit ridiculous to increase block sizes from 8k upwards in every single connection made to postgres ever. > - Run select on all tables > SELECT format('SELECT count(*) FROM %1$s', 'test_'||g.i) FROM > generate_series(0, 1000) g(i);\gexec > > - Then check pg_backend_memory_contexts [1] again: > --HEAD > name| used_bytes | free_bytes | total_bytes > +++- > CacheMemoryContext |8197344 | 257056 | 8454400 > index info |2102160 | 113776 | 2215936 > relation rules | 4416 | 3776 |8192 > (3 rows) > > --Patch > name | used_bytes | free_bytes | total_bytes > ---+++- > RelCacheMemoryContext |4706464 |3682144 | 8388608 > CatCacheMemoryContext |3489384 | 770712 | 4260096 > index info|2102160 | 113776 | 2215936 > CacheMemoryContext| 2336 | 5856 |8192 > relation rules| 4416 | 3776 |8192 > (5 rows) > > You can see that CacheMemoryContext does not use much memory without > catcache and relcache (at least in cases similar to above), and it's easy > to bloat catcache and relcache. That's why I think it would be useful to > see their usage separately. Yes, I think it'd be quite useful. 
There's ways to bloat particularly catcache much further, and it's hard to differentiate that from other sources of bloat right now. > +static void > +CreateCatCacheMemoryContext() We typically use (void) to differentiate from an
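To illustrate the "non-default aset parameters" point, a sketch of what creating the proposed context with a larger initial block size could look like; the context name comes from the patch, and the 64 kB figure is purely an assumption for illustration:

    /*
     * Sketch only: start the catcache context with bigger blocks instead of
     * growing from 8 kB upwards in every backend.
     */
    CatCacheMemoryContext =
        AllocSetContextCreate(TopMemoryContext,
                              "CatCacheMemoryContext",
                              0,                /* minContextSize */
                              64 * 1024,        /* initBlockSize (assumption) */
                              ALLOCSET_DEFAULT_MAXSIZE);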
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 9:57 AM Nikita Malakhov wrote: > Say, we have data processed by some user function and we want to keep > reference to this function > in our data. > Then you need to keep the user-visible identifier of said function (schema+name+input argument types - you'd probably want to incorporate version into the name) in your user-space code. Exposing runtime generated oids to user-space is not something I can imagine the system supporting. It goes against the very definition of "implementation detail" that user-space code is not supposed to depend upon. David J.
Re: Eager page freeze criteria clarification
On Thu, Oct 12, 2023 at 11:50 AM Melanie Plageman wrote: > I was under the impression that we decided we still had to consider > the number of clean pages dirtied as well as the number of pages > unfrozen. The number of pages frozen and unfrozen over a time period > gives us some idea of if we are freezing the wrong pages -- but it > doesn't tell us if we are freezing the right pages. A riff on an > earlier example by Robert: > > While vacuuming a relation, we freeze 100 pages. During the same time > period, we modify 1,000,000 previously clean pages. Of these 1,000,000 > pages modified, 90 were frozen. So we unfroze 90% of the pages frozen > during this time. Does this mean we should back off of trying to > freeze any pages in the relation? I didn't think we decided the thing your first sentence says you thought we decided ... but I also didn't think of this example. That said, it might be fine to back off freezing in this case because we weren't doing enough of it to make any real difference in the first place. Maybe there's a more moderate example where it feels like a bigger problem? > > - "Number of pages frozen during vacuum" and "Number of pages unfrozen > > that > > were frozen during the same vacuum" provides numerator / denominator for > > an "error rate" > > > > - We can perform this attribution by comparing the page LSN with recorded > > start/end LSNs of recent vacuums > > While implementing a rough sketch of this, I realized I had a question > about this. > > vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages. > vacuum 2 then starts at lsn 600. > 5 frozen pages with page lsn > 10 and < 200 were updated. We count > those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and < > 600 were updated. Do we count those somewhere? How did those pages get frozen when no VACUUM was running? The LSN of the frozen page just prior to unfreezing it should be from the operation that froze it, which should be some VACUUM. I think the case you're talking about could happen if we did on-access freezing, but today I believe we don't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Wait events for delayed checkpoints
On Wed, Oct 11, 2023 at 9:13 PM Thomas Munro wrote: > You can't tell if your checkpointer is spending a lot of time waiting > around for flags in delayChkptFlags to clear. Trivial patch to add > that. I've managed to see it a few times when checkpointing > repeatedly with a heavy pgbench workload. > > I had to stop and think for a moment about whether these events belong > under "WaitEventIPC", "waiting for notification from another process" > or under "WaitEventTimeout", "waiting for a timeout to expire". I > mean, both? It's using sleep-and-poll instead of (say) a CV due to > the economics, we want to make the other side as cheap as possible, so > we don't care about making the checkpointer take some micro-naps in > this case. I feel like the key point here is that it's waiting for > another process to do stuff and unblock it. IPC seems right to me. Yeah, a timeout is being used, but as you say, that's an implementation detail. +1 for the idea, too. -- Robert Haas EDB: http://www.enterprisedb.com
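For context, the change amounts to wrapping the existing sleep-and-poll loop in CreateCheckPoint() with a wait event report, roughly like the sketch below; the event name is a guess at what the patch uses and may differ:

    /* Sketch: report an IPC wait event around the delayChkptFlags poll loop. */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
    if (nvxids > 0)
    {
        pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
        do
        {
            pg_usleep(10000L);      /* wait a little and re-check the flags */
        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_START));
        pgstat_report_wait_end();
    }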
Re: On login trigger: take three
On Tue, Oct 10, 2023 at 3:43 PM Alexander Korotkov wrote: > Yep, in v43 it worked that way. One transaction has to wait for > another finishing update of pg_database tuple, then fails. This is > obviously ridiculous. Since overlapping setters of flag will have to > wait anyway, I changed lock mode in v44 for them to > AccessExclusiveLock. Now, waiting transaction then sees the updated > tuple and doesn't fail. Doesn't that mean that if you create the first login trigger in a database and leave the transaction open, nobody can connect to that database until the transaction ends? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Special-case executor expression steps for common combinations
Hi, On 2023-10-12 13:24:27 +0300, Heikki Linnakangas wrote: > On 12/10/2023 12:48, Daniel Gustafsson wrote: > > The attached patch adds special-case expression steps for common sets of > > steps > > in the executor to shave a few cycles off during execution, and make the JIT > > generated code simpler. > > > > * Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls > > of > >strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains > > used for > >> 2 arguments). > > * Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the > >common case of one arg aggs. > > Are these relevant when JITting? I'm a little sad if the JIT compiler cannot > unroll these on its own. Is there something we could do to hint it, so that > it could treat the number of arguments as a constant? I think it's mainly important for interpreted execution. > >skip extra setup for steps which are only interested in the side effects. > > I'm a little surprised if this makes a measurable performance difference, > but sure, why not. It seems nice to be more explicit when you don't expect a > return value. IIRC this is more interesting for JIT than the above, because it allows LLVM to know that the return value isn't needed and thus doesn't need to be computed. Greetings, Andres Freund
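For readers who have not looked at the patch, the one-argument special case boils down to a dedicated interpreter opcode along these lines; this is a sketch in the style of execExprInterp.c, not the patch text itself:

    EEO_CASE(EEOP_FUNCEXPR_STRICT_1)
    {
        FunctionCallInfo fcinfo = op->d.func.fcinfo_data;

        /* strict function with exactly one argument: test it directly */
        if (fcinfo->args[0].isnull)
            *op->resnull = true;
        else
        {
            fcinfo->isnull = false;
            *op->resvalue = op->d.func.fn_addr(fcinfo);
            *op->resnull = fcinfo->isnull;
        }

        EEO_NEXT();
    }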
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 7:36 AM Tom Lane wrote: > Nikita Malakhov writes: > > Please advise on the idea of preserving pg_proc oids during pg_upgrade, > in > > a way like relfilenodes, type id and so on. What are possible downsides > of > > such a solution? > > You have the burden of proof backwards. That would add a great deal > of new mechanism, and you haven't provided even one reason why it'd > be worth doing. > > I was curious about the comment regarding type oids being copied over and I found the commentary in pg_upgrade.c that describes which oids are copied over and why, but the IMPLEMENTATION seems to be out-of-sync with the actual implementation. """ It preserves the relfilenode numbers so TOAST and other references to relfilenodes in user data is preserved. (See binary-upgrade usage in pg_dump). We choose to preserve tablespace and database OIDs as well. """ David J.
Re: New WAL record to detect the checkpoint redo location
On Thu, Oct 12, 2023 at 3:27 AM Michael Paquier wrote: > I have looked at 0001, for now.. And it looks OK to me. Cool. I've committed that one. Thanks for the review. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 10:35 AM Tom Lane wrote: > You have the burden of proof backwards. That would add a great deal > of new mechanism, and you haven't provided even one reason why it'd > be worth doing. "A great deal of new mechanism" seems like a slight exaggeration. We preserve a bunch of kinds of OIDs already, and it wouldn't be any harder to preserve this one than the ones we preserve already, or so I think. So it would be some additional mechanism, but maybe not a great deal. As to whether it's a good idea, it isn't necessary for the system to operate properly, so we didn't, but it's a judgement call whether it's better for other reasons, like being able to have regprocedure columns survive an upgrade, or making users being less confused, or allowing people supporting PostgreSQL having an easier time debugging issues. Personally, I've never been quite sure we made the right decision there. I admit that I'm not particularly keen to try to add the amount of mechanism that would be required to preserve every single OID everywhere, but I also somehow feel like the fact that we don't is pretty weird. The pg_upgrade experience right now is a bit as if you woke up in the morning and found that city officials came by during the night and renumbered your house, thus changing your address. Then, they sent change of address forms to everyone who ever mails you anything, plus updated your address with your doctor's office and your children's school. In a way, there's no problem: nothing has really changed for you in any way that matters. Yet, I think that would feel pretty uncomfortable if it actually happened to you, and I think the pg_upgrade experience is uncomfortable in the same way. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_upgrade's interaction with pg_resetwal seems confusing
On Thu, Oct 12, 2023 at 7:17 AM Amit Kapila wrote: > Now, as mentioned in the first paragraph, it seems we anyway don't > need to reset the WAL at the end when setting the next OID for the new > cluster with the -o option. If that is true, then I think even without > slots work it will be helpful to have such an option in pg_resetwal. > > Thoughts? I wonder if we should instead provide a way to reset the OID counter with a function call inside the database, gated by IsBinaryUpgrade. Having something like pg_resetwal --but-dont-actually-reset-the-wal seems both self-contradictory and vulnerable to abuse that we might be better off not inviting. -- Robert Haas EDB: http://www.enterprisedb.com
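A very rough sketch of what such an in-database, upgrade-only function might look like, modeled on the existing binary_upgrade_* support functions; both the function name and the counter-setting helper are hypothetical:

    Datum
    binary_upgrade_set_next_oid(PG_FUNCTION_ARGS)
    {
        Oid         nextoid = PG_GETARG_OID(0);

        CHECK_IS_BINARY_UPGRADE;    /* errors out unless in -b mode */

        /*
         * Hypothetical helper: advance the cluster-wide OID counter to
         * nextoid.  No such setter is exposed today; this only shows the
         * IsBinaryUpgrade gating.
         */
        AdvanceNextObjectIdTo(nextoid);

        PG_RETURN_VOID();
    }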
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023, 11:21 Robert Haas wrote: > > The pg_upgrade experience right now is a bit as if you woke up in the > morning and found that city officials came by during the night and > renumbered your house, thus changing your address. Then, they sent > change of address forms to everyone who ever mails you anything, plus > updated your address with your doctor's office and your children's > school. In a way, there's no problem: nothing has really changed for > you in any way that matters. Yet, I think that would feel pretty > uncomfortable if it actually happened to you, and I think the > pg_upgrade experience is uncomfortable in the same way. > It's more like a lot number or surveying tract than a postal address. Useful for a single party, the builder or the government, but not something you give out to other people so they can find you. Whether or not we copy over oids should be decided based upon our internal needs, not end users'. Which is why the few that do get copied exist: because we store them in internal files that we want to copy as part of the upgrade. It also isn't like pg_dump/restore is going to retain them, and the less divergence between that and pg_upgrade, arguably the better. David J.
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 2:38 PM David G. Johnston wrote: > It's more like a lot number or surveying tract than an postal address. > Useful for a single party, the builder or the government, but not something > you give out to other people so they can find you. > > Whether or not we copy over oids should be done based upon our internal > needs, not end users. Which is why the fee that do get copied exists, > because we store them in internal files that we want to copy as part of the > upgrade. It also isn't like pg_dump/restore is going to retain them and the > less divergence between that and pg_upgrade arguably the better. We build the product for the end users. Their desires and needs are relevant. And if they're telling us we did it wrong, we need to listen to that. We don't have to do everything that everybody wants, but treating developer needs as strictly more important than end-user needs is self-defeating. I agree that there's a trade-off here. Preserving more OIDs requires more code and makes pg_dump and other things more complicated, which is not great. But, at least to me, arguing that there are no downsides of not preserving these OIDs is simply not a believable argument. Well, maybe somebody believes it. But I don't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 11:43 AM Robert Haas wrote: > On Thu, Oct 12, 2023 at 2:38 PM David G. Johnston > wrote: > > It's more like a lot number or surveying tract than an postal address. > Useful for a single party, the builder or the government, but not something > you give out to other people so they can find you. > > > > Whether or not we copy over oids should be done based upon our internal > needs, not end users. Which is why the fee that do get copied exists, > because we store them in internal files that we want to copy as part of the > upgrade. It also isn't like pg_dump/restore is going to retain them and > the less divergence between that and pg_upgrade arguably the better. > > We build the product for the end users. Their desires and needs are > relevant. And if they're telling us we did it wrong, we need to listen > to that. We don't have to do everything that everybody wants, but > treating developer needs as strictly more important than end-user > needs is self-defeating. > Every catalog has both a natural and a surrogate key. Developers get to use the surrogate key while end-users get to use the natural one (i.e., the one they provided). I see no reason to change that specification. And I do believe there are no compelling reasons for an end-user to need to use the surrogate key instead of the natural one. The example provided by the OP isn't one, IMO, the overall goal can be accomplished via the natural key (if it cannot, maybe we need to make retrieving the natural key for a pg_proc record given an OID easier). The fact that OIDs are not even accessible via SQL further reinforces this belief. The only reason to need OIDs as a DBA is to perform joins among the catalogs and all such joins are local to the database and even session executing them - the specific values are immaterial. The behavior of pg_upgrade only preserving OIDs that are necessary due to the physical copying of data files from the old server to the new one seems sufficient both in terms of effort and the principle of doing the minimum amount to solve the problem at hand. David J.
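To make the "retrieving the natural key for a pg_proc record given an OID" point concrete, a minimal sketch of what is already possible today (lower() is just an arbitrary built-in standing in for a user function; no new mechanism is assumed):

-- regprocedure renders the natural key (schema-qualified name plus
-- argument types) for a pg_proc row, given only its OID.
SELECT p.oid, p.oid::regprocedure AS natural_key
FROM pg_proc AS p
WHERE p.proname = 'lower';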
building 32bit windows version
Greetings, I've been running into challenges building a 32-bit Windows version. I suspect there are no build farms for this and nobody really builds it. The reason I need these is to be able to build 32-bit DLLs for ODBC. At one time EDB used to provide binaries, but that no longer appears to be the case. Running build.bat in an x86 environment fails, but that can be easily fixed by adding $ENV{CONFIG}="x86"; in buildenv.pl. Building postgres then works as advertised; however, install fails with "Copying build output files...Could not copy release\zic\zic.exe to postgres\bin\zic.exe" Apparently 32-bit DLLs are required. If there is an easier way to get libpq.dll and the include files for building, I'm all ears. Dave Cramer
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan wrote: > No objections from me. Here is a doc-only patch that I think could be back-patched as far as emergency mode exists. It combines all of the wording changes to the documentation from v1-v3 of the previous version, but without changing the message text that is quoted in the documentation, and without adding more instances of similar message texts to the documentation, and with a bunch of additional hacking by me. Some things I changed: - I made it so that the MXID section refers back to the XID section instead of duplicating it, but with a short list of differences. - I weakened the existing claim that says you must be a superuser or VACUUM definitely won't fix it to say instead that you SHOULD run VACUUM as the superuser, because the former is false and the latter is true. - I made the list of steps for recovering more explicit. - I split out the bit about running autovacuum in the affected database into a separate step to be performed after VACUUM for continued good operation, rather than a necessary ingredient in recovery, because it isn't. - A bit of other minor rejiggering. I'm not forgetting about the rest of the proposed patch set, or the change I proposed earlier. I'm just posting this much now because this is how far I got today, and it would be useful to get comments before I go further. I think the residual portion of the patch set not included in this documentation patch will be quite small, and I think that's a good thing, but again, I don't intend to blow that off. -- Robert Haas EDB: http://www.enterprisedb.com v10-0001-Update-the-documentation-on-recovering-from-M-XI.patch Description: Binary data
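As a possible companion to the recovery steps, a rough sketch of the sort of monitoring query the text could point readers at for judging how close each database actually is (autovacuum_freeze_max_age defaults to 200 million, and the system stops assigning new XIDs at roughly 2 billion of XID age):

-- Age of the oldest unfrozen XID and MXID per database.
SELECT datname,
       age(datfrozenxid)    AS xid_age,
       mxid_age(datminmxid) AS mxid_age
FROM pg_database
ORDER BY xid_age DESC;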
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 3:36 PM David G. Johnston wrote: > Every catalog has both a natural and a surrogate key. Developers get to use > the surrogate key while end-users get to use the natural one (i.e., the one > they provided). I see no reason to change that specification. I agree with this. > And I do believe there are no compelling reasons for an end-user to need to > use the surrogate key instead of the natural one. But I disagree with this. > The example provided by the OP isn't one, IMO, the overall goal can be > accomplished via the natural key (if it cannot, maybe we need to make > retrieving the natural key for a pg_proc record given an OID easier). The > fact that OIDs are not even accessible via SQL further reinforces this > belief. The only reason to need OIDs as a DBA is to perform joins among the > catalogs and all such joins are local to the database and even session > executing them - the specific values are immaterial. This just all seems very simplistic to me. In theory it's true, but in practice it isn't. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
Hi, I've already implemented preserving PG_PROC oids during pg_upgrade in the same way as relfilenodes, etc. Actually, it is quite simple, and at first look there are no problems. About using the surrogate key: this feature is more for data generated by the DBMS itself, i.e. data processed by some extension and saved and re-processed automatically or at the user's request, but without bothering the user with these internal keys. The main question: are there pitfalls of which I am not aware? Thanks for your replies! -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 1:31 PM Nikita Malakhov wrote: > About using surrogate key - this feature is more for data generated by > the DBMS itself, i.e. data processed by some extension and saved > and re-processed automatically or by user's request, but without bothering > user with these internal keys. > Then what does it matter whether you spell it: 12345 or my_ext.do_something(int) ? Why do you require us to redefine the scope for which pg_proc.oid is useful in order to implement this behavior? Your extension breaks if your user uses logical backups or we otherwise get into a position where pg_upgrade cannot be used to migrate in the future. Is avoiding the textual representation so necessary that you need to add another dependency to the system? That just seems unwise regardless of how easy it may be to accomplish. David J.
Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound
On Thu, Oct 12, 2023 at 1:10 PM Robert Haas wrote: > On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan wrote: > > No objections from me. > > Here is a doc-only patch that I think could be back-patched as far as > emergency mode exists. It combines all of the wording changes to the > documentation from v1-v3 of the previous version, but without changing > the message text that is quoted in the documentation, and without > adding more instances of similar message texts to the documentation, > and with a bunch of additional hacking by me. It's a bit weird that we're effectively saying "pay no attention to that terrible HINT"...but I get it. The simple fact is that the docs were written in a way that allowed misinformation to catch on -- the damage that needs to be undone isn't exactly limited to the docs themselves. Your choice to not backpatch the changes to the log messages makes a lot more sense, now that I see that I see the wider context built by this preparatory patch. Arguably, it would be counterproductive to pretend that we didn't make this mistake on the backbranches. Better to own the mistake. > Some things I changed: > > - I made it so that the MXID section refers back to the XID section > instead of duplicating it, but with a short list of differences. > - I weakened the existing claim that says you must be a superuser or > VACUUM definitely won't fix it to say instead that you SHOULD run > VACUUM as the superuser, because the former is false and the latter is > true. > - I made the list of steps for recovering more explicit. > - I split out the bit about running autovacuum in the affected > database into a separate step to be performed after VACUUM for > continued good operation, rather than a necessary ingredient in > recovery, because it isn't. > - A bit of other minor rejiggering. Those all make sense to me. > I'm not forgetting about the rest of the proposed patch set, or the > change I proposed earlier. I'm just posting this much now because this > is how far I got today, and it would be useful to get comments before > I go further. I think the residual portion of the patch set not > included in this documentation patch will be quite small, and I think > that's a good thing, but again, I don't intend to blow that off. Of course. Your general approach seems wise. Thanks for working on this. I will be relieved once this is finally taken care of. -- Peter Geoghegan
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
Hi, A textual representation requires a long text field, because it could contain a schema and arguments; it is awkward and inefficient to save as part of the data, and it must be parsed to retrieve the function OID. By using the OID directly (actually, the value of a regprocedure field) we avoid that, and the function can be retrieved by its primary key. Why can't pg_upgrade be used? OID preservation logic is already implemented for several OIDs in catalog tables, like pg_class, type, relfilenode, enum... I've mentioned twice that this logic is already implemented and I haven't encountered any problems with pg_upgrade. Actually, I've asked here because there are several references to PG_PROC oids from other tables in the system catalog, so I was worried that this logic could break something I do not know about. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Pro et contra of preserving pg_proc oids during pg_upgrade
On Thu, Oct 12, 2023 at 2:58 PM Nikita Malakhov wrote: > Why can't pg_upgrade be used? > We document both a pg_dump/pg_restore migration and a pg_upgrade one (not to mention that logical backup and restore would cause the oids to change). It seems odd to have a feature that requires pg_upgrade to be the chosen one. pg_upgrade is an option, not a requirement. Same goes for pg_basebackup. pg_upgrade itself warns that should the on-disk file format change then it would be unusable - though I suspect that we'd end up with some kind of hybrid approach in that case. > OID preservation logic is already implemented > for several OIDs in catalog tables, like pg_class, type, relfilenode, > enum... > > We are allowed to preserve oids if we wish but that doesn't mean we must, nor does doing so constitute a declaration that such oids are part of the public API. And I don't see us making OIDs part of the public API unless we modify pg_dump to include them in its output. > Actually, I've asked here because there are several references to PG_PROC > oids > from other tables in the system catalog > Of course there are, e.g., views depending on functions would result in those. But pg_upgrade et al. recompute the views, so the changing of oids isn't a problem. Long text fields are common in databases, and if there are concerns with parsing/interpretation we can add functions to make doing that simpler. David J.
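For what it's worth, the parsing in both directions is already just a cast away; a quick sketch, with a built-in standing in for the hypothetical my_ext.do_something(int) from upthread:

-- text -> OID and back; to_regprocedure() returns NULL instead of
-- raising an error when the named function does not exist.
SELECT 'pg_catalog.lower(text)'::regprocedure::oid AS proc_oid,
       'pg_catalog.lower(text)'::regprocedure      AS proc_signature,
       to_regprocedure('no_such_func(integer)')    AS missing;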
BRIN minmax multi - incorrect distance for infinite timestamp/date
Hi, Ashutosh Bapat reported me off-list a possible issue in how BRIN minmax-multi calculate distance for infinite timestamp/date values. The current code does this: if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2)) PG_RETURN_FLOAT8(0); so means infinite values are "very close" to any other value, and thus likely to be merged into a summary range. That's exactly the opposite of what we want to do, possibly resulting in inefficient indexes. Consider this example create table test (a timestamptz) with (fillfactor=50); insert into test select (now() + ((1 * random())::int || ' seconds')::interval) from generate_series(1,100) s(i); update test set a = '-infinity'::timestamptz where random() < 0.01; update test set a = 'infinity'::timestamptz where random() < 0.01; explain (analyze, timing off, costs off) select * from test where a = '2024-01-01'::timestamptz; QUERY PLAN -- Bitmap Heap Scan on test (actual rows=0 loops=1) Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone) Rows Removed by Index Recheck: 680662 Heap Blocks: lossy=6024 -> Bitmap Index Scan on test_a_idx (actual rows=60240 loops=1) Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone) Planning Time: 0.075 ms Execution Time: 106.871 ms (8 rows) Clearly, large part of the table gets scanned - this happens because when building the index, we end up with ranges like this: [-infinity,a,b,c,...,x,y,z,infinity] and we conclude that distance for [-infinity,a] is 0, and we combine these values into a range. And the same for [z,infinity]. But we should do exactly the opposite thing - never merge those. Attached is a patch fixing this, with which the plan looks like this: QUERY PLAN -- Bitmap Heap Scan on test (actual rows=0 loops=1) Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone) -> Bitmap Index Scan on test_a_idx (actual rows=0 loops=1) Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone) Planning Time: 0.289 ms Execution Time: 9.432 ms (6 rows) Which seems much better. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c index f8b2a3f9bc..c8775c274e 100644 --- a/src/backend/access/brin/brin_minmax_multi.c +++ b/src/backend/access/brin/brin_minmax_multi.c @@ -2084,8 +2084,14 @@ brin_minmax_multi_distance_date(PG_FUNCTION_ARGS) DateADT dateVal1 = PG_GETARG_DATEADT(0); DateADT dateVal2 = PG_GETARG_DATEADT(1); + /* + * If either value is infinite, we treat them as in infinite distance. + * We deduplicate the values before calculating distances for them, so + * either one value is finite, or the sign is different - so the + * inifinite distance is appropriate for both cases. + */ if (DATE_NOT_FINITE(dateVal1) || DATE_NOT_FINITE(dateVal2)) - PG_RETURN_FLOAT8(0); + PG_RETURN_FLOAT8(get_float8_infinity()); PG_RETURN_FLOAT8(dateVal1 - dateVal2); } @@ -2141,8 +2147,14 @@ brin_minmax_multi_distance_timestamp(PG_FUNCTION_ARGS) Timestamp dt1 = PG_GETARG_TIMESTAMP(0); Timestamp dt2 = PG_GETARG_TIMESTAMP(1); + /* + * If either value is infinite, we treat them as in infinite distance. + * We deduplicate the values before calculating distances for them, so + * either one value is finite, or the sign is different - so the + * inifinite distance is appropriate for both cases. + */ if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2)) - PG_RETURN_FLOAT8(0); + PG_RETURN_FLOAT8(get_float8_infinity()); delta = dt2 - dt1;
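For anyone wanting to see the summarized ranges that end up absorbing the infinite values, a sketch using the pageinspect extension (assuming block 2 is the first regular BRIN page, which is the usual layout after the metapage and revmap; test_a_idx is the index from the plans above):

-- Inspect the ranges stored in the BRIN index built in the example above.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, attnum, value
FROM brin_page_items(get_raw_page('test_a_idx', 2), 'test_a_idx');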
Re: On login trigger: take three
On Thu, Oct 12, 2023 at 8:35 PM Robert Haas wrote: > On Tue, Oct 10, 2023 at 3:43 PM Alexander Korotkov > wrote: > > Yep, in v43 it worked that way. One transaction has to wait for > > another finishing update of pg_database tuple, then fails. This is > > obviously ridiculous. Since overlapping setters of flag will have to > > wait anyway, I changed lock mode in v44 for them to > > AccessExclusiveLock. Now, waiting transaction then sees the updated > > tuple and doesn't fail. > > Doesn't that mean that if you create the first login trigger in a > database and leave the transaction open, nobody can connect to that > database until the transaction ends? It doesn't mean that, because when trying to reset the flag v44 takes a conditional lock. So, if another transaction is holding the lock, we will just skip resetting the flag, and the flag will be cleared on the first connection after that transaction ends. -- Regards, Alexander Korotkov
Re: Wait events for delayed checkpoints
On Thu, Oct 12, 2023 at 01:32:29PM -0400, Robert Haas wrote: > IPC seems right to me. Yeah, a timeout is being used, but as you say, > that's an implementation detail. > > +1 for the idea, too. Agreed that a timeout makes little sense in this context, and IPC looks correct. +pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START); do { pg_usleep(10000L); /* wait for 10 msec */ } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_START)); +pgstat_report_wait_end(); HaveVirtualXIDsDelayingChkpt() immediately does a LWLockAcquire() which would itself report a wait event for ProcArrayLock, overwriting this new one, no? -- Michael signature.asc Description: PGP signature
Re: interval_ops shall stop using btequalimage (deduplication)
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote: > On Wed, Oct 11, 2023 at 11:38 AM Noah Misch wrote: > > Interesting. So, >99% of interval-type indexes, even ones WITH > > (deduplicate_items=off), will get amcheck failures. The <1% of exceptions > > might include indexes having allequalimage=off due to an additional column, > > e.g. a two-column (interval, numeric) index. If interval indexes are common > > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are > > relatively > > rare, that could argue for giving amcheck a special case. Specifically, > > downgrade its "metapage incorrectly indicates that deduplication is safe" > > from > > ERROR to WARNING for interval_ops only. > > I am not aware of any user actually running "deduplicate_items = off" > in production, for any index. It was added purely as a defensive thing > -- not because I anticipated any real need to disable deduplication. > Deduplication was optimized for being enabled by default. Sure. Low-importance background information: deduplicate_items=off got on my radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off) would clear allequalimage. If it had, we could have advised people to use ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck --heapallindexed". ALTER INDEX doesn't do that, ruling out that idea. > > Without that special case (i.e. with > > the v1 patch), the release notes should probably resemble, "After updating, > > run REINDEX on all indexes having an interval-type column." > > +1 > > > There's little > > point in recommending pg_amcheck if >99% will fail. I'm inclined to bet > > that > > interval-type indexes are rare, so I lean against adding the amcheck special > > case. It's not a strong preference. Other opinions? > exactly one case like that post-fix (interval_ops is at least the only > affected core code opfamily), so why not point that out directly with > a HINT? A HINT could go a long way towards putting the problem in > context, without really adding a special case, and without any real > question of users being misled. Works for me. Added. Author: Noah Misch Commit: Noah Misch Dissociate btequalimage() from interval_ops, ending its deduplication. Under interval_ops, some equal values are distinguishable. One such pair is '24:00:00' and '1 day'. With that being so, btequalimage() breaches the documented contract for the "equalimage" btree support function. This can cause incorrect results from index-only scans. Users should REINDEX any btree indexes having interval-type columns. After updating, pg_amcheck will report an error for almost all such indexes. This fix makes interval_ops simply omit the support function, like numeric_ops does. Back-pack to v13, where btequalimage() first appeared. In back branches, for the benefit of old catalog content, btequalimage() code will return false for type "interval". Going forward, back-branch initdb will include the catalog change. Reviewed by Peter Geoghegan. 
Discussion: https://postgr.es/m/20231011013317.22.nmi...@google.com diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c index dbb83d8..3e07a3e 100644 --- a/contrib/amcheck/verify_nbtree.c +++ b/contrib/amcheck/verify_nbtree.c @@ -31,6 +31,7 @@ #include "access/xact.h" #include "catalog/index.h" #include "catalog/pg_am.h" +#include "catalog/pg_opfamily_d.h" #include "commands/tablecmds.h" #include "common/pg_prng.h" #include "lib/bloomfilter.h" @@ -338,10 +339,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed, errmsg("index \"%s\" metapage has equalimage field set on unsupported nbtree version", RelationGetRelationName(indrel; if (allequalimage && !_bt_allequalimage(indrel, false)) + { + boolhas_interval_ops = false; + + for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(indrel); i++) + if (indrel->rd_opfamily[i] == INTERVAL_BTREE_FAM_OID) + has_interval_ops = true; ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), errmsg("index \"%s\" metapage incorrectly indicates that deduplication is safe", - RelationGetRelationName(indrel; + RelationGetRelationName(indrel)), +has_interval_ops +? errhint("This is known of \"interval\" indexes last built on a version predating 2023-11.") +: 0)); +
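Since the guidance will be to REINDEX any btree indexes having interval-type columns, an untested sketch of how users could enumerate the affected indexes (key columns only; expression columns are covered too, since pg_attribute stores the result type for index columns):

SELECT DISTINCT i.indexrelid::regclass AS index_to_reindex
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
JOIN pg_am AS am ON am.oid = c.relam AND am.amname = 'btree'
JOIN pg_attribute AS a ON a.attrelid = i.indexrelid AND a.attnum <= i.indnkeyatts
WHERE a.atttypid = 'interval'::regtype;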
Re: A new strategy for pull-up correlated ANY_SUBLINK
On 12.10.2023 10:52, Andy Fan wrote: Unfortunately, I found a request when sublink did not pull-up, as in the examples above. I couldn't quite figure out why. I'm not sure what you mean with the "above", I guess it should be the "below"? Yes, you are right) explain (analyze, costs off, buffers) select b.x, b.x, a.y from b left join a on b.x=a.x and *b.t in (select max(a0.t) * from a a0 where a0.x = b.x and a0.t = b.t); ... SubPlan 2 Here the sublink can't be pulled up because of its reference to the LHS of left join, the original logic is that no matter the 'b.t in ..' returns the true or false, the rows in LHS will be returned. If we pull it up to LHS, some rows in LHS will be filtered out, which breaks its original semantics. Thanks for the explanation, it became more clear to me here. I thought it would be: explain (analyze, costs off, buffers) select b.x, b.x, a.y from b left join a on b.x=a.x and *b.t = (select max(a0.t) * from a a0 where a0.x = b.x and a0.t <= b.t); QUERY PLAN - Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1) Hash Cond: (a.x = b.x) *Join Filter: (b.t = (SubPlan 2))* Buffers: shared hit=3546 -> Seq Scan on a (actual time=0.022..17.109 rows=10 loops=1) Buffers: shared hit=541 -> Hash (actual time=1.065..1.068 rows=1000 loops=1) Buckets: 4096 Batches: 1 Memory Usage: 72kB Buffers: shared hit=5 -> Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1) Buffers: shared hit=5 SubPlan 2 -> Result (actual time=0.025..0.025 rows=1 loops=1000) Buffers: shared hit=3000 InitPlan 1 (returns $2) -> Limit (actual time=0.024..0.024 rows=1 loops=1000) Buffers: shared hit=3000 -> Index Only Scan Backward using a_t_x_idx on a a0 (actual time=0.023..0.023 rows=1 loops=1000) Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND (x = b.x)) Heap Fetches: 1000 Buffers: shared hit=3000 Planning Time: 0.689 ms Execution Time: 68.220 ms (23 rows) If you noticed, it became possible after replacing the "in" operator with "=". I didn't notice much difference between the 'in' and '=', maybe I missed something? It seems to me that the expressions "=" and "IN" are equivalent here due to the fact that the aggregated subquery returns only one value, and the result with the "IN" operation can be considered as the intersection of elements on the left and right. In this query, we have some kind of set on the left, among which there will be found or not only one element on the right. In general, this expression can be considered as b=const, so push down will be applied to b and we can filter b during its scanning by the subquery's result. But I think your explanation is necessary here, that this is all possible, because we can pull up the sublink here, since filtering is allowed on the right side (the nullable side) and does not break the semantics of LHS. But in contrast, I also added two queries where pull-up is impossible and it is not done here. Otherwise if filtering was applied on the left it would be mistake. To be honest, I'm not sure if this explanation is needed in the test anymore, so I didn't add it. 
explain (costs off) SELECT * FROM tenk1 A LEFT JOIN tenk2 B ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd); QUERY PLAN - Nested Loop Left Join Join Filter: (SubPlan 2) -> Seq Scan on tenk1 a -> Materialize -> Seq Scan on tenk2 b SubPlan 2 -> Result InitPlan 1 (returns $1) -> Limit -> Index Scan using tenk2_hundred on tenk2 c Index Cond: (hundred IS NOT NULL) Filter: (odd = b.odd) (12 rows) explain (costs off) SELECT * FROM tenk1 A LEFT JOIN tenk2 B ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd)); QUERY PLAN --- Nested Loop Left Join Join Filter: (hashed SubPlan 1) -> Seq Scan on tenk1 a -> Materialize -> Seq Scan on tenk2 b SubPlan 1 -> HashAggregate Group Key: c.odd -> Seq Scan on tenk2 c (9 rows) I took the liberty of adding this to your patch and added myself as reviewer, if you
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote: > After some more thought, I think we could massage the "pg_control in > backup_label" method into something that could be back patched, with more > advanced features (e.g. error on backup_label and pg_control both present on > initial cluster start) saved for HEAD. I doubt that anything changed in this area would be in the backpatchable zone, particularly as it would involve protocol changes within the replication commands, so I'd recommend focusing on HEAD. Backward-compatibility is not much of a concern as long as only the backend is involved. The real problem here would be on the frontend side, and how much effort we should put into keeping the pg_basebackup code compatible with older backends. -- Michael signature.asc Description: PGP signature
Re: interval_ops shall stop using btequalimage (deduplication)
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch wrote: > > exactly one case like that post-fix (interval_ops is at least the only > > affected core code opfamily), so why not point that out directly with > > a HINT? A HINT could go a long way towards putting the problem in > > context, without really adding a special case, and without any real > > question of users being misled. > > Works for me. Added. Looks good. Thanks! -- Peter Geoghegan
Re: Making aggregate deserialization (and WAL receive) functions slightly faster
On Wed, 11 Oct 2023 at 08:52, Tom Lane wrote: > > David Rowley writes: > > I've attached a slightly more worked on patch that makes maxlen == 0 > > mean read-only. Unsure if a macro is worthwhile there or not. > > A few thoughts: Thank you for the review. I spent more time on this and did end up with 2 new init functions as you mentioned. One for strictly read-only (initReadOnlyStringInfo), which cannot be appended to, and as you mentioned, another (initStringInfoFromString) which can accept a palloc'd buffer which becomes managed by the stringinfo code. I know these names aren't exactly as you mentioned. I'm open to adjusting still. This means I got rid of the read-only conversion code in enlargeStringInfo(). I didn't do anything to try to handle buffer enlargement more efficiently in enlargeStringInfo() for the case where initStringInfoFromString sets maxlen to some non-power-of-2. The doubling code seems like it'll work ok without power-of-2 values, it'll just end up calling repalloc() with non-power-of-2 values. I did also wonder if resetStringInfo() would have any business touching the existing buffer in a read-only StringInfo and came to the conclusion that it wouldn't be very read-only if we allowed resetStringInfo() to do its thing on it. I added an Assert to fail if resetStringInfo() receives a read-only StringInfo. Also, since it's still being discussed, I left out the adjustment to LogicalParallelApplyLoop(). That also allows the tests to pass without the failing Assert that was checking for the NUL terminator. David diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index d52c8963eb..ce9d5b4059 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -879,6 +879,7 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple) /* Read the data */ for (i = 0; i < natts; i++) { + char *buff; charkind; int len; StringInfo value = &tuple->colvalues[i]; @@ -899,19 +900,16 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple) len = pq_getmsgint(in, 4); /* read length */ /* and data */ - value->data = palloc(len + 1); - pq_copymsgbytes(in, value->data, len); + buff = palloc(len + 1); + pq_copymsgbytes(in, buff, len); /* * Not strictly necessary for LOGICALREP_COLUMN_BINARY, but * per StringInfo practice. */ - value->data[len] = '\0'; + buff[len] = '\0'; - /* make StringInfo fully valid */ - value->len = len; - value->cursor = 0; - value->maxlen = len; + initStringInfoFromString(value, buff, len); break; default: elog(ERROR, "unrecognized data representation type '%c'", kind); diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 597947410f..b574188d70 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -3582,10 +3582,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received) /* Ensure we are reading the data into our memory context. 
*/ MemoryContextSwitchTo(ApplyMessageContext); - s.data = buf; - s.len = len; - s.cursor = 0; - s.maxlen = -1; + initReadOnlyStringInfo(&s, buf, len); c = pq_getmsgbyte(&s); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f3c9f1f9ba..94b963d6e6 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -1816,23 +1816,19 @@ exec_bind_message(StringInfo input_message) if (!isNull) { - const char *pvalue = pq_getmsgbytes(input_message, plength); + char *pvalue; /* -* Rather than copying data around, we just set up a phony -* StringInfo pointing to the correct portion of the message -* buffer. We assume we can scribble on th
Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)
On Thu, Oct 12, 2023 at 02:00:00PM +0300, Alexander Lakhin wrote: > So to fail on the test, skink should perform at least twice slower than > usual, and may be it's an extraordinary condition indeed, but on the other > hand, may be increase checkpoint_timeout as already done in several tests > (015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...). > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11 > [2] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11 Thanks for the investigation. Increasing the checkpoint timeout is not a perfect science, but at least it would work until a machine manages to be slower than the new limit, so I would be OK with your suggestion to raise the bar a bit more and prevent the race created by these extra time-triggered checkpoints. -- Michael signature.asc Description: PGP signature
Re: LLVM 16 (opaque pointers)
Hi, On 2023-10-11 21:59:50 +1300, Thomas Munro wrote: > +#else > + LLVMPassBuilderOptionsRef options; > + LLVMErrorRef err; > + int compile_optlevel; > + char *passes; > + > + if (context->base.flags & PGJIT_OPT3) > + compile_optlevel = 3; > + else > + compile_optlevel = 0; > + > + passes = > psprintf("default,mem2reg,function(no-op-function),no-op-module", > + compile_optlevel); I don't think the "function(no-op-function),no-op-module" bit does something particularly useful? I also don't think we should add the mem2reg pass outside of -O0 - running it after a real optimization pipeline doesn't seem useful and might even make the code worse? mem2reg is included in default (and obviously also in O3). Thanks for working on this stuff! I'm working on setting up buildfarm animals for 16, 17, each once with a normal and an assertion enabled LLVM build. Greetings, Andres Freund
Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)
Hi, On 2023-10-12 14:00:00 +0300, Alexander Lakhin wrote: > So to fail on the test, skink should perform at least twice slower than > usual The machine skink is hosted on runs numerous buildfarm animals (24 I think right now, about to be 28). While it has plenty of resources (16 cores/32 threads, 128GB RAM), test runtime is still pretty variable depending on what other tests are running at the same time... Greetings, Andres Freund
Re: Add support for AT LOCAL
On 10/10/23 05:34, Michael Paquier wrote: I am attaching a v5 that addresses the documentation bits, could you look at the business with date.c? Here is a v6 which hopefully addresses all of your concerns. -- Vik Fearing From 042ce9b581ca3b17afbf229d209ca59addb6c9a2 Mon Sep 17 00:00:00 2001 From: Vik Fearing Date: Wed, 4 Oct 2023 15:46:38 +0100 Subject: [PATCH v6] Add support for AT LOCAL When converting a timestamp to/from with/without time zone, the SQL Standard specifies an AT LOCAL variant of AT TIME ZONE which uses the session's time zone. --- doc/src/sgml/func.sgml| 103 +- src/backend/parser/gram.y | 7 ++ src/backend/utils/adt/date.c | 14 +++ src/backend/utils/adt/ruleutils.c | 10 +++ src/backend/utils/adt/timestamp.c | 20 + src/include/catalog/pg_proc.dat | 9 ++ src/test/regress/expected/timestamptz.out | 47 ++ src/test/regress/expected/timetz.out | 39 src/test/regress/sql/timestamptz.sql | 21 + src/test/regress/sql/timetz.sql | 17 10 files changed, 284 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index f1ad64c3d6..ce62cb37b5 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -10604,42 +10604,46 @@ SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-0 that date_bin can truncate to an arbitrary interval. The stride interval must be greater than zero and cannot contain units of month or larger. - AT TIME ZONE + AT TIME ZONE and AT LOCAL time zone conversion AT TIME ZONE + +AT LOCAL + + The AT TIME ZONE operator converts time stamp without time zone to/from time stamp with time zone, and time with time zone values to different time zones. shows its variants. - AT TIME ZONE Variants + AT TIME ZONE and AT LOCAL Variants Operator Description @@ -10658,93 +10662,186 @@ SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-0 Converts given time stamp without time zone to time stamp with time zone, assuming the given value is in the named time zone. timestamp '2001-02-16 20:38:40' at time zone 'America/Denver' 2001-02-17 03:38:40+00 + + + timestamp without time zone AT LOCAL + timestamp with time zone + + + Converts given time stamp without time zone to + time stamp with the session's + TimeZone value as time zone. + + + timestamp '2001-02-16 20:38:40' at local + 2001-02-17 03:38:40+00 + + + timestamp with time zone AT TIME ZONE zone timestamp without time zone Converts given time stamp with time zone to time stamp without time zone, as the time would appear in that zone. timestamp with time zone '2001-02-16 20:38:40-05' at time zone 'America/Denver' 2001-02-16 18:38:40 + + + timestamp with time zone AT LOCAL + timestamp without time zone + + + Converts given time stamp with time zone to + time stamp without time zone, as the time would + appear with the session's TimeZone value as time zone. + + + timestamp with time zone '2001-02-16 20:38:40-05' at local + 2001-02-16 18:38:40 + + + time with time zone AT TIME ZONE zone time with time zone Converts given time with time zone to a new time zone. Since no date is supplied, this uses the currently active UTC offset for the named destination zone. time with time zone '05:34:17-05' at time zone 'UTC' 10:34:17+00 + + + + time with time zone AT LOCAL + time with time zone + + + Converts given time with time zone to a new time + zone. Since no date is supplied, this uses the currently active UTC + offset for the session's TimeZone value. 
+ + + Assuming the session's TimeZone is set to UTC: + + + time with time zone '05:34:17-05' at local + 10:34:17+00 + + In these expressions, the desired time zone zone can be specified either as a text value (e.g., 'America/Los_Angeles') or as an interval (e.g., INTERVAL '-08:00'). In the text case, a ti
Re: CHECK Constraint Deferrable
On 10/10/23 15:12, Robert Haas wrote: On Mon, Oct 9, 2023 at 5:07 PM David G. Johnston wrote: 2. I don't think it's a good idea for the same patch to try to solve two problems unless they are so closely related that solving one without solving the other is not sensible. A NOT NULL constraint apparently is just a special case of a check constraint which seems closely related enough to match your definition. Yes, that might be true. I suppose I'd like to hear from the patch author(s) about that. I'm somewhat coming around to your idea that maybe both should be covered together, but I'm not the one writing the patch. Álvaro Herrera has put (and is still putting) immense effort into turning NOT NULL into a CHECK constraint. Honestly, I don't see why the two patches need to be combined. -- Vik Fearing
Re: Some performance degradation in REL_16 vs REL_15
On Thu, Oct 12, 2023 at 09:20:36PM +1300, David Rowley wrote: > It would be interesting to know what's to blame here and if you can > attribute it to a certain commit. +1. -- Michael signature.asc Description: PGP signature
Re: SQL:2011 application time
On 10/11/23 05:47, Paul Jungwirth wrote: +SELECT pg_get_indexdef(conindid, 0, true) FROM pg_constraint WHERE conname = 'temporal_rng_pk'; + pg_get_indexdef +--- + CREATE UNIQUE INDEX temporal_rng_pk ON temporal_rng USING gist (id, valid_at) Shouldn't this somehow show the operator classes for the columns? We are using different operator classes for the id and valid_at columns, aren't we? We only print the operator classes if they are not the default, so they don't appear here. I do suspect something more is desirable though. For exclusion constraints we replace everything before the columns with just "EXCLUDE USING gist". I could embed WITHOUT OVERLAPS but it's not valid syntax in CREATE INDEX. Let me know if you have any ideas. Why not? The standard does not mention indexes (although some discussions last week might change that) so we can change the syntax for it as we wish. Doing so would also allow us to use ALTER TABLE ... USING INDEX for such things. -- Vik Fearing
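For readers following along, a sketch of the kind of table the quoted \d+ output presumably comes from, assuming the syntax the patch introduces (the column types are guessed; only the table/column names and the WITHOUT OVERLAPS part are taken from the thread):

CREATE TABLE temporal_rng (
  id int4range,
  valid_at tsrange,
  CONSTRAINT temporal_rng_pk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS)
);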
Re: Tab completion for AT TIME ZONE
On 10/12/23 10:27, Dagfinn Ilmari Mannsåker wrote: Michael Paquier writes: On Fri, Apr 14, 2023 at 12:05:25PM +0200, Jim Jones wrote: The patch applies cleanly and it does what it is proposing. - and it's IMHO a very nice addition. I've marked the CF entry as "Ready for Committer". +/* ... AT TIME ZONE ... */ + else if (TailMatches("AT")) + COMPLETE_WITH("TIME ZONE"); + else if (TailMatches("AT", "TIME")) + COMPLETE_WITH("ZONE"); + else if (TailMatches("AT", "TIME", "ZONE")) + COMPLETE_WITH_TIMEZONE_NAME(); This style will for the completion of timezone values even if "AT" is the first word of a query. Shouldn't this be more selective by making sure that we are at least in the context of a SELECT query? It's valid anywhere an expression is, which is a lot more places than just SELECT queries. Off the top of my head I can think of WITH, INSERT, UPDATE, VALUES, CALL, CREATE TABLE, CREATE INDEX. As I mentioned upthread, the only place in the grammar where the word AT occurs is in AT TIME ZONE, so there's no ambiguity. Also, it doesn't complete time zone names after AT, it completes the literal words TIME ZONE, and you have to then hit tab again to get a list of time zones. If we (or the SQL committee) were to invent more operators that start with the word AT, we can add those to the first if clause above and complete with the appropriate values after each one separately. Speaking of this... The SQL committee already has another operator starting with AT which is AT LOCAL. I am implementing it in https://commitfest.postgresql.org/45/4343/ where I humbly admit that I did not think of psql tab completion at all. These two patches are co-dependent and whichever goes in first the other will need to be adjusted accordingly. -- Vik Fearing
Re: On login trigger: take three
On Thu, Oct 12, 2023 at 6:54 PM Alexander Korotkov wrote: > On Thu, Oct 12, 2023 at 8:35 PM Robert Haas wrote: > > Doesn't that mean that if you create the first login trigger in a > > database and leave the transaction open, nobody can connect to that > > database until the transaction ends? > > It doesn't mean that, because when trying to reset the flag v44 does > conditional lock. So, if another transaction is holding the log we > will just skip resetting the flag. So, the flag will be cleared on > the first connection after that transaction ends. But in the scenario I am describing the flag is being set, not reset. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Wait events for delayed checkpoints
On Thu, Oct 12, 2023 at 7:09 PM Michael Paquier wrote: > On Thu, Oct 12, 2023 at 01:32:29PM -0400, Robert Haas wrote: > > IPC seems right to me. Yeah, a timeout is being used, but as you say, > > that's an implementation detail. > > > > +1 for the idea, too. > > Agreed that timeout makes little sense in this context, and IPC looks > correct. > > +pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START); > do > { > pg_usleep(1L);/* wait for 10 msec */ > } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, >DELAY_CHKPT_START)); > +pgstat_report_wait_end(); > > HaveVirtualXIDsDelayingChkpt() does immediately a LWLockAcquire() > which would itself report a wait event for ProcArrayLock, overwriting > this new one, no? Ah, right: the wait event should be set and cleared around pg_usleep, not the whole loop. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PostgreSQL domains and NOT NULL constraint
On 10/12/23 15:54, Tom Lane wrote: Erki Eessaar writes: PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way how one can add NULL's to a column that has a domain with the NOT NULL constraint. https://www.postgresql.org/docs/current/sql-createdomain.html To me it seems very strange and amounts to a bug because it defeats the purpose of domains (to be a reusable assets) and constraints (to avoid any bypassing of these). I doubt we'd consider doing anything about that. The whole business of domains with NOT NULL constraints is arguably a defect of the SQL standard, because there are multiple ways to produce a value that is NULL and yet must be considered to be of the domain type. The subselect-with-no-output case that you show isn't even the most common one; I'd say that outer joins where there are domain columns on the nullable side are the biggest problem. There's been some discussion of treating the output of such a join, subselect, etc as being of the domain's base type not the domain proper. That'd solve this particular issue since then we'd decide we have to cast the base type back up to the domain type (and hence check its constraints) before inserting the row. But that choice just moves the surprise factor somewhere else, in that queries that used to produce one data type now produce another one. There are applications that this would break. Moreover, I do not think there's any justification for it in the SQL spec. I do not believe this is a defect of the SQL standard at all. SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a domain is to constrain the set of valid values that can be stored in a column of a base table by various operations." That seems very clear to me that *storing* a value in a base table must respect the domain's constraints, even if *operations* on those values might not respect all of the domain's constraints. Whether or not it is practical to implement that is a different story, but allowing the null value to be stored in a column of a base table whose domain specifies NOT NULL is frankly a bug. -- Vik Fearing
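For context, a minimal sketch of the loophole being referenced, modeled on the note in the CREATE DOMAIN documentation: a scalar subquery with no output rows yields a NULL that already has the domain type, so no coercion (and therefore no constraint check) happens on the way into the base table.

CREATE DOMAIN dnotnull AS integer NOT NULL;
CREATE TABLE t (d dnotnull);
INSERT INTO t VALUES ((SELECT d FROM t WHERE false));  -- no error raised
SELECT d IS NULL AS leaked FROM t;                     -- returns true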
Re: PostgreSQL domains and NOT NULL constraint
Vik Fearing writes: > On 10/12/23 15:54, Tom Lane wrote: >> There's been some discussion of treating the output of such a join, >> subselect, etc as being of the domain's base type not the domain >> proper. That'd solve this particular issue since then we'd decide >> we have to cast the base type back up to the domain type (and hence >> check its constraints) before inserting the row. But that choice >> just moves the surprise factor somewhere else, in that queries that >> used to produce one data type now produce another one. There are >> applications that this would break. Moreover, I do not think there's >> any justification for it in the SQL spec. > I do not believe this is a defect of the SQL standard at all. > SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a > domain is to constrain the set of valid values that can be stored in a > column of a base table by various operations." So I wonder what is the standard's interpretation of regression=# create domain dpos as integer not null check (value > 0); CREATE DOMAIN regression=# create table t1 (x int, d dpos); CREATE TABLE regression=# create view v1 as select ty.d from t1 tx left join t1 ty using (x); CREATE VIEW regression=# \d+ v1 View "public.v1" Column | Type | Collation | Nullable | Default | Storage | Description +--+---+--+-+-+- d | dpos | | | | plain | View definition: SELECT ty.d FROM t1 tx LEFT JOIN t1 ty USING (x); If we are incorrect in ascribing the type "dpos" to v1.d, where in the spec contradicts that? (Or in other words, 4.14 might lay out some goals for the feature, but that's just empty words if it's not supported by accurate details in other places.) regards, tom lane
Re: Some performance degradation in REL_16 vs REL_15
Hi, On 2023-10-12 11:00:22 +0300, Anton A. Melnikov wrote: > Found that simple test pgbench -c20 -T20 -j8 gives approximately > for REL_15_STABLE at 5143f76: 336+-1 TPS > and > for REL_16_STABLE at 4ac7635f: 324+-1 TPS > > The performance drop is approximately 3,5% while the corrected standard > deviation is only 0.3%. > See the raw_data.txt attached. Could you provide a bit more details about how you ran the benchmark? The reason I am asking is that ~330 TPS is pretty slow for -c20. Even on spinning rust and using the default settings, I get considerably higher results. Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot of row level contention. What scale did you use? Greetings, Andres Freund
Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set
On Tue, Sep 26, 2023 at 08:13:45AM +0200, Daniel Gustafsson wrote: >> On 26 Sep 2023, at 00:20, Nathan Bossart wrote: >> >> On Thu, Sep 21, 2023 at 11:18:00AM +0900, bt23nguyent wrote: >>> -basic_archive_configured(ArchiveModuleState *state) >>> +basic_archive_configured(ArchiveModuleState *state, char **logdetail) >> >> Could we do something more like GUC_check_errdetail() instead to maintain >> backward compatibility with v16? > > We'd still need something exported to call into which isn't in 16, so it > wouldn't be more than optically backwards compatible since a module written > for > 17 won't compile for 16, or am I missing something? I only mean that a module written for v16 could continue to be used in v17 without any changes. You are right that a module that uses this new functionality wouldn't compile for v16. But IMHO the interface is nicer, too, since module authors wouldn't need to worry about allocating the space for the string or formatting the message. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: PostgreSQL domains and NOT NULL constraint
On 10/13/23 02:44, Tom Lane wrote: Vik Fearing writes: On 10/12/23 15:54, Tom Lane wrote: There's been some discussion of treating the output of such a join, subselect, etc as being of the domain's base type not the domain proper. That'd solve this particular issue since then we'd decide we have to cast the base type back up to the domain type (and hence check its constraints) before inserting the row. But that choice just moves the surprise factor somewhere else, in that queries that used to produce one data type now produce another one. There are applications that this would break. Moreover, I do not think there's any justification for it in the SQL spec. I do not believe this is a defect of the SQL standard at all. SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a domain is to constrain the set of valid values that can be stored in a column of a base table by various operations." So I wonder what is the standard's interpretation of regression=# create domain dpos as integer not null check (value > 0); CREATE DOMAIN regression=# create table t1 (x int, d dpos); CREATE TABLE regression=# create view v1 as select ty.d from t1 tx left join t1 ty using (x); CREATE VIEW regression=# \d+ v1 View "public.v1" Column | Type | Collation | Nullable | Default | Storage | Description +--+---+--+-+-+- d | dpos | | | | plain | View definition: SELECT ty.d FROM t1 tx LEFT JOIN t1 ty USING (x); If we are incorrect in ascribing the type "dpos" to v1.d, where in the spec contradicts that? (Or in other words, 4.14 might lay out some goals for the feature, but that's just empty words if it's not supported by accurate details in other places.) Objection, Your Honor: Relevance. Regardless of what the spec may or may not say about v1.d, it still remains that nulls should not be allowed in a *base table* if the domain says nulls are not allowed. Not mentioned in this thread but the constraints are also applied when CASTing to the domain. Now, to answer your straw man, this might be helpful: SQL:2023-2 Section 11.4 Syntax Rule 9, "If the descriptor of D includes any domain constraint descriptors, then T shall be a persistent base table.". Your v1 is not that and therefore arguably illegal. As you know, I am more than happy to (try to) amend the spec where needed, but Erki's complaint of a null value being allowed in a base table is clearly a bug in our implementation regardless of what we do with views. -- Vik Fearing
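To make the CAST point concrete, using the dpos domain from upthread: an explicit coercion does re-check the domain constraints, which is why forcing a re-cast on the way into a base table would close the hole.

SELECT CAST(NULL AS dpos);            -- ERROR: domain dpos does not allow null values
SELECT CAST(-1 AS dpos);              -- ERROR: value for domain dpos violates check constraint
SELECT (SELECT 1 WHERE false)::dpos;  -- also an error: the null result is re-checked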
Re: Removing unneeded self joins
On 12/10/2023 18:32, Alexander Korotkov wrote: On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov We have almost the results we wanted to have. But in the last explain you can see that nothing happened with the OR clause. We should use the expression mutator instead of walker to handle such clauses. But It doesn't process the RestrictInfo node ... I'm inclined to put a solution of this issue off for a while. OK. I think it doesn't worth to eliminate IS NULL quals with this complexity (at least at this stage of work). Yeah. I think It would be meaningful in the case of replacing also nested x IS NOT NULL with nothing. But it requires using a mutator instead of the walker and may be done more accurately next time. I made improvements over the code. Mostly new comments, grammar corrections of existing comments and small refactoring. Great! Also, I found that the suggestion from David Rowley [1] to qsort array of relations to faster find duplicates is still unaddressed. I've implemented it. That helps to evade quadratic complexity with large number of relations. I see. The thread is too long so far, thanks for the catch. Also I've incorporated improvements from Alena Rybakina except one for skipping SJ removal when no SJ quals is found. It's not yet clear for me if this check fix some cases. But at least optimization got skipped in some useful cases (as you can see in regression tests). Agree. I wouldn't say I like it too. But also, I suggest skipping some unnecessary assertions proposed in that patch: Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all the negative numbers, at least? Assert(is_opclause(orinfo->clause)); - above we skip clauses with rinfo->mergeopfamilies == NIL. Each mergejoinable clause is already checked as is_opclause. All these changes (see in the attachment) are optional. -- regards, Andrey Lepikhov Postgres Professional diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c index f0746f35a3..7b8dc7a2b7 100644 --- a/src/backend/optimizer/plan/analyzejoins.c +++ b/src/backend/optimizer/plan/analyzejoins.c @@ -1710,8 +1710,6 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark, List *binfo_candidates = NIL; ReplaceVarnoContext ctx = {.from = toRemove->relid,.to = toKeep->relid}; - Assert(toKeep->relid != -1); - /* * Replace index of removing table with the keeping one. The technique of * removing/distributing restrictinfo is used here to attach just appeared @@ -2017,8 +2015,6 @@ match_unique_clauses(PlannerInfo *root, RelOptInfo *outer, List *uclauses, /* Don't consider clauses which aren't similar to 'F(X)=G(Y)' */ continue; - Assert(is_opclause(orinfo->clause)); - oclause = bms_is_empty(orinfo->left_relids) ? get_rightop(orinfo->clause) : get_leftop(orinfo->clause); c2 = (bms_is_empty(orinfo->left_relids) ? 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 2ff4881fdf..96ebd6eed3 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -367,7 +367,6 @@ CatalogId CatalogIdMapEntry CatalogIndexState ChangeVarNodes_context -ReplaceVarnoContext CheckPoint CheckPointStmt CheckpointStatsData @@ -2341,6 +2340,7 @@ ReorderBufferUpdateProgressTxnCB ReorderTuple RepOriginId ReparameterizeForeignPathByChild_function +ReplaceVarnoContext ReplaceVarsFromTargetList_context ReplaceVarsNoMatchOption ReplicaIdentityStmt @@ -2474,6 +2474,7 @@ SeenRelsEntry SelectLimit SelectStmt Selectivity +SelfJoinCandidate SemTPadded SemiAntiJoinFactors SeqScan
Re: [dynahash] do not refill the hashkey after hash_search
On Thu, Sep 14, 2023 at 04:28:26PM +0800, Junwang Zhao wrote: > Add a v2 with some change to fix warnings about unused-parameter. > > I will add this to Commit Fest. This looks reasonable to me. I've marked the commitfest entry as ready-for-committer. I will plan on committing it in a couple of days unless John has additional feedback or would like to do the honors. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
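For readers who have not looked at the patch, the redundant pattern it removes looks roughly like the sketch below. The entry struct and key are invented for illustration; the relevant behavior is that hash_search() with HASH_ENTER copies the search key into a newly created entry by itself, so assigning the key again in the caller does nothing useful.

#include "postgres.h"
#include "utils/hsearch.h"

/* Hypothetical hash entry; the key member must come first. */
typedef struct MyEntry
{
	Oid			key;
	int			count;
} MyEntry;

static void
increment_count(HTAB *htab, Oid key)
{
	bool		found;
	MyEntry    *entry;

	entry = (MyEntry *) hash_search(htab, &key, HASH_ENTER, &found);
	if (!found)
	{
		/*
		 * dynahash has already copied 'key' into entry->key while creating
		 * the entry, so the once-common "entry->key = key;" line here is
		 * redundant; dropping such lines is what the patch does.
		 */
		entry->count = 0;
	}
	entry->count++;
}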
Re: PGDOCS - add more links in the pub/sub reference pages
On Thu, Oct 12, 2023 at 3:44 PM Amit Kapila wrote: > > On Mon, Oct 9, 2023 at 12:15 PM Peter Smith wrote: > > > > On Mon, Oct 9, 2023 at 3:32 PM Amit Kapila wrote: > > > > > > > In v1, I used the same pattern as on the CREATE SUBSCRIPTION page, > > which doesn't look like those... > > > > Yeah, I think it would have been better if we used params in the > CREATE SUBSCRIPTION page as well. I don't know if it is a good idea to > do now with this patch but it makes sense to be consistent. What do > you think? > OK, I have given those changes as separate patches: - 0002 (changes the CREATE PUBLICATION parameter ids) - 0003 (changes CREATE SUBSCRIPTION parameter ids) > > ~~~ > > > > The "Parameters" section describes some things that really are parameters: > > > > e.g. > > "sql-altersubscription-name" > > "sql-altersubscription-new-owner" > > "sql-altersubscription-new-name"> > > > > I agree, emphasising that those ones are parameters is better. Changed > > like this in v2. > > > > "sql-altersubscription-params-name" > > "sql-altersubscription-params-new-owner" > > "sql-altersubscription-params-new-name"> > > > > ~ > > > > But, the "Parameters" section also describes other SQL syntax clauses > > which are not really parameters at all. > > > > e.g. > > "sql-altersubscription-refresh-publication" > > "sql-altersubscription-enable" > > "sql-altersubscription-disable" > > > > So I felt those ones are more intuitive left as they are -- e.g., > > instead of having ids/linkends like: > > > > "sql-altersubscription-params-refresh-publication" > > "sql-altersubscription-params-enable" > > "sql-altersubscription-params-disable" > > > > I checked alter_role.sgml which has similar mixed usage and it is > using 'params' consistently in all cases. So, I would suggest > following a similar style here. > As you wish. Done that way in patch 0001. ~~ PSA the v5 patches. == Kind Regards, Peter Smith. Fujitsu Australia v5-0001-Add-more-pub-sub-links.patch Description: Binary data v5-0002-Change-ids-for-CREATE-PUBLICATION-parameters.patch Description: Binary data v5-0003-Change-ids-for-CREATE-SUBSCRIPTION-parameters.patch Description: Binary data
Re: Wait events for delayed checkpoints
On Fri, Oct 13, 2023 at 2:19 PM Robert Haas wrote: > On Thu, Oct 12, 2023 at 7:09 PM Michael Paquier wrote: > > HaveVirtualXIDsDelayingChkpt() does immediately a LWLockAcquire() > > which would itself report a wait event for ProcArrayLock, overwriting > > this new one, no? > > Ah, right: the wait event should be set and cleared around pg_usleep, > not the whole loop. Duh. Yeah. Pushed like that. Thanks both.
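For anyone skimming the thread, the narrowed scope looks roughly like the sketch below (paraphrased, not the committed hunk; WAIT_EVENT_CHECKPOINT_DELAY is a placeholder rather than the exact enum name). Because the report calls bracket only the sleep, whatever wait event HaveVirtualXIDsDelayingChkpt() reports while acquiring ProcArrayLock is no longer overwritten for the rest of the loop.

	do
	{
		/* Report the delay only while actually sleeping. */
		pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY);
		pg_usleep(10000L);		/* wait for 10 msec */
		pgstat_report_wait_end();
	} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_START));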
Re: pg_upgrade's interaction with pg_resetwal seems confusing
On Fri, Oct 13, 2023 at 12:00 AM Robert Haas wrote:
>
> On Thu, Oct 12, 2023 at 7:17 AM Amit Kapila wrote:
> > Now, as mentioned in the first paragraph, it seems we anyway don't
> > need to reset the WAL at the end when setting the next OID for the new
> > cluster with the -o option. If that is true, then I think even without
> > slots work it will be helpful to have such an option in pg_resetwal.
> >
> > Thoughts?
>
> I wonder if we should instead provide a way to reset the OID counter
> with a function call inside the database, gated by IsBinaryUpgrade.
>

I think the challenge in doing so would be that, when the server is running, a concurrent checkpoint can also update the OID counter value in the control file. See the code below:

CreateCheckPoint()
{
...
	LWLockAcquire(OidGenLock, LW_SHARED);
	checkPoint.nextOid = ShmemVariableCache->nextOid;
	if (!shutdown)
		checkPoint.nextOid += ShmemVariableCache->oidCount;
	LWLockRelease(OidGenLock);
...
	UpdateControlFile()
...
}

Now, we could try to pass some startup options such as checkpoint_timeout with a large value to ensure that a checkpoint won't interfere, but I'm not sure that would be bulletproof. Instead, how about allowing pg_upgrade to update the control file of the new cluster (with the required OID value) following the same method pg_resetwal uses in RewriteControlFile()?

> Having something like pg_resetwal --but-dont-actually-reset-the-wal
> seems both self-contradictory and vulnerable to abuse that we might be
> better off not inviting.
>

Fair point.

-- With Regards, Amit Kapila.
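To make the in-server alternative being discussed a bit more concrete, here is a rough and entirely hypothetical sketch of an IsBinaryUpgrade-gated helper in the style of the existing pg_upgrade_support.c functions; the function name is invented and this is not from any posted patch. As the CreateCheckPoint() excerpt above shows, the checkpointer reads the same counter under OidGenLock when it builds the control file contents, which is where the interaction described here comes from.

/* Hypothetical helper; not part of any posted patch. */
Datum
binary_upgrade_set_next_oid(PG_FUNCTION_ARGS)
{
	Oid			next_oid = PG_GETARG_OID(0);

	/* Errors out unless the server was started by pg_upgrade. */
	CHECK_IS_BINARY_UPGRADE;

	LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
	ShmemVariableCache->nextOid = next_oid;
	ShmemVariableCache->oidCount = 0;
	LWLockRelease(OidGenLock);

	PG_RETURN_VOID();
}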
Re: Add support for AT LOCAL
On Fri, Oct 13, 2023 at 02:20:59AM +0200, Vik Fearing wrote: > On 10/10/23 05:34, Michael Paquier wrote: > > I am attaching a v5 that addresses the documentation bits, could you > > look at the business with date.c? > > Here is a v6 Thanks for the new version. > which hopefully addresses all of your concerns. Mostly ;) The first thing I did was to extract the doc bits about timezone(zone, time) for AT TIME ZONE from v6 and apply them independently. I then looked at the rest, and it looked mostly OK to me, including the extra description you added for the fifth example in the docs. I tweaked a few things: adjusted the regression tests to make the views a bit more appealing to the eye, fixed an indentation issue so that koel would not complain, and did a catalog bump. Then I applied it. -- Michael signature.asc Description: PGP signature