Re: Problem, partition pruning for prepared statement with IS NULL clause.

2023-10-12 Thread David Rowley
On Wed, 11 Oct 2023 at 22:59, Sergei Glukhov  wrote:
> > Unfortunately, I'd not long sent the last email and realised that the
> > step_lastkeyno parameter is now unused and can just be removed from
> > both get_steps_using_prefix() and get_steps_using_prefix_recurse().
> > This requires some comment rewriting so I've attempted to do that too
> > in the attached updated version.
>
> Thanks, verified again and everything is fine!

Thanks for looking.  I spent quite a bit more time on this again today
and fiddled lots more with the comments and tests.

I also did more testing after finding a way to easily duplicate the
quals to cause multiple quals per partition key.  The equivalence
class code will only make ECs for mergejoin-able clauses, so if we
just find a type that's not mergejoin-able but hashable, we can
duplicate the quals with a hash partitioned table:

-- find a suitable non-mergejoin-able type.
select oprleft::regtype from pg_operator
where oprcanmerge=false and oprcanhash=true;
 oprleft
---------
 xid
 cid
 aclitem

create table hash_xid(a xid, b xid, c xid) partition by hash(a,b,c);
create table hash_xid1 partition of hash_xid for values with (modulus 2, remainder 0);
create table hash_xid2 partition of hash_xid for values with (modulus 2, remainder 1);

I tried out various combinations of the following query.  Each
equality clause is duplicated 6 times.  When I enable all 6 for each
of the 3 columns, I see 216 pruning steps.  That's 6*6*6, just what I
expected.

The IS NULL quals are not duplicated since we can only set a bit once
in the nullkeys.

explain select * from hash_xid where
a = '123'::xid and a = '123'::xid and a = '123'::xid and
a = '123'::xid and a = '123'::xid and a = '123'::xid and
--a is null and a is null and a is null and a is null and a is null and a is null and
b = '123'::xid and b = '123'::xid and b = '123'::xid and
b = '123'::xid and b = '123'::xid and b = '123'::xid and
--b is null and b is null and b is null and b is null and b is null and b is null and
c = '123'::xid and c = '123'::xid and c = '123'::xid and
c = '123'::xid and c = '123'::xid and c = '123'::xid;
--c is null and c is null and c is null and c is null and c is null and c is null;

Putting a breakpoint at the final line of
gen_prune_steps_from_opexps() yields 216 steps.

I didn't include anything of the above as part of the additional
tests. Perhaps something like that is worthwhile in a reduced form.
However, someone might make xid mergejoinable some time, which would
break the test.
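
For illustration, a reduced test could first assert that assumption so it
fails loudly rather than subtly if xid ever becomes mergejoinable (a
minimal sketch, assuming pg_operator's oprcanmerge keeps its current
meaning):

-- expect true; if false, the reduced pruning test needs revisiting
select count(*) = 0 as xid_not_mergejoinable
from pg_operator
where oprleft = 'xid'::regtype and oprcanmerge;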

Thanks for reviewing the previous version of this patch.

Onto the other run-time one now...

David




Re: New WAL record to detect the checkpoint redo location

2023-10-12 Thread Michael Paquier
On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote:
> - I combined what were previously 0002 and 0003 into a single patch,
> since that's how this would get committed.
> 
> - I fixed up some comments.
> 
> - I updated commit messages.
> 
> Hopefully this is getting close to good enough.

I have looked at 0001, for now..  And it looks OK to me.

+* Nonetheless, this case is simpler than the normal cases handled
+* above, which must check for changes in doPageWrites and RedoRecPtr.
+* Those checks are only needed for records that can contain
+* full-page images, and an XLOG_SWITCH record never does.
+Assert(fpw_lsn == InvalidXLogRecPtr);

Right, that's the core reason behind the refactoring.  The assertion
is a good idea.
--
Michael




Re: A new strategy for pull-up correlated ANY_SUBLINK

2023-10-12 Thread Andy Fan
Hi Alena,

On Thu, Oct 12, 2023 at 5:01 AM Alena Rybakina 
wrote:

> Hi!
>
> I reviewed your patch and it was interesting for me!
>
> Thank you for the explanation. It was really informative for me!
>
Thanks for your interest in this,  and I am glad to know it is informative.

> Unfortunately, I found a query where the sublink did not pull up, as in
> the examples above. I couldn't quite figure out why.
>
I'm not sure what you mean by "above"; I guess it should be "below"?


> explain (analyze, costs off, buffers)
> select b.x, b.x, a.y
> from b
> left join a
> on b.x=a.x and
>
> *b.t in (select max(a0.t) *
>  from a a0
>  where a0.x = b.x and
>        a0.t = b.t);
>
...

>SubPlan 2
>

Here the sublink can't be pulled up because it references the LHS of the
left join. The original logic is that no matter whether the 'b.t in ..'
returns true or false, the rows in the LHS will be returned.  If we
pulled it up, some rows in the LHS would be filtered out, which
would break the original semantics.

> I thought it would be:
>
> explain (analyze, costs off, buffers)
> select b.x, b.x, a.y
> from b
> left join a on
> b.x=a.x and
>
> *b.t = (select max(a0.t) *
>  from a a0
>  where a0.x = b.x and
>        a0.t <= b.t);
>  QUERY
> PLAN
>
> -
>  Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
>Hash Cond: (a.x = b.x)
>*Join Filter: (b.t = (SubPlan 2))*
>Buffers: shared hit=3546
>->  Seq Scan on a (actual time=0.022..17.109 rows=10 loops=1)
>  Buffers: shared hit=541
>->  Hash (actual time=1.065..1.068 rows=1000 loops=1)
>  Buckets: 4096  Batches: 1  Memory Usage: 72kB
>  Buffers: shared hit=5
>  ->  Seq Scan on b (actual time=0.049..0.401 rows=1000 loops=1)
>Buffers: shared hit=5
>SubPlan 2
>  ->  Result (actual time=0.025..0.025 rows=1 loops=1000)
>Buffers: shared hit=3000
>InitPlan 1 (returns $2)
>  ->  Limit (actual time=0.024..0.024 rows=1 loops=1000)
>Buffers: shared hit=3000
>->  Index Only Scan Backward using a_t_x_idx on a a0
> (actual time=0.023..0.023 rows=1 loops=1000)
>  Index Cond: ((t IS NOT NULL) AND (t <= b.t) AND
> (x = b.x))
>  Heap Fetches: 1000
>  Buffers: shared hit=3000
>  Planning Time: 0.689 ms
>  Execution Time: 68.220 ms
> (23 rows)
>
> If you noticed, it became possible after replacing the "in" operator with
> "=".
>
I didn't notice much difference between 'in' and '='; maybe I
missed something?

> I took the liberty of adding this to your patch and added myself as
> reviewer, if you don't mind.
>
Sure, the patch after your modification looks better than the original.
I'm not sure how the test case around "because of got one row" is
relevant to the current changes.  Once we reach agreement
on the above discussion, I think v4 is good for a committer to review!

-- 
Best Regards
Andy Fan


Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread Anton A. Melnikov

Greetings!

Found that the simple test pgbench -c20 -T20 -j8 gives approximately
for REL_15_STABLE at 5143f76:  336+-1 TPS
and
for REL_16_STABLE at 4ac7635f: 324+-1 TPS

The performance drop is approximately 3.5%, while the corrected standard
deviation is only 0.3%.
See the raw_data.txt attached.

What do you think, is there any cause for concern here?

And is it worth spending time bisecting for the commit where this degradation 
may have occurred?

Would be glad for any comments and concerns.

With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

raw_data.txt:
REL_15_STABLE at 5143f76, TPS
336.639765
334.376801
334.963121
336.23666
335.698673

REL_16_STABLE at 4ac7635f, TPS
324.373695
323.168622
323.728652
325.799901
324.81759



Re: Doc: Minor update for enable_partitionwise_aggregate

2023-10-12 Thread David Rowley
On Wed, 11 Oct 2023 at 19:38, David Rowley  wrote:
>
> On Wed, 11 Oct 2023 at 16:26, Andrew Atkinson  wrote:
> > "which allows grouping or aggregation on partitioned tables to be performed 
> > separately for each partition."
>
> This looks good to me. I can take care of this.

Pushed and backpatched to v11.

David




Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread David Rowley
On Thu, 12 Oct 2023 at 21:01, Anton A. Melnikov
 wrote:
>
> Greetings!
>
> Found that the simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
>
> And is it worth spending time bisecting for the commit where this degradation 
> may have occurred?

It would be interesting to know what's to blame here and if you can
attribute it to a certain commit.

David




Re: Tab completion for AT TIME ZONE

2023-10-12 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Fri, Apr 14, 2023 at 12:05:25PM +0200, Jim Jones wrote:
>> The patch applies cleanly and it does what it is proposing. - and it's IMHO
>> a very nice addition.
>> 
>> I've marked the CF entry as "Ready for Committer".
>
> +/* ... AT TIME ZONE ... */
> + else if (TailMatches("AT"))
> + COMPLETE_WITH("TIME ZONE");
> + else if (TailMatches("AT", "TIME"))
> + COMPLETE_WITH("ZONE");
> + else if (TailMatches("AT", "TIME", "ZONE"))
> + COMPLETE_WITH_TIMEZONE_NAME();
>
> This style will allow for the completion of timezone values even if "AT" is
> the first word of a query.  Shouldn't this be more selective by making
> sure that we are at least in the context of a SELECT query?

It's valid anywhere an expression is, which is a lot more places than
just SELECT queries.  Off the top of my head I can think of WITH,
INSERT, UPDATE, VALUES, CALL, CREATE TABLE, CREATE INDEX.

As I mentioned upthread, the only place in the grammar where the word AT
occurs is in AT TIME ZONE, so there's no ambiguity.  Also, it doesn't
complete time zone names after AT, it completes the literal words TIME
ZONE, and you have to then hit tab again to get a list of time zones.
If we (or the SQL committee) were to invent more operators that start
with the word AT, we can add those to the first if clause above and
complete with the appropriate values after each one separately.
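
For illustration, a few hedged examples of AT TIME ZONE outside a plain
SELECT, where the completion applies equally (the events table and its
columns are made up; ts is assumed to be timestamptz):

UPDATE events SET local_ts = ts AT TIME ZONE 'UTC';
INSERT INTO events (local_ts) VALUES (now() AT TIME ZONE 'Europe/Oslo');
CREATE INDEX ON events ((ts AT TIME ZONE 'UTC'));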

- ilmari




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-12 Thread Amit Kapila
On Wed, Oct 11, 2023 at 4:27 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thank you for reviewing! PSA new version.
>

Some more comments:
1. Let's restructure binary_upgrade_validate_wal_logical_end() a bit.
First, let's change its name to binary_upgrade_slot_has_pending_wal()
or something like that. Then move the context creation and free
related code into DecodingContextHasDecodedItems(). We can rename
DecodingContextHasDecodedItems() as
pg_logical_replication_slot_has_pending_wal() and place it in
slotfuncs.c. This will make the code structure similar to other slot
functions like pg_replication_slot_advance().

2. + * Returns true if there are no changes after the confirmed_flush_lsn.

How about something like: "Returns true if there are no decodable WAL
records after the confirmed_flush_lsn."?

3. Shouldn't we need to call CheckSlotPermissions() in
binary_upgrade_validate_wal_logical_end?

4.
+ /*
+ * Also, set processing_required flag if the message is not
+ * transactional. It is needed to notify the message's existence to
+ * the caller side. Usually, the flag is set when either the COMMIT or
+ * ABORT records are decoded, but this must be turned on here because
+ * the non-transactional logical message is decoded without waiting
+ * for these records.
+ */

The first sentence of the comments doesn't seem to be required as that
just says what the code does. So, let's slightly change it to: "We
need to set processing_required flag to notify the message's existence
to the caller side. Usually, the flag is set when either the COMMIT or
ABORT records are decoded, but this must be turned on here because the
non-transactional logical message is decoded without waiting for these
records."

-- 
With Regards,
Amit Kapila.




Special-case executor expression steps for common combinations

2023-10-12 Thread Daniel Gustafsson
The attached patch adds special-case expression steps for common sets of steps
in the executor to shave a few cycles off during execution, and make the JIT
generated code simpler.

* Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls of
  strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains used for
  > 2 arguments).
* Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the
  common case of one arg aggs.
* Replace EEOP_DONE with EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN to be able to
  skip extra setup for steps which are only interested in the side effects.

Stressing the EEOP_FUNCEXPR_STRICT_* steps specifically shows a 1.5%
improvement and pgbench over the branch shows a ~1% improvement in TPS (both
measured over 6 runs with outliers removed).

EEOP_FUNCEXPR_STRICT_* (10M iterations):
master  : (7503.317, 7553.691, 7634.524) 
patched : (7422.756, 7455.120, 7492.393)

pgbench:
master  : (3653.83, 3792.97, 3863.70)
patched : (3743.04, 3830.02, 3869.80)

This patch was extracted from a larger body of work from Andres [0] aiming at
providing the necessary executor infrastructure for making JIT expression
caching possible.  This patch, and more which are to be submitted, is however
separate in the sense that it is not part of the infrastructure; it's an
improvement on its own.

Thoughts?

--
Daniel Gustafsson

[0]: https://postgr.es/m/20191023163849.sosqbfs5yenoc...@alap3.anarazel.de



v1-0001-Add-fast-path-expression-steps-for-common-combina.patch


Re: cataloguing NOT NULL constraints

2023-10-12 Thread Alexander Lakhin

Hi Alvaro,

25.08.2023 14:38, Alvaro Herrera wrote:

I have now pushed this again.  Hopefully it'll stick this time.


I've discovered that that commit added several recursive functions, and
some of them are not protected from stack overflow.

Namely, with "max_locks_per_transaction = 600" and default ulimit -s (8192),
I observe server crashes with the following scripts:
# ATExecSetNotNull()
(n=4; printf "create table t0 (a int, b int);";
for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); 
"; done;
printf "alter table t0 alter b set not null;" ) | psql >psql.log

# dropconstraint_internal()
(n=2; printf "create table t0 (a int, b int not null);";
for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); 
"; done;
printf "alter table t0 alter b drop not null;" ) | psql >psql.log

# set_attnotnull()
(n=11; printf "create table tp (a int, b int, primary key(a, b)) partition by range (a); create table tp0 (a int 
primary key, b int) partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table tp$i partition of tp$(( $i - 1 )) for values from ($i) to (100) 
partition by range (a);"; done;
printf "alter table tp attach partition tp0 for values from (0) to (100);") | psql >psql.log # this takes half an 
hour on my machine


Maybe you would find it appropriate to add check_stack_depth() to these
functions.

(ATAddCheckNNConstraint() is protected because it calls
AddRelationNewConstraints(), which in turn calls StoreRelCheck() ->
CreateConstraintEntry() ->  recordDependencyOnSingleRelExpr() ->
find_expr_references_walker() ->  expression_tree_walker() ->
expression_tree_walker() -> check_stack_depth().)

(There were patches prepared for similar cases [1], but they don't cover new
functions, of course, and I'm not sure how to handle all such instances.)

[1] https://commitfest.postgresql.org/45/4239/

Best regards,
Alexander




Re: Logging parallel worker draught

2023-10-12 Thread Benoit Lobréau

On 10/11/23 17:26, Imseih (AWS), Sami wrote:

Thank you for resurrecting this thread.


Well, if you read Benoit's earlier proposal at [1] you'll see that he
does propose to have some cumulative stats; this LOG line he proposes
here is not a substitute for stats, but rather a complement.  I don't
see any reason to reject this patch even if we do get stats.


I believe both cumulative statistics and logs are needed. Logs excel in 
pinpointing specific queries at precise times, while statistics provide 
a broader overview of the situation. Additionally, I often encounter 
situations where clients lack pg_stat_statements and can't restart their 
production promptly.



Regarding the current patch, the latest version removes the separate GUC,
but the user should be able to control this behavior.


I created this patch in response to Amit Kapila's proposal to keep the 
discussion ongoing. However, I still favor the initial version with the 
GUCs.



Query text is logged when  log_min_error_statement > default level of "error".

This could be especially problematic when there is a query running more than 1 
Parallel
Gather node that is in draught. In those cases each node will end up
generating a log with the statement text. So, a single query execution could 
end up
having multiple log lines with the statement text.
...
I wonder if it will be better to accumulate the total # of workers planned and 
# of workers launched and
logging this information at the end of execution?


log_temp_files exhibits similar behavior when a query involves multiple 
on-disk sorts. I'm uncertain whether this is something we should or need 
to address. I'll explore whether the error message can be made more 
informative.


[local]:5437 postgres@postgres=# SET work_mem to '125kB';
[local]:5437 postgres@postgres=# SET log_temp_files TO 0;
[local]:5437 postgres@postgres=# SET client_min_messages TO log;
[local]:5437 postgres@postgres=# WITH a AS ( SELECT x FROM 
generate_series(1,1) AS F(x) ORDER BY 1 ) , b AS (SELECT x FROM 
generate_series(1,1) AS F(x) ORDER BY 1 ) SELECT * FROM a,b;
LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp138850.20", size 
122880 => First sort

LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp138850.19", size 14
LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp138850.23", size 14
LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp138850.22", size 
122880 => Second sort

LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp138850.21", size 14

--
Benoit Lobréau
Consultant
http://dalibo.com




Re: Use virtual tuple slot for Unique node

2023-10-12 Thread Ashutosh Bapat
On Tue, Oct 10, 2023 at 2:23 PM David Rowley  wrote:
>
> On Wed, 27 Sept 2023 at 20:01, David Rowley  wrote:
> >
> > On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas  wrote:
> > > So not a win in this case. Could you peek at the outer slot type, and
> > > use the same kind of slot for the Unique's result? Or some more
> > > complicated logic, like use a virtual slot if all the values are
> > > pass-by-val? I'd also like to keep this simple, though...
> > >
> > > Would this kind of optimization make sense elsewhere?
> >
> > There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
> >
> > If I adjust the patch to:
> >
> > -   ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> > +   ExecInitResultTupleSlotTL(&uniquestate->ps,
> > +
> > ExecGetResultSlotOps(outerPlanState(uniquestate),
> > +
> > NULL));
>
> Just to keep this from going cold, here's that in patch form for
> anyone who wants to test.

Thanks.

I don't recollect why we chose a MinimalTupleSlot here - maybe because
we expected the underlying node to always produce a minimal tuple. But
the Unique node copies the tuple returned by the underlying node. This
copy is carried out by the TupleTableSlot-specific copy function,
copyslot. Every implementation of this function first converts the
source slot tuple into the required form and then copies it. Having
both TupleTableSlots - the output slot of the underlying node and the
output slot of the Unique node - be of the same type avoids the first
step and just copies the slot. It makes sense that it performs better.
The code looks fine to me.

>
> I spent a bit more time running some more benchmarks and I don't see
> any workload where it slows things down.  I'd be happy if someone else
> had a go at finding a regression.

I built on your experiments and I might have found a minor regression.

Setup
=
drop table if exists t_int;
create table t_int(a int, b int);
insert into t_int select x, x from generate_series(1,100)x;
create index on t_int (a,b);
vacuum analyze t_int;

drop table if exists t_text;
create table t_text(a text, b text);
insert into t_text select lpad(x::text, 1000, '0'), x::text from
generate_series(1,100)x;
create index on t_text (a,b);
vacuum analyze t_text;

drop table if exists t_mixed; -- this one is new but it doesn't matter much
create table t_mixed(a text, b int);
insert into t_mixed select lpad(x::text, 1000, '0'), x from
generate_series(1,100)x;
create index on t_mixed (a,b);
vacuum analyze t_mixed;

Queries and measurements (average execution time from 3 runs - on my
Thinkpad T490)
==
Q1 select distinct a,b from t_int;
HEAD: 544.45 ms
patched: 381.55 ms

Q2 select distinct a,b from t_text
HEAD: 609.90 ms
patched: 513.42 ms

Q3 select distinct a,b from t_mixed
HEAD: 626.80 ms
patched: 468.22 ms

The more pass-by-reference data there is, the more memory is allocated,
which seems to reduce the gain from this patch.
The nodes above use Buffer or Heap TupleTableSlots.
Let's try some different nodes which output a minimal or virtual TTS.

set enable_hashagg to off;
Q4 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int) q
HEAD: 2529.58 ms
patched: 2332.23

Q5 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int order by a, b) q
HEAD: 2633.69 ms
patched: 2255.99 ms


Q6 select distinct a,b from (select string_agg(a, ', ') over (order by
a rows 2 preceding) a, b from t_text) q
HEAD: 108589.85 ms
patched: 107226.82 ms

Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 16070.62 ms
patched: 16182.16 ms

This one is surprising though. Maybe the advantage of using the same
tuple table slot is so narrow when large data needs to be copied that
the execution times almost match. The patched and unpatched execution
times differ by the margin of error either way.

-- 
Best Wishes,
Ashutosh Bapat




Re: Special-case executor expression steps for common combinations

2023-10-12 Thread Heikki Linnakangas

On 12/10/2023 12:48, Daniel Gustafsson wrote:

The attached patch adds special-case expression steps for common sets of steps
in the executor to shave a few cycles off during execution, and make the JIT
generated code simpler.

* Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls of
   strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains used for
   > 2 arguments).
* Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the
   common case of one arg aggs.


Are these relevant when JITting? I'm a little sad if the JIT compiler 
cannot unroll these on its own. Is there something we could do to hint 
it, so that it could treat the number of arguments as a constant?


I understand that this can give a small boost in interpreter mode, so 
maybe we should do it in any case. But I'd like to know if we're missing 
a trick with the JITter, before we mask it with this.



* Replace EEOP_DONE with EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN to be able to
   skip extra setup for steps which are only interested in the side effects.


I'm a little surprised if this makes a measurable performance 
difference, but sure, why not. It seems nice to be more explicit when 
you don't expect a return value.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: [PATCH] Compression dictionaries for JSONB

2023-10-12 Thread Aleksander Alekseev
Hi hackers,

I would like to continue discussing compression dictionaries.

> So I summarized the requirements we agreed on so far and ended up with
> the following list: [...]

Again, here is the summary of our current agreements, at least how I
understand them. Please feel free to correct me where I'm wrong.

We are going to focus on supporting the:


SET COMPRESSION lz4 [WITH|WITHOUT] DICTIONARY

... syntax for now. From the UI perspective the rest of the agreements
didn't change compared to the previous summary.

In the [1] discussion (cc: Robert) we agreed to use va_tag != 18 for
the on-disk TOAST pointer representation to make TOAST pointers
extendable. If va_tag has a different value (currently it's always
18), the TOAST pointer is followed by a utf8-like varint bitmask.
This bitmask determines the rest of the content of the TOAST pointer
and its overall size. This will allow extending TOAST pointers to
include a dictionary_id and also to extend them in the future, e.g. to
support ZSTD and other compression algorithms, use 64-bit TOAST
pointers, etc.

Several things occurred to me:

- Does anyone believe that va_tag should be part of the utf8-like
bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not
going to be used when data is compressed in-place (i.e. within a
tuple), since no TOAST pointer is involved in that case (see the sketch
after this list). Also we will be unable to add additional compression
algorithms here. Does anyone have a problem with this? Should we use the
reserved compression algorithm id instead as a marker of an extended
TOAST?

- It would be nice to decompose the feature into several independent
patches, e.g. modify TOAST first, then add compression dictionaries
without automatic update of the dictionaries, then add the automatic
update. I find it difficult to imagine, however, how to modify TOAST
pointers and test the code properly without a dependency on the larger
feature. Could anyone think of a trivial test case for extendable
TOAST? Maybe something we could add to src/test/modules similarly to
how we test SLRU, background workers, etc.
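
As a hedged illustration of the in-place case mentioned above (the table
and value are made up; pg_column_compression() reports the method even
for inline, non-TOASTed datums):

create table cd_demo (j jsonb);
-- large enough to trigger compression, compressible enough to stay inline
insert into cd_demo values (jsonb_build_object('k', repeat('x', 4000)));
select pg_column_compression(j) from cd_demo;  -- pglz or lz4; no TOAST pointer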

[1]: 
https://www.postgresql.org/message-id/flat/CAN-LCVMq2X%3Dfhx7KLxfeDyb3P%2BBXuCkHC0g%3D9GF%2BJD4izfVa0Q%40mail.gmail.com

-- 
Best regards,
Aleksander Alekseev




Re: Server crash on RHEL 9/s390x platform against PG16

2023-10-12 Thread Suraj Kharage
Here is the clang version:

[edb@9428da9d2137]$ clang --version

clang version 15.0.7 (Red Hat 15.0.7-2.el9)

Target: s390x-ibm-linux-gnu

Thread model: posix

InstalledDir: /usr/bin


Let me know if any further information is needed.

On Mon, Oct 9, 2023 at 8:21 AM Suraj Kharage 
wrote:

> It looks like an issue with JIT. If I disable the JIT then the above query
> runs successfully.
>
> postgres=# set jit to off;
>
> SET
>
> postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON
> rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON
> rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
>
>  pkey | val  | pkey |  label  | hidden | pkey | val | pkey
>
> --+--+--+-++--+-+--
>
> 1 | row1 |1 | hidden  | t  |1 |   1 |
>
> 1 | row1 |1 | hidden  | t  |2 |   1 |
>
> 2 | row2 |2 | visible | f  |1 |   1 |
>
> 2 | row2 |2 | visible | f  |2 |   1 |
>
> (4 rows)
>
> Any idea on this?
>
> On Mon, Sep 18, 2023 at 11:20 AM Suraj Kharage <
> suraj.khar...@enterprisedb.com> wrote:
>
>> Few more details on this:
>>
>> (gdb) p val
>> $1 = 0
>> (gdb) p i
>> $2 = 3
>> (gdb) f 3
>> #3  0x01a1ef70 in ExecCopySlotMinimalTuple (slot=0x202e4f8) at
>> ../../../../src/include/executor/tuptable.h:472
>> 472 return slot->tts_ops->copy_minimal_tuple(slot);
>> (gdb) p *slot
>> $3 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 8, tts_ops =
>> 0x1b6dcc8 , tts_tupleDescriptor = 0x202e0e8, tts_values =
>> 0x202e540, tts_isnull = 0x202e580, tts_mcxt = 0x1f54550, tts_tid =
>> {ip_blkid = {bi_hi = 65535,
>>   bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
>> (gdb) p *slot->tts_tupleDescriptor
>> $2 = {natts = 8, tdtypeid = 2249, tdtypmod = -1, tdrefcount = -1, constr
>> = 0x0, attrs = 0x202cd28}
>>
>> (gdb) p slot.tts_values[3]
>> $4 = 0
>> (gdb) p slot.tts_values[2]
>> $5 = 1
>> (gdb) p slot.tts_values[1]
>> $6 = 34027556
>>
>>
>> As per the result slot, it has a 0 value for the third attribute (column
>> label).
>> I'm testing this in a docker container and facing some issues with gdb,
>> hence I was not able to debug it further.
>>
>> Here is a explain plan:
>>
>> postgres=# explain (verbose, costs off) SELECT * FROM rm32044_t1 LEFT
>> JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN
>> rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by
>> rm32044_t1.pkey,label,hidden;
>>
>>  QUERY PLAN
>>
>>
>> -
>>  Incremental Sort
>>Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey,
>> rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val,
>> rm32044_t4.pkey
>>Sort Key: rm32044_t1.pkey, rm32044_t2.label, rm32044_t2.hidden
>>Presorted Key: rm32044_t1.pkey
>>->  Merge Left Join
>>  Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey,
>> rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val,
>> rm32044_t4.pkey
>>  Merge Cond: (rm32044_t1.pkey = rm32044_t2.pkey)
>>  ->  Sort
>>Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey,
>> rm32044_t1.pkey, rm32044_t1.val
>>Sort Key: rm32044_t1.pkey
>>->  Nested Loop
>>  Output: rm32044_t3.pkey, rm32044_t3.val,
>> rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val
>>  ->  Merge Left Join
>>Output: rm32044_t3.pkey, rm32044_t3.val,
>> rm32044_t4.pkey
>>Merge Cond: (rm32044_t3.pkey = rm32044_t4.pkey)
>>->  Sort
>>  Output: rm32044_t3.pkey, rm32044_t3.val
>>  Sort Key: rm32044_t3.pkey
>>  ->  Seq Scan on public.rm32044_t3
>>Output: rm32044_t3.pkey,
>> rm32044_t3.val
>>->  Sort
>>  Output: rm32044_t4.pkey
>>  Sort Key: rm32044_t4.pkey
>>  ->  Seq Scan on public.rm32044_t4
>>Output: rm32044_t4.pkey
>>  ->  Materialize
>>Output: rm32044_t1.pkey, rm32044_t1.val
>>->  Seq Scan on public.rm32044_t1
>>  Output: rm32044_t1.pkey, rm32044_t1.val
>>  ->  Sort
>>Output: rm32044_t2.pkey, rm32044_t2.label,
>> rm32044_t2.hidden
>>Sort Key: rm32044_t2.pkey
>>->  Seq Scan on public.rm32044_t2
>>  Output: rm32044_t2.pkey, rm32044_t2.label,
>> rm32044_t2.hidden
>> (34 rows)
>>
>>
>> It seems like while building the innerslot for merge join, the value for
>> attnum 1 is no

Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

2023-10-12 Thread Alexander Lakhin

Hello hackers,

While investigating the recent skink failure [1], I've reproduced this
failure under Valgrind on a slow machine and found that this happens due to
the last checkpoint recorded in the segment 2, that is removed in the test:
The failure log contains:
2023-10-10 19:10:08.212 UTC [2144251][startup][:0] LOG:  invalid checkpoint 
record
2023-10-10 19:10:08.214 UTC [2144251][startup][:0] PANIC:  could not locate a 
valid checkpoint record

The line above:
[19:10:02.701](318.076s) ok 1 - 000000010000000000000001 differs from
000000010000000000000002
tells us about the duration of previous operations (> 5 mins).

src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log:
2023-10-10 19:04:50.149 UTC [1845798][postmaster][:0] LOG:  database system is 
ready to accept connections
...
2023-10-10 19:09:49.131 UTC [1847585][checkpointer][:0] LOG: checkpoint 
starting: time
...
2023-10-10 19:10:02.058 UTC [1847585][checkpointer][:0] LOG: checkpoint 
complete: ... lsn=0/*2093980*, redo lsn=0/1F62760

And here is one more instance of this failure [2]:
2022-11-08 02:35:25.826 UTC [1614205][][:0] PANIC:  could not locate a valid 
checkpoint record
2022-11-08 02:35:26.164 UTC [1612967][][:0] LOG:  startup process (PID 1614205) 
was terminated by signal 6: Aborted

src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log:
2022-11-08 02:29:57.961 UTC [1546469][][:0] LOG:  database system is ready to 
accept connections
...
2022-11-08 02:35:10.764 UTC [1611737][][2/10:0] LOG:  statement: SELECT 
pg_walfile_name(pg_current_wal_insert_lsn())
2022-11-08 02:35:11.598 UTC [1546469][][:0] LOG:  received immediate shutdown 
request

The next successful run after the failure [1] shows the following duration:
[21:34:48.556](180.150s) ok 1 - 000000010000000000000001 differs from
000000010000000000000002
And the last successful run:
[03:03:53.892](126.206s) ok 1 - 000000010000000000000001 differs from
000000010000000000000002

So to fail this test, skink would have to perform at least twice as slowly as
usual. Maybe that's an extraordinary condition indeed, but on the other
hand, maybe it's worth increasing checkpoint_timeout as already done in
several tests (015_promotion_pages, 038_save_logical_slots_shutdown,
039_end_of_wal, ...).

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11

Best regards,
Alexander

pg_upgrade's interaction with pg_resetwal seems confusing

2023-10-12 Thread Amit Kapila
In pg_upgrade, we reset WAL archives (remove WAL), transaction id,
etc. in copy_xact_xlog_xid() for the new cluster. Then, we create new
objects in the new cluster, and again towards the end of the upgrade
we invoke pg_resetwal with the -o option to reset the next OID. Now,
along with resetting the OID, pg_resetwal will again reset the WAL. I
am not sure if that is intentional, and it may not cause any problem
today, except that it seems redundant to reset the WAL again.

However, this can be problematic for the ongoing work to upgrade the
logical replication slots [1]. We want to create/migrate logical slots
in the new cluster before calling the final pg_resetwal (which resets
the next OID) to ensure that there is no new WAL inserted by
background processes or otherwise between resetwal location and
creation of slots. So, we thought that we would compute the next WAL
location by doing the computation similar to what pg_resetwal does to
reset a new WAL location, create slots using that location, and pass
the same to pg_resetwal using the -l option. However, that doesn't
work because pg_resetwal uses the passed -l option only as a hint but
can reset the later WAL if present which can remove the WAL position
we have decided as restart_lsn (point to start reading WAL) for slots.
So, we came up with another idea: reset the WAL just
before creating slots, use that location to create the slots, and then
invent a new option in pg_resetwal so that it won't reset the WAL.

Now, as mentioned in the first paragraph, it seems we anyway don't
need to reset the WAL at the end when setting the next OID for the new
cluster with the -o option. If that is true, then I think even without
slots work it will be helpful to have such an option in pg_resetwal.

Thoughts?

[1] - https://commitfest.postgresql.org/45/4273/

-- 
With Regards,
Amit Kapila.




Re: Removing unneeded self joins

2023-10-12 Thread Alexander Korotkov
On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov
 wrote:
> On 4/10/2023 14:34, Alexander Korotkov wrote:
> >  > Relid replacement machinery is the most contradictory code here. We used
> >  > a utilitarian approach and implemented a simplistic variant.
> >
> >  > > 2) It would be nice to skip the insertion of IS NOT NULL checks when
> >  > > they are not necessary.  [1] points that infrastructure from [2] might
> >  > > be useful.  The patchset from [2] seems committed now.  However, I
> >  > > can't see it is directly helpful in this matter.  Could we just skip
> >  > > adding IS NOT NULL clause for the columns, that have
> >  > > pg_attribute.attnotnull set?
> >  > Thanks for the links, I will look into that case.
> To be more precise, in the attachment, you can find a diff to the main
> patch, which shows the volume of changes to achieve the desired behaviour.
> Some explains in regression tests shifted. So, I've made additional tests:
>
> DROP TABLE test CASCADE;
> CREATE TABLE test (a int, b int not null);
> CREATE UNIQUE INDEX abc ON test(b);
> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
> WHERE t1.b=t2.b;
> CREATE UNIQUE INDEX abc1 ON test(a,b);
> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
> WHERE t1.b=t2.b;
> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
> WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a);
> DROP INDEX abc1;
> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
> WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b);
>
> We have almost the results we wanted to have. But in the last explain
> you can see that nothing happened with the OR clause. We should use the
> expression mutator instead of the walker to handle such clauses. But it
> doesn't process the RestrictInfo node ... I'm inclined to put a solution
> to this issue off for a while.

OK.  I don't think it's worth eliminating IS NULL quals with this
complexity (at least at this stage of the work).

I made some improvements to the code: mostly new comments, grammar
corrections in existing comments, and small refactoring.

Also, I found that the suggestion from David Rowley [1] to qsort the
array of relations to find duplicates faster was still unaddressed.
I've implemented it.  That helps avoid quadratic complexity with a
large number of relations.

Also I've incorporated the improvements from Alena Rybakina, except the
one for skipping SJ removal when no SJ quals are found.  It's not yet
clear to me whether this check fixes some cases, but at least the
optimization got skipped in some useful cases (as you can see in the
regression tests).

Links
1. 
https://www.postgresql.org/message-id/CAKJS1f8ySSsBfooH3bJK7OD3LBEbDb99d8J_FtqDd6w50p-eAQ%40mail.gmail.com
2. 
https://www.postgresql.org/message-id/96f66ae3-df10-4060-9844-4c9633062cd3%40yandex.ru

--
Regards,
Alexander Korotkov


0001-Remove-useless-self-joins-v44.patch


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-12 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thanks for your suggestion! PSA new version.

> The other problem is that pg_resetwal removes all pre-existing WAL
> files which in this case could lead to the removal of the WAL file
> corresponding to restart_lsn. This is because at least the shutdown
> checkpoint record will be written after the creation of slots which
> could be in the new file used for restart_lsn. Then when we invoke
> pg_resetwal, it can remove that file.
> 
> One idea to deal with this could be to do the reset WAL stuff
> (FindEndOfXLOG(), KillExistingXLOG(), KillExistingArchiveStatus(),
> WriteEmptyXLOG()) in a separate function (say in pg_upgrade) and then
> create slots. If we do this, then we additionally need an option in
> pg_resetwal which skips resetting the WAL as that would have been done
> before creating the slots.

Based on the above idea, I made a new version of the patch in which some
functionality is exported from pg_resetwal. In this approach, pg_upgrade itself
removes WALs and then creates logical slots; pg_resetwal is then called with a
new option --no-switch, which avoids switching to a new WAL segment file. The
option is only used for the upgrade purpose, so it is not mentioned in the docs
or in usage(). This option would not be required if pg_resetwal -o did not
discard WAL records. Please see the forked thread [1].

We do not have to reserve a future restart_lsn while creating a slot, so the
binary function binary_upgrade_create_logical_replication_slot() was removed.

Another advantage of this approach is avoiding the call to
pg_log_standby_snapshot() after pg_resetwal. That call was needed for two
reasons, but both are resolved automatically:
  1) pg_resetwal removes all WAL files.
  2) Logical slots require a RUNNING_XACTS record for building a snapshot.
 
[1]: 
https://www.postgresql.org/message-id/CAA4eK1KRyPMiY4fW98qFofsYrPd87Oc83zDNxSeHfTYh_asdBg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


v49-0001-pg_upgrade-Allow-to-replicate-logical-replicatio.patch


RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-12 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thanks for reviewing! New patch is available at [1].

> 
> Some more comments:
> 1. Let's restruture binary_upgrade_validate_wal_logical_end() a bit.
> First, let's change its name to binary_upgrade_slot_has_pending_wal()
> or something like that. Then move the context creation and free
> related code into DecodingContextHasDecodedItems(). We can rename
> DecodingContextHasDecodedItems() as
> pg_logical_replication_slot_has_pending_wal() and place it in
> slotfuncs.c. This will make the code structure similar to other slot
> functions like pg_replication_slot_advance().

Seems clearer than mine. Fixed.

> 2. + * Returns true if there are no changes after the confirmed_flush_lsn.
> 
> How about something like: "Returns true if there are no decodable WAL
> records after the confirmed_flush_lsn."?

Fixed.

> 3. Shouldn't we need to call CheckSlotPermissions() in
> binary_upgrade_validate_wal_logical_end?

Added, but actually it is not needed. This is because only superusers can 
connect
to the server while upgrading. Please see below codes in InitPostgres().

```
if (IsBinaryUpgrade && !am_superuser)
{
ereport(FATAL,
(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 errmsg("must be superuser to connect in binary 
upgrade mode")));
}
```

> 4.
> + /*
> + * Also, set processing_required flag if the message is not
> + * transactional. It is needed to notify the message's existence to
> + * the caller side. Usually, the flag is set when either the COMMIT or
> + * ABORT records are decoded, but this must be turned on here because
> + * the non-transactional logical message is decoded without waiting
> + * for these records.
> + */
> 
> The first sentence of the comments doesn't seem to be required as that
> just says what the code does. So, let's slightly change it to: "We
> need to set processing_required flag to notify the message's existence
> to the caller side. Usually, the flag is set when either the COMMIT or
> ABORT records are decoded, but this must be turned on here because the
> non-transactional logical message is decoded without waiting for these
> records."

Fixed.

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866B0614F80CE9F5EF051BDF5D3A%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Special-case executor expression steps for common combinations

2023-10-12 Thread David Rowley
On Thu, 12 Oct 2023 at 22:54, Daniel Gustafsson  wrote:
> EEOP_FUNCEXPR_STRICT_* (10M iterations):
> master  : (7503.317, 7553.691, 7634.524)
> patched : (7422.756, 7455.120, 7492.393)
>
> pgbench:
> master  : (3653.83, 3792.97, 3863.70)
> patched : (3743.04, 3830.02, 3869.80)
>
> Thoughts?

Did any of these tests compile the expression with JIT?

If not, how does the performance compare for a query that JITs the expression?

David




Re: Problem, partition pruning for prepared statement with IS NULL clause.

2023-10-12 Thread David Rowley
On Mon, 9 Oct 2023 at 12:26, David Rowley  wrote:
>
> On Sat, 7 Oct 2023 at 03:11, Sergei Glukhov  wrote:
> > I noticed that combination of prepared statement with generic plan and
> > 'IS NULL' clause could lead partition pruning to crash.
>
> > Test case:
> > --
> > set plan_cache_mode to force_generic_plan;
> > prepare stmt AS select * from hp where a is null and b = $1;
> > explain execute stmt('xxx');
>
> Thanks for the detailed report and proposed patch.
>
> I think your proposed fix isn't quite correct.  I think the problem
> lies in InitPartitionPruneContext() where we assume that the list
> positions of step->exprs are in sync with the keyno.  If you look at
> perform_pruning_base_step() the code there makes a special effort to
> skip over any keyno when a bit is set in opstep->nullkeys.

I've now also pushed the fix for the incorrect logic for nullkeys in
ExecInitPruningContext().

I didn't quite find a test to make this work for v11. I tried calling
execute 5 times as we used to have to before the plan_cache_mode GUC
was added in v12, but the test case kept picking the custom plan. So I
ended up pushing v11 without any test.  This goes out of support in ~1
month, so I'm not too concerned about the lack of test.  I did do a
manual test to ensure it works with:

create table hp (a int, b text, c int) partition by hash (a, b);
create table hp0 partition of hp for values with (modulus 4, remainder 0);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp1 partition of hp for values with (modulus 4, remainder 1);
create table hp2 partition of hp for values with (modulus 4, remainder 2);

prepare hp_q1 (text) as select * from hp where a is null and b = $1;

(set breakpoint in choose_custom_plan() and have it return false when
we hit it.)

explain (costs off) execute hp_q1('xxx');

David




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-10-12 Thread Dilip Kumar
On Wed, Oct 11, 2023 at 5:57 PM Dilip Kumar  wrote:
>
> On Wed, Oct 11, 2023 at 4:34 PM Dilip Kumar  wrote:

> In my last email, I forgot to give the link from which I took the base
> patch for dividing the buffer pool into banks, so I'm giving it
> here [1].  And looking at this again, it seems that the idea of that
> patch was from Andrey M. Borodin, and the idea of the SLRU scale factor
> was introduced by Yura Sokolov and Ivan Lazarev.  Apologies for missing
> that in the first email.
>
> [1] https://commitfest.postgresql.org/43/2627/

In my last email I had just rebased the base patch. Now, while reading
through that patch, I realized that some refactoring was needed and some
unused functions were present, so I have removed those and also added
some comments.  I also did some refactoring of my own patches, so I'm
reposting the patch series.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0002-bank-wise-slru-locks.patch


v2-0003-Introduce-bank-wise-LRU-counter.patch


v2-0001-Divide-SLRU-buffers-into-banks.patch


Re: RFC: Logging plan of the running query

2023-10-12 Thread torikoshia

On 2023-10-11 16:22, Ashutosh Bapat wrote:

Like many others I think this feature is useful to debug a long running 
query.


Sorry for jumping late into this.

I have a few of high level comments


Thanks for your comments!


There is a lot of similarity between what this feature does and what
auto explain does. I see the code is also duplicated. There is some
merit in avoiding this duplication
1. we will get all the features of auto_explain automatically like
choosing a format (this was expressed somebody earlier in this
thread), setings etc.
2. avoid bugs. E.g your code switches context after ExplainState has
been allocated. These states may leak depending upon when this
function gets called.
3. Building features on top as James envisions will be easier.

Considering the similarity with auto_explain I wondered whether this
function should be part of auto_explain contrib module itself? If we
do that users will need to load auto_explain extension and thus
install executor hooks when this function doesn't need those. So may
not be such a good idea. I didn't see any discussion on this.


I once thought about adding this to auto_explain, but I left it as-is for
the reasons below:

- One of the typical use cases of pg_log_query_plan() would be analyzing
slow queries on customer environments. In such environments, we cannot
always control what extensions are installed.
  Of course auto_explain is a major extension and it is quite possible
that they have installed auto_explain, but it is also possible they have
not.
- It seems a bit counter-intuitive that pg_log_query_plan() is in an
extension called auto_explain, since it "manually" logs plans.


I tried the following query to pass the PID of a non-client backend to this
function.

#select pg_log_query_plan(pid), application_name, backend_type from
pg_stat_activity where backend_type = 'autovacuum launcher';
 pg_log_query_plan | application_name |backend_type
---+--+-
 t |  | autovacuum launcher
(1 row)
I see "LOG:  backend with PID 2733631 is not running a query or a
subtransaction is aborted" in server logs. That's ok. But may be we
should not send signal to these kinds of backends at all, thus
avoiding some system calls.


Agreed, it seems better.
The attached patch checks whether the backendType of the target process is
'client backend'.


  =# select pg_log_query_plan(pid), application_name, backend_type from  
pg_stat_activity where backend_type = 'autovacuum launcher';

  WARNING:  PID 63323 is not a PostgreSQL client backend process
   pg_log_query_plan | application_name |backend_type
  ---+--+-
   f |  | autovacuum launcher



I am also wondering whether it's better to report the WARNING as
status column in the output. E.g. instead of
#select pg_log_query_plan(100);
WARNING:  PID 100 is not a PostgreSQL backend process
 pg_log_query_plan
---
 f
(1 row)
we output
#select pg_log_query_plan(100);
 pg_log_query_plan |   status
---+-
 f | PID 100 is not a PostgreSQL backend process
(1 row)

That looks neater and can easily be handled by scripts, applications
and such. But it will be inconsistent with other functions like
pg_terminate_backend() and pg_log_backend_memory_contexts().


It seems neater, but it might be inconvenient because we could no longer
use it in a select list, as in the following query you wrote:


  #select pg_log_query_plan(pid), application_name, backend_type from
  pg_stat_activity where backend_type = 'autovacuum launcher';


I do share a concern that was discussed earlier. If a query is running
longer, there's something problematic with it. A diagnostic
intervention breaking it further would be unwelcome. James has run
experiments to shake this code for any loose breakages. He has not
found any. So may be we are good. And we wouldn't know about very rare
corner cases so easily without using it in the field. So fine with it.
If we could add some safety net that will be great but may not be
necessary for the first cut.


If there are candidates for the safety net, I'm willing to add them.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From b7902cf43254450cc7831c235982438ea1e5e8b7 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 12 Oct 2023 22:03:48 +0900
Subject: [PATCH v31] Add function to log the plan of the query

Currently, we have to wait for the query execution to finish
to check its plan. This is not so convenient when
investigating long-running queries on production environments
where we cannot use debuggers.
To improve this situation, this patch adds
pg_log_query_plan() function that requests to log the
plan of the specified backend process.

By default, only superusers are allowed to request to log the
plans because allowing any users to issue 

PostgreSQL domains and NOT NULL constraint

2023-10-12 Thread Erki Eessaar
Hello

PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way in
which one can add NULLs to a column that has a domain with the NOT NULL
constraint.
https://www.postgresql.org/docs/current/sql-createdomain.html

To me this seems very strange and amounts to a bug, because it defeats the
purpose of domains (to be reusable assets) and of constraints (to prevent any
bypassing of them).

Oracle 23c added support for domains
(https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/create-domain.html).
I tested the same scenario both in PostgreSQL and Oracle
(https://www.oracle.com/database/free/) and found that in these situations
Oracle does not allow NULLs to be added to the column. I do not know whether
the behaviour implemented in PostgreSQL is specified by the standard.
However, if it is not, then how can it be that Oracle can enforce this but
PostgreSQL cannot?

Best regards
Erki Eessaar

The scenario that I tested both in PostgreSQL (16) and Oracle (23c).
***
/*PostgreSQL 16*/

CREATE DOMAIN d_name VARCHAR(50) NOT NULL;

CREATE TABLE Product_state_type (product_state_type_code SMALLINT NOT NULL,
name d_name,
CONSTRAINT pk_product_state_type PRIMARY KEY (product_state_type_code),
CONSTRAINT ak_product_state_type_name UNIQUE (name));

CREATE TABLE Product (product_code INTEGER NOT NULL,
name d_name,
product_state_type_code SMALLINT NOT NULL,
CONSTRAINT pk_product PRIMARY KEY (product_code),
CONSTRAINT fk_product_product_state_type FOREIGN KEY (product_state_type_code)
REFERENCES Product_state_type(product_state_type_code) ON UPDATE CASCADE);

INSERT INTO Product_state_type (product_state_type_code, name)
VALUES (1, (SELECT name FROM Product_state_type WHERE FALSE));
/*Insertion succeeds, name is NULL!*/

INSERT INTO Product (product_code, name, product_state_type_code)
SELECT 1 AS product_code, Product.name, 1 AS product_state_type_code
FROM Product_state_type LEFT JOIN Product USING (product_state_type_code);
/*Insertion succeeds, name is NULL!*/
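
/*A hedged workaround sketch, not part of the original scenario: declaring
the column itself NOT NULL in addition to the domain makes PostgreSQL
reject these inserts, since the table-level constraint is checked even
where the domain check is bypassed.*/

CREATE TABLE Product_state_type2 (product_state_type_code SMALLINT NOT NULL,
name d_name NOT NULL, /*explicit column constraint on top of the domain*/
CONSTRAINT pk_product_state_type2 PRIMARY KEY (product_state_type_code),
CONSTRAINT ak_product_state_type2_name UNIQUE (name));

INSERT INTO Product_state_type2 (product_state_type_code, name)
VALUES (1, (SELECT name FROM Product_state_type WHERE FALSE));
/*Now fails: null value in column "name" violates not-null constraint*/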

/*Oracle 23c*/

CREATE DOMAIN d_name AS VARCHAR2(50) NOT NULL;

CREATE TABLE Product_state_type (product_state_type_code NUMBER(4) NOT NULL,
name d_name,
CONSTRAINT pk_product_state_type PRIMARY KEY (product_state_type_code),
CONSTRAINT ak_product_state_type_name UNIQUE (name));

CREATE TABLE Product (product_code NUMBER(8) NOT NULL,
name d_name,
product_state_type_code NUMBER(4) NOT NULL,
CONSTRAINT pk_product PRIMARY KEY (product_code),
CONSTRAINT fk_product_product_state_type FOREIGN KEY (product_state_type_code)
REFERENCES Product_state_type(product_state_type_code));


INSERT INTO Product_state_type (product_state_type_code, name)
VALUES (1, (SELECT name FROM Product_state_type WHERE FALSE));
/*Fails.
Error report -
SQL Error: ORA-01400: cannot insert NULL into
("SYSTEM"."PRODUCT_STATE_TYPE"."NAME")
Help: https://docs.oracle.com/error-help/db/ora-01400/
01400. 0 -  "cannot insert NULL into (%s)"
*Cause:An attempt was made to insert NULL into previously listed objects.
*Action:   These objects cannot accept NULL values.*/

INSERT INTO Product_state_type (product_state_type_code, name)
VALUES (1, 'Active');

INSERT INTO Product (product_code, name, product_state_type_code)
SELECT 1 AS product_code, Product.name, 1 AS product_state_type_code
FROM Product_state_type LEFT JOIN Product USING (product_state_type_code);
/*Fails.
SQL Error: ORA-01400: cannot insert NULL into
("SYSTEM"."PRODUCT"."NAME")
Help: https://docs.oracle.com/error-help/db/ora-01400/
01400. 0 -  "cannot insert NULL into (%s)"
*Cause:An attempt was made to insert NULL into previously listed objects.
*Action:   These objects cannot accept NULL values.*/



Re: Lowering the default wal_blocksize to 4K

2023-10-12 Thread Robert Haas
On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro  wrote:
> That leaves only the segments where a record starts exactly on the
> first usable byte of a segment, which is why I was trying to think of
> a way to cover that case too.  I suggested we could notice and insert
> a new record at that place.  But Andres suggests it would be too
> expensive and not worth worrying about.

Hmm. Even in that case, xl_prev has to match. It's not like it's the
wild west. Sure, it's not nearly as good of a cross-check, but it's
something. It seems to me that it's not worth worrying very much about
xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
you're doing that kind of advanced magic, you need to be careful
enough to not mess it up, and if we still cross-check once per
checkpoint cycle that's pretty good. I do worry a bit about the sysid
changing under us, though. It's not that hard to get your WAL archives
mixed up, and it'd be nice to catch that right away.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [RFC] Add jit deform_counter

2023-10-12 Thread Nazir Bilal Yavuz
Hi,

On Fri, 8 Sept 2023 at 20:22, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Fri, Sep 08, 2023 at 03:34:42PM +0200, Daniel Gustafsson wrote:
> > > > On 5 Sep 2023, at 16:37, Daniel Gustafsson  wrote:
> >
> > > I've gone over this version of the patch and I think it's ready to go
> > > in.  I'm marking this Ready for Committer and will go ahead with it
> > > shortly barring any objections.
> >
> > Pushed, after another round of review with some minor fixes.

I realized that pg_stat_statements was bumped to 1.11 with this patch
but the oldextversions test was not updated. So, I've attached a patch
updating oldextversions.

Regards,
Nazir Bilal Yavuz
Microsoft
From d3c63a5d68ed76257d110db6377bd3ec859d65a4 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Thu, 12 Oct 2023 14:44:38 +0300
Subject: [PATCH v1] Update oldextversions test for pg_stat_statements 1.11

pg_stat_statements 1.11 is introduced at 5a3423ad8e but oldextversions
test is not updated. Add missing pg_stat_statements 1.11 test case to
oldextversions test.
---
 .../expected/oldextversions.out   | 58 +++
 .../pg_stat_statements/sql/oldextversions.sql |  5 ++
 2 files changed, 63 insertions(+)

diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index efb2049ecff..64982aad601 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -250,4 +250,62 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
  t
 (1 row)
 
+-- New views for pg_stat_statements in 1.11
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.11';
+\d pg_stat_statements
+  View "public.pg_stat_statements"
+ Column |   Type   | Collation | Nullable | Default 
++--+---+--+-
+ userid | oid  |   |  | 
+ dbid   | oid  |   |  | 
+ toplevel   | boolean  |   |  | 
+ queryid| bigint   |   |  | 
+ query  | text |   |  | 
+ plans  | bigint   |   |  | 
+ total_plan_time| double precision |   |  | 
+ min_plan_time  | double precision |   |  | 
+ max_plan_time  | double precision |   |  | 
+ mean_plan_time | double precision |   |  | 
+ stddev_plan_time   | double precision |   |  | 
+ calls  | bigint   |   |  | 
+ total_exec_time| double precision |   |  | 
+ min_exec_time  | double precision |   |  | 
+ max_exec_time  | double precision |   |  | 
+ mean_exec_time | double precision |   |  | 
+ stddev_exec_time   | double precision |   |  | 
+ rows   | bigint   |   |  | 
+ shared_blks_hit| bigint   |   |  | 
+ shared_blks_read   | bigint   |   |  | 
+ shared_blks_dirtied| bigint   |   |  | 
+ shared_blks_written| bigint   |   |  | 
+ local_blks_hit | bigint   |   |  | 
+ local_blks_read| bigint   |   |  | 
+ local_blks_dirtied | bigint   |   |  | 
+ local_blks_written | bigint   |   |  | 
+ temp_blks_read | bigint   |   |  | 
+ temp_blks_written  | bigint   |   |  | 
+ blk_read_time  | double precision |   |  | 
+ blk_write_time | double precision |   |  | 
+ temp_blk_read_time | double precision |   |  | 
+ temp_blk_write_time| double precision |   |  | 
+ wal_records| bigint   |   |  | 
+ wal_fpi| bigint   |   |  | 
+ wal_bytes  | numeric  |   |  | 
+ jit_functions  | bigint   |   |  | 
+ jit_generation_time| double precision |   |  | 
+ jit_inlining_count | bigint   |   |  | 
+ jit_inlining_time  | double precision |   |  | 
+ jit_optimization_count | bigint   |   |  | 
+ jit_optimization_time  | double precision |   |  | 
+ jit_emission_count | bigint   |   |  | 
+ jit_emission_time  | double precision |   |  | 
+ jit_deform_count   | bigint   |

Re: [RFC] Add jit deform_counter

2023-10-12 Thread Daniel Gustafsson
> On 12 Oct 2023, at 15:37, Nazir Bilal Yavuz  wrote:
> 
> Hi,
> 
> On Fri, 8 Sept 2023 at 20:22, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>> 
>>> On Fri, Sep 08, 2023 at 03:34:42PM +0200, Daniel Gustafsson wrote:
>>>> On 5 Sep 2023, at 16:37, Daniel Gustafsson  wrote:
>>>
>>>> I've gone over this version of the patch and I think it's ready to go in.  I'm
>>>> marking this Ready for Committer and will go ahead with it shortly barring any
>>>> objections.
>>> 
>>> Pushed, after another round of review with some minor fixes.
> 
> I realized that pg_stat_statements is bumped to 1.11 with this patch
> but oldextversions test is not updated. So, I attached a patch for
> updating oldextversions.

Thanks for the patch, that was an oversight in the original commit for this.
From a quick look it seems correct, I'll have another look later today and will
then apply it.

--
Daniel Gustafsson





Re: Lowering the default wal_blocksize to 4K

2023-10-12 Thread Robert Haas
On Wed, Oct 11, 2023 at 6:11 PM Andres Freund  wrote:
> I think the question is what the point of the crosschecks in long page headers
> is. It's pretty easy to see what the point of the xlp_sysid check is - make it
> less likely to accidentally replay WAL from a different system.  It's much
> less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
> they are also in ControlFileData and the xlp_sysid check tied the control file
> and WAL file together.

Yeah, fair.

> Let me rephrase my point:
>
> If somebody uses a modified pg_resetwal to change the xlog block size, then
> tries to replay WAL from before that change, and is unlucky enough that the
> LSN looked for in a segment is the start of a valid record both before/after
> the pg_resetwal invocation, then yes, we might not catch that anymore if we
> remove the block size check. But the much much more common case is that the
> block size was *not* changed, in which case we *already* don't catch that
> pg_resetwal was invoked.

Hmm. Should we invent a mechanism just for that?

> ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
> a belt and suspenders check that is very unlikely to ever catch a mistake that
> wouldn't otherwise be caught.

I think that's probably right.

> I think that's what Thomas was proposing.  Thinking about it a bit more I'm
> not sure that having the data both in the checkpoint record itself and in
> XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

Yes. To me, having it in the redo record seems considerably more
valuable. Because that's where we're going to begin replay, so we
should catch most problems straight off. To escape detection at that
point, you need to not just be pointed at the wrong WAL archive, but
actually have files of diverse origin mixed together in the same WAL
archive. That's a less-likely error, and we still have some ways of
catching it if it happens.

> Which would make the code more efficient...

Right.

> As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
> anything, but that the protection by xlp_sysid is a bit more meaningful. So a
> compromise position could be to include xlp_sysid in the page header, possibly
> in a "chopped up" manner, as Matthias suggested.

I'm not that keen on the idea of storing the upper half and lower half
in alternate pages. That seems to me to add code complexity and
cognitive burden with little increased likelihood of catching real
problems. I'm not completely opposed to the idea if somebody wants to
make it happen, but I bet it would be better to either store the whole
thing or just cut it in half and store, say, the low-order bits.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Problem, partition pruning for prepared statement with IS NULL clause.

2023-10-12 Thread Sergei Glukhov




On 10/12/23 16:27, David Rowley wrote:
> I've now also pushed the fix for the incorrect logic for nullkeys in
> ExecInitPruningContext().

Thanks!

Regards,
Gluh





Re: Lowering the default wal_blocksize to 4K

2023-10-12 Thread Ants Aasma
On Thu, 12 Oct 2023 at 16:36, Robert Haas  wrote:

> On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro 
> wrote:
> > That leaves only the segments where a record starts exactly on the
> > first usable byte of a segment, which is why I was trying to think of
> > a way to cover that case too.  I suggested we could notice and insert
> > a new record at that place.  But Andres suggests it would be too
> > expensive and not worth worrying about.
>
> Hmm. Even in that case, xl_prev has to match. It's not like it's the
> wild west. Sure, it's not nearly as good of a cross-check, but it's
> something. It seems to me that it's not worth worrying very much about
> xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
> you're doing that kind of advanced magic, you need to be careful
> enough to not mess it up, and if we still cross-check once per
> checkpoint cycle that's pretty good. I do worry a bit about the sysid
> changing under us, though. It's not that hard to get your WAL archives
> mixed up, and it'd be nice to catch that right away.
>

This reminds me that xlp_tli is not being used to its full potential right
now either. We only check that it's not going backwards, but there is at
least one not very hard to hit way to get postgres to silently replay on
the wrong timeline. [1]

[1]
https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com
-- 

Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com


Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele




On 10/11/23 21:10, Michael Paquier wrote:
> On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
>> I'm planning to push 0002 (retries in frontend programs, which is
>> where this thread began) and 0004 (add missing locks to SQL
>> functions), including back-patches as far as 12, in a day or so.
>>
>> I'll abandon the others for now, since we're now thinking bigger[1]
>> for backups, side stepping the problem.
>
> FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
> OK to use it at least for now on HEAD before seeing where the other
> discussions lead.  0004 would be OK if applied to v11, as well, but I
> also agree that it is not a big deal to let this branch be as it is
> now at this stage if you feel strongly this way.

Agreed on 0002 and 0004, though I don't really think a back patch of
0004 to 11 is necessary. I'd feel differently if there was a single
field report of this issue.

I would prefer to hold off on applying 0003 to HEAD until we see how [1]
pans out.

Having said that, I have a hard time seeing [1] as being something we
could back patch. The manipulation of backup_label is simple enough, but
starting a cluster without pg_control is definitely going to change some
things. Also, the requirement that backup software skip copying
pg_control after a minor release is not OK.

If we do back patch 0001, is 0003 really needed? Surely if 0001 works
with other backup software it would work fine for pg_basebackup.


Regards,
-David

[1] 
https://www.postgresql.org/message-id/flat/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net





Re: The danger of deleting backup_label

2023-10-12 Thread David Steele

Hi Thomas,

On 10/11/23 18:10, Thomas Munro wrote:
> Even though I spent a whole bunch of time trying to figure out how to
> make concurrent reads of the control file sufficiently atomic for
> backups (pg_basebackup and low level filesystem tools), and we
> explored multiple avenues with varying results, and finally came up
> with something that basically works pretty well... actually I just
> hate all of that stuff, and I'm hoping to be able to just withdraw
> https://commitfest.postgresql.org/45/4025/ and chalk it all up to
> discovery/education and call *this* thread the real outcome of that
> preliminary work.
>
> So I'm +1 on the idea of putting a control file image into the backup
> label and I'm happy that you're looking into it.

Well, hopefully this thread will *at least* be the solution going
forward. Not sure about a back patch yet, see below...

> We could just leave the control file out of the base backup
> completely, as you said, removing a whole foot-gun.

That's the plan.

> People following
> the 'low level' instructions will still get a copy of the control file
> from the filesystem, and I don't see any reliable way to poison that
> file without also making it so that a crash wouldn't also be prevented
> from recovering.  I have wondered about putting extra "fingerprint"
> information into the control file such as the file's path and inode
> number etc, so that you can try to distinguish between a control file
> written by PostgreSQL, and a control file copied somewhere else, but
> that all feels too fragile, and at the end of the day, people
> following the low level backup instructions had better follow the low
> level backup instructions (hopefully via the intermediary of an
> excellent external backup tool).

Not sure about the inode idea, because it seems OK for people to move a
cluster elsewhere under a variety of circumstances. I do have an idea
about how to mark a cluster in "recovery to consistency" mode, but not
quite sure how to atomically turn that off at the end of recovery to
consistency. I have some ideas I'll work on though.

> As Stephen mentioned[1], we could perhaps also complain if both backup
> label and control file exist, and then hint that the user should
> remove the *control file* (not the backup label!).  I had originally
> suggested we would just overwrite the control file, but by explicitly
> complaining about it we would also bring the matter to tool/script
> authors' attention, ie that they shouldn't be backing that file up, or
> should be removing it in a later step if they copy everything.  He
> also mentions that there doesn't seem to be anything stopping us from
> back-patching changes to the backup label contents if we go this way.
> I don't have a strong opinion on that and we could leave the question
> for later.

I'm worried about the possibility of back patching this unless the
solution comes out to be simpler than I think, and that rarely comes to
pass. Surely throwing errors on something that is currently valid (i.e.
backup_label and pg_control both present) would not be acceptable in a
back patch.

But perhaps there is a simpler, acceptable solution we could back patch
(transparent to all parties except Postgres) and then a more advanced
solution we could go forward with.

I guess I had better get busy on this.

Regards,
-David

[1] 
https://www.postgresql.org/message-id/ZL69NXjCNG%2BWHCqG%40tamriel.snowman.net





Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Nikita Malakhov
Hi hackers!

Please advise on the idea of preserving pg_proc oids during pg_upgrade, in
a way like relfilenodes, type id and so on. What are possible downsides of
such a solution?

Thanks!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: The danger of deleting backup_label

2023-10-12 Thread David Steele




On 10/11/23 18:22, Michael Paquier wrote:
> On Tue, Oct 10, 2023 at 05:06:45PM -0400, David Steele wrote:
>> That fails because there is a check to make sure the checkpoint is valid
>> when pg_control is loaded. Another possibility is to use a special LSN like
>> we use for unlogged tables. Anything >= 24 and < WAL segment size will work
>> fine.
>
> Do we have any reason to do that in the presence of a backup_label
> file anyway?  We'll know the LSN of the checkpoint based on what the
> base backup wants us to use.  Using a fake-still-rather-valid value
> for the LSN in the control file to bypass this check does not address
> the issue you are pointing at: it is just avoiding this check.  A
> reasonable answer would be, IMO, to just not do this check at all
> based on the control file in this case.

Yeah, that's fair. And it looks like we are leaning towards excluding
pg_control from the backup entirely, so the point is probably moot.

>>> If the contents of the control file are tweaked before sending it
>>> through a BASE_BACKUP, it would cover more than just pg_basebackup.
>>> Switching the way the control file is sent with new contents in
>>> sendFileWithContent() rather than sendFile() would be one way, for
>>> instance..
>>
>> Good point, and that makes this even more compelling. If we include
>> pg_control into backup_label then there is no need to modify pg_control (as
>> above) -- we can just exclude it from the backup entirely. That will
>> certainly require some rejigging in recovery but seems worth it for backup
>> solutions that can't easily modify pg_control. The C-based solutions can do
>> this pretty easily but it is a pretty high bar for anyone else.
>
> I have little idea about that, but I guess that you are referring to
> backrest here.

Sure, pgBackRest, but there are other backup solutions written in C. My
point is really that we should not depend on backup solutions being able
to manipulate C structs. It looks like the solution we are working
towards would not require that.


Regards,
-David




Re: Eager page freeze criteria clarification

2023-10-12 Thread Robert Haas
Thanks for these notes.

On Wed, Oct 11, 2023 at 8:43 PM Andres Freund  wrote:
> - We also discussed an idea by Robert to track the number of times we need to
>   dirty a page when unfreezing and to compare that to the number of pages
>   dirtied overall (IIRC), but I don't think we really came to a conclusion
>   around that - and I didn't write down anything so this is purely from
>   memory.

See 
http://postgr.es/m/ca+tgmoycwisxlbl-pxu13oevthloxm20ojqjnrztkkhxsy9...@mail.gmail.com
for my on-list discussion of this.

> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that
> were frozen during the same vacuum" provides numerator / denominator for
> an "error rate"

I want to highlight the denominator issue here. I think we all have
the intuition that if we count the number of times that a
recently-frozen page gets unfrozen, and that's a big number, that's
bad, and a sign that we need to freeze less aggressively. But a lot of
the struggle has been around answering the question "big compared to
what". A lot of the obvious candidates fail to behave nicely in corner
cases, as discussed in the above email. I think this is one of the
better candidates so far proposed, possibly the best.

>   - This approach could provide "goals" for opportunistic freezing in a
> somewhat understandable way. E.g. aiming to rarely unfreeze data that has
> been frozen within 1h/1d/...

This strikes me as another important point. Making the behavior
understandable to users is going to be important, because sometimes
whatever system we might craft will misbehave, and then people are
going to need to be able to understand why it's misbehaving and how to
tune/fix it so it works.

> Around this point my laptop unfortunately ran out of battery. Possibly the
> attendees of this mini summit also ran out of steam (and tea).

When the tea is gone, there's little point in continuing.

> I likely mangled this substantially, both when taking notes during the lively
> discussion, and when revising them to make them a bit more readable.

I think it's quite a good summary, actually. Thanks!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Tom Lane
Nikita Malakhov  writes:
> Please advise on the idea of preserving pg_proc oids during pg_upgrade, in
> a way like relfilenodes, type id and so on. What are possible downsides of
> such a solution?

You have the burden of proof backwards.  That would add a great deal
of new mechanism, and you haven't provided even one reason why it'd
be worth doing.

regards, tom lane




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele

On 10/12/23 09:58, David Steele wrote:
>> On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
>>> I'm planning to push 0002 (retries in frontend programs, which is
>>> where this thread began) and 0004 (add missing locks to SQL
>>> functions), including back-patches as far as 12, in a day or so.
>>>
>>> I'll abandon the others for now, since we're now thinking bigger[1]
>>> for backups, side stepping the problem.
>>
>> FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
>> OK to use it at least for now on HEAD before seeing where the other
>> discussions lead.  0004 would be OK if applied to v11, as well, but I
>> also agree that it is not a big deal to let this branch be as it is
>> now at this stage if you feel strongly this way.
>
> Agreed on 0002 and 0004, though I don't really think a back patch of
> 0004 to 11 is necessary. I'd feel differently if there was a single
> field report of this issue.
>
> I would prefer to hold off on applying 0003 to HEAD until we see how [1]
> pans out.
>
> Having said that, I have a hard time seeing [1] as being something we
> could back patch. The manipulation of backup_label is simple enough, but
> starting a cluster without pg_control is definitely going to change some
> things. Also, the requirement that backup software skip copying
> pg_control after a minor release is not OK.

After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with
more advanced features (e.g. error on backup_label and pg_control both
present on initial cluster start) saved for HEAD.


Regards,
-David




Re: PostgreSQL domains and NOT NULL constraint

2023-10-12 Thread Tom Lane
Erki Eessaar  writes:
> PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way how 
> one can add NULL's to a column that has a domain with the NOT NULL constraint.
> https://www.postgresql.org/docs/current/sql-createdomain.html
> To me it seems very strange and amounts to a bug because it defeats the 
> purpose of domains (to be a reusable assets) and constraints (to avoid any 
> bypassing of these).

I doubt we'd consider doing anything about that.  The whole business
of domains with NOT NULL constraints is arguably a defect of the SQL
standard, because there are multiple ways to produce a value that
is NULL and yet must be considered to be of the domain type.
The subselect-with-no-output case that you show isn't even the most
common one; I'd say that outer joins where there are domain columns
on the nullable side are the biggest problem.
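
For instance, a minimal sketch of that outer-join case (table and domain
names are illustrative):

CREATE DOMAIN dnn AS integer NOT NULL;
CREATE TABLE parent (id integer PRIMARY KEY);
CREATE TABLE child (id integer, v dnn);
INSERT INTO parent VALUES (1);

-- c.v comes out NULL for the unmatched row, yet its declared type is
-- still the domain dnn, NOT NULL constraint notwithstanding:
SELECT p.id, c.v, pg_typeof(c.v) FROM parent p LEFT JOIN child c ON c.id = p.id;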

There's been some discussion of treating the output of such a join,
subselect, etc as being of the domain's base type not the domain
proper.  That'd solve this particular issue since then we'd decide
we have to cast the base type back up to the domain type (and hence
check its constraints) before inserting the row.  But that choice
just moves the surprise factor somewhere else, in that queries that
used to produce one data type now produce another one.  There are
applications that this would break.  Moreover, I do not think there's
any justification for it in the SQL spec.

Our general opinion about this is what is stated in the NOTES
section of our CREATE DOMAIN reference page [1]:

Best practice therefore is to design a domain's constraints so that a
null value is allowed, and then to apply column NOT NULL constraints
to columns of the domain type as needed, rather than directly to the
domain type.

regards, tom lane

[1] https://www.postgresql.org/docs/current/sql-createdomain.html




Re: Lowering the default wal_blocksize to 4K

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma  wrote:
> This reminds me that xlp_tli is not being used to its full potential right 
> now either. We only check that it's not going backwards, but there is at 
> least one not very hard to hit way to get postgres to silently replay on the 
> wrong timeline. [1]
>
> [1] 
> https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com

Maybe I'm missing something, but that seems mostly unrelated. What
you're discussing there is the server's ability to figure out when it
ought to perform a timeline switch. In other words, the server settles
on the wrong TLI and therefore opens and reads from the wrong
filename. But here, we're talking about the case where the server is
correct about the TLI and LSN and hence opens exactly the right file
on disk, but the contents of the file on disk aren't what they're
supposed to be due to a procedural error.

Said differently, I don't see how anything we could do with xlp_tli
would actually fix the problem discussed in that thread. That can
detect a situation where the TLI of the file doesn't match the TLI of
the pages inside the file, but it doesn't help with the case where the
server decided to read the wrong file in the first place.

But this does make me wonder whether storing xlp_tli and xlp_pageaddr
in every page is really worth the bit-space. That takes 12 bytes plus
any padding it forces us to incur, but the actual entropy content of
those 12 bytes must be quite low. In normal cases probably 7 or so of
those bytes are going to consist entirely of zero bits (TLI < 256,
LSN%8k ==  0, LSN < 2^40). We could probably find a way of jumbling
the LSN, TLI, and maybe some other stuff into an 8-byte quantity or
even perhaps a 4-byte quantity that would do about as good a job
catching problems as what we have now (e.g.
LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI)). In the event of a mismatch, the
value actually stored in the page header would be harder for humans to
understand, but I'm not sure that really matters here. Users should
mostly be concerned with whether a WAL file matches the cluster where
they're trying to replay it; forensics on misplaced or corrupted WAL
files should be comparatively rare.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra



On 9/20/23 11:53, Dilip Kumar wrote:
> On Wed, Aug 16, 2023 at 7:57 PM Tomas Vondra
>  wrote:
>>
> 
> I was reading through 0001, I noticed this comment in
> ReorderBufferSequenceIsTransactional() function
> 
> + * To decide if a sequence change should be handled as transactional or 
> applied
> + * immediately, we track (sequence) relfilenodes created by each transaction.
> + * We don't know if the current sub-transaction was already assigned to the
> + * top-level transaction, so we need to check all transactions.
> 
> It says "We don't know if the current sub-transaction was already
> assigned to the top-level transaction, so we need to check all
> transactions". But IIRC as part of the steaming of in-progress
> transactions we have ensured that whenever we are logging the first
> change by any subtransaction we include the top transaction ID in it.
> 

Yeah, that's a stale comment - the actual code only searched through the
top-level ones (and thus relying on the immediate assignment). As I
wrote in the earlier response, I suspect this code originates from
before I added the GetCurrentTransactionId() calls.

That being said, I do wonder why with the immediate assignments we still
need the bit in ReorderBufferAssignChild that says:

/*
 * We already saw this transaction, but initially added it to the
 * list of top-level txns.  Now that we know it's not top-level,
 * remove it from there.
 */
dlist_delete(&subtxn->node);

I don't think that affects this patch, but it's a bit confusing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra



On 9/13/23 15:18, Ashutosh Bapat wrote:
> On Fri, Aug 18, 2023 at 4:28 PM Amit Kapila  wrote:
>>
>> On Fri, Aug 18, 2023 at 10:37 AM Amit Kapila  wrote:
>>>
>>> On Thu, Aug 17, 2023 at 7:13 PM Ashutosh Bapat
>>>  wrote:

 On Wed, Aug 16, 2023 at 7:56 PM Tomas Vondra
  wrote:
>
>>
>> But whether or not that's the case, downstream should not request (and
>> hence receive) any changes that have been already applied (and
>> committed) downstream as a principle. I think a way to achieve this is
>> to update the replorigin_session_origin_lsn so that a sequence change
>> applied once is not requested (and hence sent) again.
>>
>
> I guess we could update the origin, per attached 0004. We don't have
> timestamp to set replorigin_session_origin_timestamp, but it seems we
> don't need that.
>
> The attached patch merges the earlier improvements, except for the part
> that experimented with adding a "fake" transaction (which turned out to
> have a number of difficult issues).

 0004 looks good to me.
>>>
>>>
>>> + {
>>>   CommitTransactionCommand();
>>> +
>>> + /*
>>> + * Update origin state so we don't try applying this sequence
>>> + * change in case of crash.
>>> + *
>>> + * XXX We don't have replorigin_session_origin_timestamp, but we
>>> + * can just leave that set to 0.
>>> + */
>>> + replorigin_session_origin_lsn = seq.lsn;
>>>
>>> IIUC, your proposal is to update the replorigin_session_origin_lsn, so
>>> that after restart, it doesn't use some prior origin LSN to start with
>>> which can in turn lead the sequence to go backward. If so, it should
>>> be updated before calling CommitTransactionCommand() as we are doing
>>> in apply_handle_commit_internal(). If that is not the intention then
>>> it is not clear to me how updating replorigin_session_origin_lsn after
>>> commit is helpful.
>>>
>>
>> typedef struct ReplicationState
>> {
>> ...
>> /*
>>  * Location of the latest commit from the remote side.
>>  */
>> XLogRecPtrremote_lsn;
>>
>> This is the variable that will be updated with the value of
>> replorigin_session_origin_lsn. This means we will now track some
>> arbitrary LSN location of the remote side in this variable. The above
>> comment makes me wonder if there is anything we are missing or if it
>> is just a matter of updating this comment because before the patch we
>> always adhere to what is written in the comment.
> 
> I don't think we are missing anything. This value is used to track the
> remote LSN up to which all the commits from upstream have been applied
> locally. Since a non-transactional sequence change is like a single
> WAL record transaction, its LSN acts as the LSN of the mini-commit.
> So it should be fine to update remote_lsn with sequence WAL record's
> end LSN. That's what the patches do. I don't see any hazard. But you
> are right, we need to update comments. Here and also at other places
> like
> replorigin_session_advance() which uses remote_commit as name of the
> argument which gets assigned to ReplicationState::remote_lsn.
> 

I agree - updating the replorigin_session_origin_lsn shouldn't break
anything. As you write, it's essentially a "mini-commit" and the commit
order remains the same.

I'm not sure about resetting replorigin_session_origin_timestamp to 0
though. It's not something we rely on very much (it may not correlate
with the commit order etc.). But why should we set it to 0? We don't do
that for regular commits, right? And IMO it makes sense to just use the
timestamp of the last commit before the sequence change.

FWIW I've left this in a separate commit, but I'll merge that into 0002
in the next patch version.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences, take 2

2023-10-12 Thread Tomas Vondra
On 7/25/23 12:20, Amit Kapila wrote:
> ...
>
> I have used the debugger to reproduce this as it needs quite some
> coordination. I just wanted to see if the sequence can go backward and
> didn't catch up completely before the sequence state is marked
> 'ready'. On the publisher side, I created a publication with a table
> and a sequence. Then did the following steps:
> SELECT nextval('s') FROM generate_series(1,50);
> insert into t1 values(1);
> SELECT nextval('s') FROM generate_series(51,150);
> 
> Then on the subscriber side with some debugging aid, I could find the
> values in the sequence shown in the previous email. Sorry, I haven't
> recorded each and every step but, if you think it helps, I can again
> try to reproduce it and share the steps.
> 

Amit, can you try to reproduce this backwards movement with the latest
version of the patch? I have tried triggering that (mis)behavior, but I
haven't been successful so far. I'm hesitant to declare it resolved, as
it's dependent on timing etc. and you mentioned it required quite some
coordination.


Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-12 Thread Tom Lane
Andres Freund  writes:
>> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
>>> I just did a git bisect run to discover when the failure documented
>>> in bug #18130 [1] started.  And the answer is commit 82a4edabd.

> Uh, huh.  The problem is that COPY uses a single BulkInsertState for multiple
> partitions. Which to me seems to run counter to the following comment:
>  *The caller can also provide a BulkInsertState object to optimize many
>  *insertions into the same relation.  This keeps a pin on the current
>  *insertion target page (to save pin/unpin cycles) and also passes a
>  *BULKWRITE buffer selection strategy object to the buffer manager.
>  *Passing NULL for bistate selects the default behavior.

> The reason this doesn't cause straight up corruption due to reusing a pin from
> another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin() and a
> call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
> insertion state, which is what leads to the errors from the bug report.

> Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
> sure that's the right fix. ISTM that independent of whether we fix this via
> ReleaseBulkInsertStatePin() resetting the fields or via not reusing
> BulkInsertState, we should add assertions defending against future issues like
> this (e.g. by adding a relation field to BulkInsertState in cassert builds,
> and asserting that the relation is the same as in prior calls unless
> ReleaseBulkInsertStatePin() has been called).

Ping?  We really ought to have a fix for this committed in time for
16.1.

regards, tom lane




Re: Eager page freeze criteria clarification

2023-10-12 Thread Melanie Plageman
On Wed, Oct 11, 2023 at 8:43 PM Andres Freund  wrote:
>
> Robert, Melanie and I spent an evening discussing this topic around
> pgconf.nyc. Here are, mildly revised, notes from that:

Thanks for taking notes!

>   The main thing we are worried about is repeated freezing / unfreezing of
>   pages within a relatively short time period.
>
> - Computing an average "modification distance" as I (Andres) proposed for
>   each page is complicated / "fuzzy"
>
>   The main problem is that it's not clear how to come up with a good number
>   for workloads that have many more inserts into new pages than modifications
>   of existing pages.
>
>   It's also hard to use average for this kind of thing, e.g. in cases where
>   new pages are frequently updated, but also some old data is updated, it's
>   easy for the updates to the old data to completely skew the average, even
>   though that shouldn't prevent us from freezing.
>
> - We also discussed an idea by Robert to track the number of times we need to
>   dirty a page when unfreezing and to compare that to the number of pages
>   dirtied overall (IIRC), but I don't think we really came to a conclusion
>   around that - and I didn't write down anything so this is purely from
>   memory.

I was under the impression that we decided we still had to consider
the number of clean pages dirtied as well as the number of pages
unfrozen. The number of pages frozen and unfrozen over a time period
gives us some idea of if we are freezing the wrong pages -- but it
doesn't tell us if we are freezing the right pages. A riff on an
earlier example by Robert:

While vacuuming a relation, we freeze 100 pages. During the same time
period, we modify 1,000,000 previously clean pages. Of these 1,000,000
pages modified, 90 were frozen. So we unfroze 90% of the pages frozen
during this time. Does this mean we should back off of trying to
freeze any pages in the relation?

> A rough sketch of a freezing heuristic:
...
> - Attributing "unfreezes" to specific vacuums would be powerful:
>
>   - "Number of pages frozen during vacuum" and "Number of pages unfrozen that
> were frozen during the same vacuum" provides numerator / denominator for
> an "error rate"
>
>   - We can perform this attribution by comparing the page LSN with recorded
> start/end LSNs of recent vacuums

While implementing a rough sketch of this, I realized I had a question
about this.

vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages.
vacuum 2 then starts at lsn 600.
5 frozen pages with page lsn > 10 and < 200 were updated. We count
those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and <
600 were updated. Do we count those somewhere?

>   - This approach could provide "goals" for opportunistic freezing in a
> somewhat understandable way. E.g. aiming to rarely unfreeze data that has
> been frozen within 1h/1d/...

Similar to the above question, if we are tracking pages frozen and
unfrozen during a time period, if there are many vacuums in quick
succession, we might care if a page was frozen by one vacuum and then
unfrozen during a subsequent vacuum if not too much time has passed.

- Melanie




Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-10-12 Thread Robert Haas
On Wed, Oct 4, 2023 at 8:07 AM Peter Geoghegan  wrote:
> If you're willing to take over as committer here, I'll let the issue
> of backpatching go.
>
> I only ask that you note why you've not backpatched in the commit message.

Will do, but see also the last point below.

I have looked over these patches in some detail and here are my thoughts:

- I find the use of the word "generate" in error messages slightly
odd. I think it's reasonable given the existing precedent, but the
word I would have picked is "assign", which I see is what Aleksander
actually had in v1. How would people feel about changing the two
existing messages that say "database is not accepting commands that
generate new MultiXactIds to avoid wraparound data loss ..." to use
"assign" instead, and then make the new messages match that?

- I think that 0002 needs a bit of wordsmithing. I will work on that.
In particular, I don't like this sentence: "It increases downtime,
makes monitoring impossible, disables replication, bypasses safeguards
against wraparound, etc." While there's nothing untrue there, it feels
more like a sentence from a pgsql-hackers email where most people
participating in the discussion understand the general contours of the
problem already than like polished documentation that really lays
things out methodically.

- I'm somewhat inclined to have a go at restructuring these patches a
bit so that some of the documentation changes can potentially be
back-patched without back-patching the message changes. Even if we
eventually decide to back-patch everything or nothing, there are
wording adjustments spread across all 3 patches that seem somewhat
independent of the changes to the server messages. I think it would be
clearer to have one patch that is mostly about documentation wording
changes, and a second one that is about changing the server messages
and then making documentation changes that are directly dependent on
those message changes. And I might also be inclined to back-patch the
former patch as far as it makes sense to do so, while leaving the
latter one master-only.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-10-12 Thread Peter Geoghegan
On Thu, Oct 12, 2023 at 8:54 AM Robert Haas  wrote:
> - I find the use of the word "generate" in error messages slightly
> odd. I think it's reasonable given the existing precedent, but the
> word I would have picked is "assign", which I see is what Aleksander
> actually had in v1. How would people feel about changing the two
> existing messages that say "database is not accepting commands that
> generate new MultiXactIds to avoid wraparound data loss ..." to use
> "assign" instead, and then make the new messages match that?

WFM.

> - I think that 0002 needs a bit of wordsmithing. I will work on that.
> In particular, I don't like this sentence: "It increases downtime,
> makes monitoring impossible, disables replication, bypasses safeguards
> against wraparound, etc." While there's nothing untrue there, it feels
> more like a sentence from a pgsql-hackers email where most people
> participating in the discussion understand the general contours of the
> problem already than like polished documentation that really lays
> things out methodically.

I agree.

> - I'm somewhat inclined to have a go at restructuring these patches a
> bit so that some of the documentation changes can potentially be
> back-patched without back-patching the message changes. Even if we
> eventually decide to back-patch everything or nothing, there are
> wording adjustments spread across all 3 patches that seem somewhat
> independent of the changes to the server messages. I think it would be
> clearer to have one patch that is mostly about documentation wording
> changes, and a second one that is about changing the server messages
> and then making documentation changes that are directly dependent on
> those message changes. And I might also be inclined to back-patch the
> former patch as far as it makes sense to do so, while leaving the
> latter one master-only.

No objections from me.

-- 
Peter Geoghegan




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-12 Thread Andres Freund
Hi,

On 2023-08-04 21:16:49 +0300, Melih Mutlu wrote:
> On Fri, 16 Jun 2023 at 17:03, Melih Mutlu wrote:
> 
> > With this change, here's a query to find how much space used by each
> > context including its children:
> >
> > > WITH RECURSIVE cte AS (
> > > SELECT id, total_bytes, id as root, name as root_name
> > > FROM memory_contexts
> > > UNION ALL
> > > SELECT r.id, r.total_bytes, cte.root, cte.root_name
> > > FROM memory_contexts r
> > > INNER JOIN cte ON r.parent_id = cte.id
> > > ),
> > > memory_contexts AS (
> > > SELECT * FROM pg_backend_memory_contexts
> > > )
> > > SELECT root as id, root_name as name, sum(total_bytes)
> > > FROM cte
> > > GROUP BY root, root_name
> > > ORDER BY sum DESC;
> >
> 
> Given that the above query to get total bytes including all children is
> still a complex one, I decided to add additional info in
> pg_backend_memory_contexts.
> The new "path" field displays an integer array that consists of ids of all
> parents for the current context. This way it's easier to tell whether a
> context is a child of another context, and we don't need to use recursive
> queries to get this info.

I think that does make it a good bit easier. Both to understand and to use.



> Here how pg_backend_memory_contexts would look like with this patch:
> 
> postgres=# SELECT name, id, parent, parent_id, path
> FROM pg_backend_memory_contexts
> ORDER BY total_bytes DESC LIMIT 10;
>   name   | id  |  parent  | parent_id | path
> -+-+--+---+--
>  CacheMemoryContext  |  27 | TopMemoryContext | 0 | {0}
>  Timezones   | 124 | TopMemoryContext | 0 | {0}
>  TopMemoryContext|   0 |  |   |
>  MessageContext  |   8 | TopMemoryContext | 0 | {0}
>  WAL record construction | 118 | TopMemoryContext | 0 | {0}
>  ExecutorState   |  18 | PortalContext|17 | {0,16,17}
>  TupleSort main  |  19 | ExecutorState|18 | {0,16,17,18}
>  TransactionAbortContext |  14 | TopMemoryContext | 0 | {0}
>  smgr relation table |  10 | TopMemoryContext | 0 | {0}
>  GUC hash table  | 123 | GUCMemoryContext |   122 | {0,122}
> (10 rows)

Would we still need the parent_id column?


> +
> + 
> +  
> +   context_id int4
> +  
> +  
> +   Current context id
> +  
> + 

I think the docs here need to warn that the id is ephemeral and will likely
differ in the next invocation.


> + 
> +  
> +   parent_id int4
> +  
> +  
> +   Parent context id
> +  
> + 
> +
> + 
> +  
> +   path int4
> +  
> +  
> +   Path to reach the current context from TopMemoryContext
> +  
> + 

Perhaps we should include some hint here how it could be used?
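
For example, a query along these lines (a sketch against the patch's
proposed columns; context id 18 is the ExecutorState context from the
sample output above) would sum a context and all of its children
without any recursive CTE:

SELECT sum(total_bytes)
FROM pg_backend_memory_contexts
WHERE id = 18 OR 18 = ANY(path);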


>  
> 
>
> diff --git a/src/backend/utils/adt/mcxtfuncs.c 
> b/src/backend/utils/adt/mcxtfuncs.c
> index 92ca5b2f72..81cb35dd47 100644
> --- a/src/backend/utils/adt/mcxtfuncs.c
> +++ b/src/backend/utils/adt/mcxtfuncs.c
> @@ -20,6 +20,7 @@
>  #include "mb/pg_wchar.h"
>  #include "storage/proc.h"
>  #include "storage/procarray.h"
> +#include "utils/array.h"
>  #include "utils/builtins.h"
>  
>  /* --
> @@ -28,6 +29,8 @@
>   */
>  #define MEMORY_CONTEXT_IDENT_DISPLAY_SIZE  1024
>  
> +static Datum convert_path_to_datum(List *path);
> +
>  /*
>   * PutMemoryContextsStatsTupleStore
>   *   One recursion level for pg_get_backend_memory_contexts.
> @@ -35,9 +38,10 @@
>  static void
>  PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
>                                   TupleDesc tupdesc, MemoryContext context,
> -                                 const char *parent, int level)
> +                                 const char *parent, int level, int *context_id,
> +                                 int parent_id, List *path)
>  {
> -#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS  9
> +#define PG_GET_BACKEND_MEMORY_CONTEXTS_COLS  12
>  
>   Datum   values[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
>   boolnulls[PG_GET_BACKEND_MEMORY_CONTEXTS_COLS];
> @@ -45,6 +49,7 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
>   MemoryContext child;
>   const char *name;
>   const char *ident;
> + int  current_context_id = (*context_id)++;
>  
>   Assert(MemoryContextIsValid(context));
>  
> @@ -103,13 +108,29 @@ PutMemoryContextsStatsTupleStore(Tuplestorestate *tupstore,
>   values[6] = Int64GetDatum(stat.freespace);
>   values[7] = Int64GetDatum(stat.freechunks);
>   values[8] = Int64GetDatum(stat.totalspace - stat.freespace);
> + values[9] = Int32GetDatum(current_context_id);
> +
> + if(parent_id < 0)
> +

Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 11:44:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> On 2023-09-25 15:42:26 -0400, Tom Lane wrote:
> >>> I just did a git bisect run to discover when the failure documented
> >>> in bug #18130 [1] started.  And the answer is commit 82a4edabd.
> 
> > Uh, huh.  The problem is that COPY uses a single BulkInsertState for
> > multiple partitions. Which to me seems to run counter to the following comment:
> >  *  The caller can also provide a BulkInsertState object to optimize many
> >  *  insertions into the same relation.  This keeps a pin on the current
> >  *  insertion target page (to save pin/unpin cycles) and also passes a
> >  *  BULKWRITE buffer selection strategy object to the buffer manager.
> >  *  Passing NULL for bistate selects the default behavior.
> 
> > The reason this doesn't cause straight up corruption due to reusing a pin
> > from another relation is that b1ecb9b3fcfb added ReleaseBulkInsertStatePin()
> > and a call to it. But I didn't make ReleaseBulkInsertStatePin() reset the bulk
> > insertion state, which is what leads to the errors from the bug report.
> 
> > Resetting the relevant BulkInsertState fields fixes the problem. But I'm not
> > sure that's the right fix. ISTM that independent of whether we fix this via
> > ReleaseBulkInsertStatePin() resetting the fields or via not reusing
> > BulkInsertState, we should add assertions defending against future issues
> > like this (e.g. by adding a relation field to BulkInsertState in cassert builds,
> > and asserting that the relation is the same as in prior calls unless
> > ReleaseBulkInsertStatePin() has been called).
> 
> Ping?  We really ought to have a fix for this committed in time for
> 16.1.

I kind of had hoped somebody would comment on the approach.  Given that nobody
has, I'll push the minimal fix of resetting the state in
ReleaseBulkInsertStatePin(), even though I think architecturally that's not
great.

Greetings,

Andres Freund




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Nikita Malakhov
Hi!

Say we have data processed by some user function and we want to keep a
reference to this function in our data. In this case we have two ways:
first, store the string output of regprocedure, which is not very
convenient; or second, store its OID, which requires a slight
modification of pg_upgrade (pg_dump and the function/procedure creation
code).

I've read previous threads about using regproc, and agree that this is
not a very good use case anyway, but I haven't found any serious
obstacles that forbid modifying pg_upgrade this way.
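
For illustration, a sketch of the two options with a hypothetical user
function f(integer):

CREATE FUNCTION f(integer) RETURNS integer LANGUAGE sql AS 'SELECT $1';

-- Option 1: store the textual identifier, which survives pg_upgrade:
SELECT 'f(integer)'::regprocedure::text;

-- Option 2: store the numeric OID, which pg_upgrade does not preserve today:
SELECT 'f(integer)'::regprocedure::oid;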

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Separate memory contexts for relcache and catcache

2023-10-12 Thread Andres Freund
Hi,

On 2023-08-09 15:02:31 +0300, Melih Mutlu wrote:
> To quickly show how pg_backend_memory_contexts would look like, I did the
> following:
> 
> -Create some tables:
> SELECT 'BEGIN;' UNION ALL SELECT format('CREATE TABLE %1$s(id serial
> primary key, data text not null unique)', 'test_'||g.i) FROM
> generate_series(0, 1000) g(i) UNION ALL SELECT 'COMMIT;';\gexec
> 
> -Open a new connection and query pg_backend_memory_contexts [1]:
> This is what you'll see before and after the patch.
> -- HEAD:
> name| used_bytes | free_bytes | total_bytes
> +++-
>  CacheMemoryContext | 467656 |  56632 |  524288
>  index info | 111760 |  46960 |  158720
>  relation rules |   4416 |   3776 |8192
> (3 rows)
> 
> -- Patch:
>  name  | used_bytes | free_bytes | total_bytes
> ---+++-
>  CatCacheMemoryContext | 217696 |  8 |  262144
>  RelCacheMemoryContext | 248264 |  13880 |  262144
>  index info| 111760 |  46960 |  158720
>  CacheMemoryContext|   2336 |   5856 |8192
>  relation rules|   4416 |   3776 |8192
> (5 rows)

Have you checked what the source of the remaining allocations in
CacheMemoryContext are?


One thing that I had observed previously and reproduced with this patch is
that the first backend starting after a restart uses considerably more memory:

first:
┌───┬┬┬─┐
│ name  │ used_bytes │ free_bytes │ total_bytes │
├───┼┼┼─┤
│ CatCacheMemoryContext │ 370112 │ 154176 │  524288 │
│ RelCacheMemoryContext │ 244136 │  18008 │  262144 │
│ index info│ 104392 │  45112 │  149504 │
│ CacheMemoryContext│   2304 │   5888 │8192 │
│ relation rules│   3856 │240 │4096 │
└───┴┴┴─┘

second:
┌───┬┬┬─┐
│ name  │ used_bytes │ free_bytes │ total_bytes │
├───┼┼┼─┤
│ CatCacheMemoryContext │ 215072 │  47072 │  262144 │
│ RelCacheMemoryContext │ 243856 │  18288 │  262144 │
│ index info│ 104944 │  47632 │  152576 │
│ CacheMemoryContext│   2304 │   5888 │8192 │
│ relation rules│   3856 │240 │4096 │
└───┴┴┴─┘

This isn't caused by this patch, but it does make it easier to pinpoint than
before.  The reason is fairly simple: On the first start we start without
being able to use relcache init files; in later starts we can. The reason the
size increase is in CatCacheMemoryContext, rather than RelCacheMemoryContext,
is simple: When using the init file the catcache isn't used, when not, we have
to query the catcache a lot to build the initial relcache contents.


Given the size of both CatCacheMemoryContext and RelCacheMemoryContext in a
new backend, I think it might be worth using non-default aset parameters. A
bit ridiculous to increase block sizes from 8k upwards in every single
connection made to postgres ever.


> - Run select on all tables
> SELECT format('SELECT count(*) FROM %1$s', 'test_'||g.i) FROM
> generate_series(0, 1000) g(i);\gexec
> 
> - Then check pg_backend_memory_contexts [1] again:
> --HEAD
> name| used_bytes | free_bytes | total_bytes
> +++-
>  CacheMemoryContext |8197344 | 257056 | 8454400
>  index info |2102160 | 113776 | 2215936
>  relation rules |   4416 |   3776 |8192
> (3 rows)
> 
> --Patch
>  name  | used_bytes | free_bytes | total_bytes
> ---+++-
>  RelCacheMemoryContext |4706464 |3682144 | 8388608
>  CatCacheMemoryContext |3489384 | 770712 | 4260096
>  index info|2102160 | 113776 | 2215936
>  CacheMemoryContext|   2336 |   5856 |8192
>  relation rules|   4416 |   3776 |8192
> (5 rows)
> 
> You can see that CacheMemoryContext does not use much memory without
> catcache and relcache (at least in cases similar to above), and it's easy
> to bloat catcache and relcache. That's why I think it would be useful to
> see their usage separately.

Yes, I think it'd be quite useful. There's ways to bloat particularly catcache
much further, and it's hard to differentiate that from other sources of bloat
right now.


> +static void
> +CreateCatCacheMemoryContext()

We typically use (void) to differentiate from an 

Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023 at 9:57 AM Nikita Malakhov  wrote:

> Say, we have data processed by some user function and we want to keep
> reference to this function
> in our data.
>

Then you need to keep the user-visible identifier of said function
(schema+name+input argument types - you'd probably want to incorporate
version into the name) in your user-space code.  Exposing runtime generated
oids to user-space is not something I can imagine the system supporting.
It goes against the very definition of "implementation detail" that
user-space code is not supposed to depend upon.

David J.


Re: Eager page freeze criteria clarification

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 11:50 AM Melanie Plageman
 wrote:
> I was under the impression that we decided we still had to consider
> the number of clean pages dirtied as well as the number of pages
> unfrozen. The number of pages frozen and unfrozen over a time period
> gives us some idea of if we are freezing the wrong pages -- but it
> doesn't tell us if we are freezing the right pages. A riff on an
> earlier example by Robert:
>
> While vacuuming a relation, we freeze 100 pages. During the same time
> period, we modify 1,000,000 previously clean pages. Of these 1,000,000
> pages modified, 90 were frozen. So we unfroze 90% of the pages frozen
> during this time. Does this mean we should back off of trying to
> freeze any pages in the relation?

I didn't think we decided the thing your first sentence says you
thought we decided ... but I also didn't think of this example. That
said, it might be fine to back off freezing in this case because we
weren't doing enough of it to make any real difference in the first
place. Maybe there's a more moderate example where it feels like a
bigger problem?

> >   - "Number of pages frozen during vacuum" and "Number of pages unfrozen 
> > that
> > were frozen during the same vacuum" provides numerator / denominator for
> > an "error rate"
> >
> >   - We can perform this attribution by comparing the page LSN with recorded
> > start/end LSNs of recent vacuums
>
> While implementing a rough sketch of this, I realized I had a question
> about this.
>
> vacuum 1 starts at lsn 10 and ends at lsn 200. It froze 100 pages.
> vacuum 2 then starts at lsn 600.
> 5 frozen pages with page lsn > 10 and < 200 were updated. We count
> those in vacuum 1's stats. 3 frozen pages with page lsn > 200 and <
> 600 were updated. Do we count those somewhere?

How did those pages get frozen when no VACUUM was running?

The LSN of the frozen page just prior to unfreezing it should be from
the operation that froze it, which should be some VACUUM. I think the
case you're talking about could happen if we did on-access freezing,
but today I believe we don't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Wait events for delayed checkpoints

2023-10-12 Thread Robert Haas
On Wed, Oct 11, 2023 at 9:13 PM Thomas Munro  wrote:
> You can't tell if your checkpointer is spending a lot of time waiting
> around for flags in delayChkptFlags to clear.  Trivial patch to add
> that.  I've managed to see it a few times when checkpointing
> repeatedly with a heavy pgbench workload.
>
> I had to stop and think for a moment about whether these events belong
> under "WaitEventIPC", "waiting for notification from another process"
> or under "WaitEventTimeout", "waiting for a timeout to expire".  I
> mean, both?  It's using sleep-and-poll instead of (say) a CV due to
> the economics, we want to make the other side as cheap as possible, so
> we don't care about making the checkpointer take some micro-naps in
> this case.  I feel like the key point here is that it's waiting for
> another process to do stuff and unblock it.

IPC seems right to me. Yeah, a timeout is being used, but as you say,
that's an implementation detail.

+1 for the idea, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: On login trigger: take three

2023-10-12 Thread Robert Haas
On Tue, Oct 10, 2023 at 3:43 PM Alexander Korotkov  wrote:
> Yep, in v43 it worked that way.  One transaction has to wait for
> another finishing update of pg_database tuple, then fails.  This is
> obviously ridiculous.  Since overlapping setters of flag will have to
> wait anyway, I changed lock mode in v44 for them to
> AccessExclusiveLock.  Now, waiting transaction then sees the updated
> tuple and doesn't fail.

Doesn't that mean that if you create the first login trigger in a
database and leave the transaction open, nobody can connect to that
database until the transaction ends?
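
For concreteness, a sketch of the scenario (assuming the patch's event
trigger syntax for the login event):

-- session 1: leaves the transaction, and thus the AccessExclusiveLock
-- taken on the pg_database row, open
BEGIN;
CREATE FUNCTION on_login_proc() RETURNS event_trigger
    LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'login'; END; $$;
CREATE EVENT TRIGGER on_login_trigger ON login
    EXECUTE FUNCTION on_login_proc();
-- no COMMIT yet

-- session 2: a new connection to this database would now block until
-- session 1 commits or rolls back, if the above reading is correct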

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Special-case executor expression steps for common combinations

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 13:24:27 +0300, Heikki Linnakangas wrote:
> On 12/10/2023 12:48, Daniel Gustafsson wrote:
> > The attached patch adds special-case expression steps for common sets of 
> > steps
> > in the executor to shave a few cycles off during execution, and make the JIT
> > generated code simpler.
> > 
> > * Adds EEOP_FUNCEXPR_STRICT_1 and EEOP_FUNCEXPR_STRICT_2 for function calls of
> >   strict functions with 1 or 2 arguments (EEOP_FUNCEXPR_STRICT remains used for
> >   > 2 arguments).
> > * Adds EEOP_AGG_STRICT_INPUT_CHECK_ARGS_1 which is a special case for the
> >   common case of one arg aggs.
> 
> Are these relevant when JITting? I'm a little sad if the JIT compiler cannot
> unroll these on its own. Is there something we could do to hint it, so that
> it could treat the number of arguments as a constant?

I think it's mainly important for interpreted execution.


> > * Splits EEOP_DONE into EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN to
> >   skip extra setup for steps which are only interested in the side effects.
> 
> I'm a little surprised if this makes a measurable performance difference,
> but sure, why not. It seems nice to be more explicit when you don't expect a
> return value.

IIRC this is more interesting for JIT than the above, because it allows LLVM
to know that the return value isn't needed and thus doesn't need to be
computed.

Greetings,

Andres Freund




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023 at 7:36 AM Tom Lane  wrote:

> Nikita Malakhov  writes:
> > Please advise on the idea of preserving pg_proc oids during pg_upgrade,
> in
> > a way like relfilenodes, type id and so on. What are possible downsides
> of
> > such a solution?
>
> You have the burden of proof backwards.  That would add a great deal
> of new mechanism, and you haven't provided even one reason why it'd
> be worth doing.
>
>
I was curious about the comment regarding type oids being copied over and I
found the commentary in pg_upgrade.c that describes which oids are copied
over and why, but the IMPLEMENTATION seems to be out-of-sync with the
actual implementation.

"""
It preserves the relfilenode numbers so TOAST and other references
to relfilenodes in user data is preserved.  (See binary-upgrade usage
in pg_dump). We choose to preserve tablespace and database OIDs as well.
"""

David J.


Re: New WAL record to detect the checkpoint redo location

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 3:27 AM Michael Paquier  wrote:
> I have looked at 0001, for now..  And it looks OK to me.

Cool. I've committed that one. Thanks for the review.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 10:35 AM Tom Lane  wrote:
> You have the burden of proof backwards.  That would add a great deal
> of new mechanism, and you haven't provided even one reason why it'd
> be worth doing.

"A great deal of new mechanism" seems like a slight exaggeration. We
preserve a bunch of kinds of OIDs already, and it wouldn't be any
harder to preserve this one than the ones we preserve already, or so I
think. So it would be some additional mechanism, but maybe not a great
deal.

As to whether it's a good idea, it isn't necessary for the system to
operate properly, so we didn't, but it's a judgement call whether it's
better for other reasons, like being able to have regprocedure columns
survive an upgrade, or making users less confused, or allowing
people supporting PostgreSQL to have an easier time debugging issues.
Personally, I've never been quite sure we made the right decision
there. I admit that I'm not particularly keen to try to add the amount
of mechanism that would be required to preserve every single OID
everywhere, but I also somehow feel like the fact that we don't is
pretty weird.

The pg_upgrade experience right now is a bit as if you woke up in the
morning and found that city officials came by during the night and
renumbered your house, thus changing your address. Then, they sent
change of address forms to everyone who ever mails you anything, plus
updated your address with your doctor's office and your children's
school. In a way, there's no problem: nothing has really changed for
you in any way that matters. Yet, I think that would feel pretty
uncomfortable if it actually happened to you, and I think the
pg_upgrade experience is uncomfortable in the same way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade's interaction with pg_resetwal seems confusing

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 7:17 AM Amit Kapila  wrote:
> Now, as mentioned in the first paragraph, it seems we anyway don't
> need to reset the WAL at the end when setting the next OID for the new
> cluster with the -o option. If that is true, then I think even without
> slots work it will be helpful to have such an option in pg_resetwal.
>
> Thoughts?

I wonder if we should instead provide a way to reset the OID counter
with a function call inside the database, gated by IsBinaryUpgrade.
Having something like pg_resetwal --but-dont-actually-reset-the-wal
seems both self-contradictory and vulnerable to abuse that we might be
better off not inviting.
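
Such a function might look roughly like this (a hypothetical sketch only;
the function name and the helper it calls are assumptions, not an existing
API):

    Datum
    binary_upgrade_set_next_oid(PG_FUNCTION_ARGS)
    {
        Oid         nextoid = PG_GETARG_OID(0);

        CHECK_IS_BINARY_UPGRADE;    /* errors out unless IsBinaryUpgrade */

        /* hypothetical helper to advance the OID counter */
        SetNextObjectId(nextoid);

        PG_RETURN_VOID();
    }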

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023, 11:21 Robert Haas  wrote:

>
> The pg_upgrade experience right now is a bit as if you woke up in the
> morning and found that city officials came by during the night and
> renumbered your house, thus changing your address. Then, they sent
> change of address forms to everyone who ever mails you anything, plus
> updated your address with your doctor's office and your children's
> school. In a way, there's no problem: nothing has really changed for
> you in any way that matters. Yet, I think that would feel pretty
> uncomfortable if it actually happened to you, and I think the
> pg_upgrade experience is uncomfortable in the same way.
>

It's more like a lot number or surveying tract than a postal address.
Useful for a single party, the builder or the government, but not something
you give out to other people so they can find you.

Whether or not we copy over oids should be done based upon our internal
needs, not end users'.  Which is why the few that do get copied exist:
because we store them in internal files that we want to copy as part of the
upgrade.  It also isn't like pg_dump/restore is going to retain them, and
the less divergence between that and pg_upgrade, arguably, the better.

David J.

>


Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 2:38 PM David G. Johnston
 wrote:
> It's more like a lot number or surveying tract than a postal address.
> Useful for a single party, the builder or the government, but not something
> you give out to other people so they can find you.
>
> Whether or not we copy over oids should be done based upon our internal
> needs, not end users'.  Which is why the few that do get copied exist:
> because we store them in internal files that we want to copy as part of the
> upgrade.  It also isn't like pg_dump/restore is going to retain them, and
> the less divergence between that and pg_upgrade, arguably, the better.

We build the product for the end users. Their desires and needs are
relevant. And if they're telling us we did it wrong, we need to listen
to that. We don't have to do everything that everybody wants, but
treating developer needs as strictly more important than end-user
needs is self-defeating.

I agree that there's a trade-off here. Preserving more OIDs requires
more code and makes pg_dump and other things more complicated, which
is not great. But, at least to me, arguing that there are no downsides
of not preserving these OIDs is simply not a believable argument.

Well, maybe somebody believes it. But I don't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023 at 11:43 AM Robert Haas  wrote:

> On Thu, Oct 12, 2023 at 2:38 PM David G. Johnston
>  wrote:
> > It's more like a lot number or surveying tract than a postal address.
> > Useful for a single party, the builder or the government, but not something
> > you give out to other people so they can find you.
> >
> > Whether or not we copy over oids should be done based upon our internal
> > needs, not end users'.  Which is why the few that do get copied exist:
> > because we store them in internal files that we want to copy as part of the
> > upgrade.  It also isn't like pg_dump/restore is going to retain them, and
> > the less divergence between that and pg_upgrade, arguably, the better.
>
> We build the product for the end users. Their desires and needs are
> relevant. And if they're telling us we did it wrong, we need to listen
> to that. We don't have to do everything that everybody wants, but
> treating developer needs as strictly more important than end-user
> needs is self-defeating.
>

Every catalog has both a natural and a surrogate key.  Developers get to
use the surrogate key while end-users get to use the natural one (i.e., the
one they provided).  I see no reason to change that specification.  And I
do believe there are no compelling reasons for an end-user to need to use
the surrogate key instead of the natural one.  The example provided by the
OP isn't one, IMO, the overall goal can be accomplished via the natural key
(if it cannot, maybe we need to make retrieving the natural key for a
pg_proc record given an OID easier).  The fact that OIDs are not even
accessible via SQL further reinforces this belief.  The only reason to
need OIDs as a DBA is to perform joins among the catalogs and all such
joins are local to the database and even session executing them - the
specific values are immaterial.
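
For what it's worth, mapping between the surrogate and the natural key is
already a one-liner in either direction; a sketch, using lower(text) from
a stock install:

    select 'lower(text)'::regprocedure::oid;   -- natural key to oid
    select 870::regprocedure;                  -- oid to natural key,
                                               -- assuming 870 is lower's
                                               -- oid on this install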

The behavior of pg_upgrade only preserving OIDs that are necessary due to
the physical copying of data files from the old server to the new one seems
sufficient both in terms of effort and the principle of doing the minimum
amount to solve the problem at hand.

David J.


building 32bit windows version

2023-10-12 Thread Dave Cramer
Greetings,

I've been running into challenges building 32 bit windows version. I
suspect there are no build farms and nobody really builds this.

The reason I need these is to be able to build 32 bit dll's for ODBC. At
one time EDB used to provide binaries but that doesn't appear to be the
case.

running build.bat in an x86 environment fails but that can be easily fixed
by adding

$ENV{CONFIG}="x86"; in build_env.pl

build postgres then works as advertised, however

install fails with

"Copying build output files...Could not copy release\zic\zic.exe to
postgres\bin\zic.exe"

Apparently 32 bit dlls are required. If there is an easier way to get
libpq.dll and the include files for building I'm all ears.


Dave Cramer


Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan  wrote:
> No objections from me.

Here is a doc-only patch that I think could be back-patched as far as
emergency mode exists. It combines all of the wording changes to the
documentation from v1-v3 of the previous version, but without changing
the message text that is quoted in the documentation, and without
adding more instances of similar message texts to the documentation,
and with a bunch of additional hacking by me. Some things I changed:

- I made it so that the MXID section refers back to the XID section
instead of duplicating it, but with a short list of differences.
- I weakened the existing claim that says you must be a superuser or
VACUUM definitely won't fix it to say instead that you SHOULD run
VACUUM as the superuser, because the former is false and the latter is
true.
- I made the list of steps for recovering more explicit.
- I split out the bit about running autovacuum in the affected
database into a separate step to be performed after VACUUM for
continued good operation, rather than a necessary ingredient in
recovery, because it isn't.
- A bit of other minor rejiggering.

I'm not forgetting about the rest of the proposed patch set, or the
change I proposed earlier. I'm just posting this much now because this
is how far I got today, and it would be useful to get comments before
I go further. I think the residual portion of the patch set not
included in this documentation patch will be quite small, and I think
that's a good thing, but again, I don't intend to blow that off.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v10-0001-Update-the-documentation-on-recovering-from-M-XI.patch
Description: Binary data


Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 3:36 PM David G. Johnston
 wrote:
> Every catalog has both a natural and a surrogate key.  Developers get to use 
> the surrogate key while end-users get to use the natural one (i.e., the one 
> they provided).  I see no reason to change that specification.

I agree with this.

> And I do believe there are no compelling reasons for an end-user to need to 
> use the surrogate key instead of the natural one.

But I disagree with this.

> The example provided by the OP isn't one, IMO, the overall goal can be 
> accomplished via the natural key (if it cannot, maybe we need to make 
> retrieving the natural key for a pg_proc record given an OID easier).  The 
> fact that OIDs are not even accessible via SQL further reinforces this 
> belief.  The only reason to need OIDs as a DBA is to perform joins among the 
> catalogs and all such joins are local to the database and even session 
> executing them - the specific values are immaterial.

This just all seems very simplistic to me. In theory it's true, but in
practice it isn't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Nikita Malakhov
Hi,

I've already implemented preserving pg_proc OIDs during pg_upgrade,
in the same way as relfilenodes etc.  It is actually quite simple, and at
first glance there are no problems.

About using the surrogate key - this feature is more for data generated by
the DBMS itself, i.e. data processed by some extension and saved
and re-processed automatically or at the user's request, but without
bothering the user with these internal keys.

The main question: are there pitfalls I am not aware of?

Thanks for your replies!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023 at 1:31 PM Nikita Malakhov  wrote:

> About using the surrogate key - this feature is more for data generated by
> the DBMS itself, i.e. data processed by some extension and saved
> and re-processed automatically or at the user's request, but without
> bothering the user with these internal keys.
>

Then what does it matter whether you spell it:

12345
or
my_ext.do_something(int)
?

Why do you require us to redefine the scope for which pg_proc.oid is useful
in order to implement this behavior?

Your extension breaks if your user uses logical backups or we otherwise
get into a position where pg_upgrade cannot be used to migrate in the
future.  Is avoiding the textual representation so necessary that you need
to add another dependency to the system?  That just seems unwise regardless
of how easy it may be to accomplish.

David J.


Re: [PATCH] Clarify the behavior of the system when approaching XID wraparound

2023-10-12 Thread Peter Geoghegan
On Thu, Oct 12, 2023 at 1:10 PM Robert Haas  wrote:
> On Thu, Oct 12, 2023 at 12:01 PM Peter Geoghegan  wrote:
> > No objections from me.
>
> Here is a doc-only patch that I think could be back-patched as far as
> emergency mode exists. It combines all of the wording changes to the
> documentation from v1-v3 of the previous version, but without changing
> the message text that is quoted in the documentation, and without
> adding more instances of similar message texts to the documentation,
> and with a bunch of additional hacking by me.

It's a bit weird that we're effectively saying "pay no attention to
that terrible HINT"...but I get it. The simple fact is that the docs
were written in a way that allowed misinformation to catch on -- the
damage that needs to be undone isn't exactly limited to the docs
themselves.

Your choice to not backpatch the changes to the log messages makes a
lot more sense now that I see the wider context built by
this preparatory patch. Arguably, it would be counterproductive to
pretend that we didn't make this mistake on the backbranches. Better
to own the mistake.

> Some things I changed:
>
> - I made it so that the MXID section refers back to the XID section
> instead of duplicating it, but with a short list of differences.
> - I weakened the existing claim that says you must be a superuser or
> VACUUM definitely won't fix it to say instead that you SHOULD run
> VACUUM as the superuser, because the former is false and the latter is
> true.
> - I made the list of steps for recovering more explicit.
> - I split out the bit about running autovacuum in the affected
> database into a separate step to be performed after VACUUM for
> continued good operation, rather than a necessary ingredient in
> recovery, because it isn't.
> - A bit of other minor rejiggering.

Those all make sense to me.

> I'm not forgetting about the rest of the proposed patch set, or the
> change I proposed earlier. I'm just posting this much now because this
> is how far I got today, and it would be useful to get comments before
> I go further. I think the residual portion of the patch set not
> included in this documentation patch will be quite small, and I think
> that's a good thing, but again, I don't intend to blow that off.

Of course. Your general approach seems wise.

Thanks for working on this. I will be relieved once this is finally
taken care of.

-- 
Peter Geoghegan




Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread Nikita Malakhov
Hi,

Textual representation requires a long text field because it could contain
a schema and arguments; it is difficult and inefficient to store as part of
the data, and it must be parsed to retrieve the function OID.  By using the
OID directly (actually, the value of a regprocedure field) we avoid that,
and the function can be retrieved by primary key.

Why can't pg_upgrade be used? OID preservation logic is already implemented
for several OIDs in catalog tables, like pg_class, type, relfilenode,
enum...  I've mentioned twice that this logic is already implemented and I
haven't encountered any problems with pg_upgrade.

Actually, I've asked here because there are several references to pg_proc
OIDs from other tables in the system catalog, so I was worried that this
logic could break something I do not know about.
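
The storage pattern in question, as a sketch (my_ext.do_something(int) is
assumed to exist):

    create table ext_handlers (handler regprocedure);
    insert into ext_handlers values ('my_ext.do_something(int)');

    -- fast lookup by oid, no parsing of a textual signature needed
    select * from pg_proc where oid = (select handler from ext_handlers);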

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Pro et contra of preserving pg_proc oids during pg_upgrade

2023-10-12 Thread David G. Johnston
On Thu, Oct 12, 2023 at 2:58 PM Nikita Malakhov  wrote:

> Why can't pg_upgrade be used?
>

We document both a pg_dump/pg_restore migration and a pg_upgrade one (not
to mention that logical backup and restore would cause the oids to
change).  It seems odd to have a feature that requires pg_upgrade to be the
chosen one.  pg_upgrade is an option, not a requirement.  Same goes for
pg_basebackup.

pg_upgrade itself warns that should the on-disk file format change then it
would be unusable - though I suspect that we'd end up with some kind of
hybrid approach in that case.


> OID preservation logic is already implemented
> for several OIDs in catalog tables, like pg_class, type, relfilenode,
> enum...
>
>
We are allowed to preserve oids if we wish but that doesn't mean we must,
nor does doing so constitute a declaration that such oids are part of
the public API.  And I don't see us making OIDs part of the public API
unless we modify pg_dump to include them in its output.


> Actually, I've asked here because there are several references to PG_PROC
> oids
> from other tables in the system catalog
>

Of course there are, e.g., views depending on functions would result in
those.  But pg_upgrade et al. recompute the views, so the changing of oids
isn't a problem.

Long text fields are common in databases; and if there are concerns with
parsing/interpretation we can add functions to make doing that simpler.

David J.


BRIN minmax multi - incorrect distance for infinite timestamp/date

2023-10-12 Thread Tomas Vondra
Hi,

Ashutosh Bapat reported to me off-list a possible issue in how BRIN
minmax-multi calculates distances for infinite timestamp/date values.

The current code does this:

if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
PG_RETURN_FLOAT8(0);

which means infinite values are "very close" to any other value, and thus
likely to be merged into a summary range. That's exactly the opposite of
what we want to do, possibly resulting in inefficient indexes.

Consider this example

create table test (a timestamptz) with (fillfactor=50);

insert into test
select (now() + ((1 * random())::int || ' seconds')::interval)
  from generate_series(1,100) s(i);

update test set a = '-infinity'::timestamptz where random() < 0.01;
update test set a = 'infinity'::timestamptz where random() < 0.01;

explain (analyze, timing off, costs off)
 select * from test where a = '2024-01-01'::timestamptz;

  QUERY PLAN

--
 Bitmap Heap Scan on test (actual rows=0 loops=1)
   Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
   Rows Removed by Index Recheck: 680662
   Heap Blocks: lossy=6024
   ->  Bitmap Index Scan on test_a_idx (actual rows=60240 loops=1)
 Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time
zone)
 Planning Time: 0.075 ms
 Execution Time: 106.871 ms
(8 rows)

Clearly, large part of the table gets scanned - this happens because
when building the index, we end up with ranges like this:


[-infinity,a,b,c,...,x,y,z,infinity]

and we conclude that distance for [-infinity,a] is 0, and we combine
these values into a range. And the same for [z,infinity]. But we should
do exactly the opposite thing - never merge those.

Attached is a patch fixing this, with which the plan looks like this:

  QUERY PLAN

--
 Bitmap Heap Scan on test (actual rows=0 loops=1)
   Recheck Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time zone)
   ->  Bitmap Index Scan on test_a_idx (actual rows=0 loops=1)
 Index Cond: (a = '2024-01-01 00:00:00+01'::timestamp with time
zone)
 Planning Time: 0.289 ms
 Execution Time: 9.432 ms
(6 rows)

Which seems much better.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Companydiff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index f8b2a3f9bc..c8775c274e 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -2084,8 +2084,14 @@ brin_minmax_multi_distance_date(PG_FUNCTION_ARGS)
 	DateADT		dateVal1 = PG_GETARG_DATEADT(0);
 	DateADT		dateVal2 = PG_GETARG_DATEADT(1);
 
+	/*
+	 * If either value is infinite, we treat the distance as infinite.
+	 * We deduplicate the values before calculating distances for them, so
+	 * either one value is finite, or the signs differ - so the
+	 * infinite distance is appropriate for both cases.
+	 */
 	if (DATE_NOT_FINITE(dateVal1) || DATE_NOT_FINITE(dateVal2))
-		PG_RETURN_FLOAT8(0);
+		PG_RETURN_FLOAT8(get_float8_infinity());
 
 	PG_RETURN_FLOAT8(dateVal1 - dateVal2);
 }
@@ -2141,8 +2147,14 @@ brin_minmax_multi_distance_timestamp(PG_FUNCTION_ARGS)
 	Timestamp	dt1 = PG_GETARG_TIMESTAMP(0);
 	Timestamp	dt2 = PG_GETARG_TIMESTAMP(1);
 
+	/*
+	 * If either value is infinite, we treat the distance as infinite.
+	 * We deduplicate the values before calculating distances for them, so
+	 * either one value is finite, or the signs differ - so the
+	 * infinite distance is appropriate for both cases.
+	 */
 	if (TIMESTAMP_NOT_FINITE(dt1) || TIMESTAMP_NOT_FINITE(dt2))
-		PG_RETURN_FLOAT8(0);
+		PG_RETURN_FLOAT8(get_float8_infinity());
 
 	delta = dt2 - dt1;
 


Re: On login trigger: take three

2023-10-12 Thread Alexander Korotkov
On Thu, Oct 12, 2023 at 8:35 PM Robert Haas  wrote:
> On Tue, Oct 10, 2023 at 3:43 PM Alexander Korotkov  
> wrote:
> > Yep, in v43 it worked that way.  One transaction has to wait for
> > another finishing update of pg_database tuple, then fails.  This is
> > obviously ridiculous.  Since overlapping setters of flag will have to
> > wait anyway, I changed lock mode in v44 for them to
> > AccessExclusiveLock.  Now, waiting transaction then sees the updated
> > tuple and doesn't fail.
>
> Doesn't that mean that if you create the first login trigger in a
> database and leave the transaction open, nobody can connect to that
> database until the transaction ends?

It doesn't mean that, because when trying to reset the flag v44 does a
conditional lock.  So, if another transaction is holding the lock we
will just skip resetting the flag.  So, the flag will be cleared on
the first connection after that transaction ends.
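
A sketch of that reset path (the helper that actually clears the
pg_database flag is hypothetical):

    LOCKTAG     tag;

    SET_LOCKTAG_OBJECT(tag, InvalidOid,     /* pg_database is shared */
                       DatabaseRelationId, MyDatabaseId, 0);

    if (LockAcquire(&tag, AccessExclusiveLock, false, true) !=
        LOCKACQUIRE_NOT_AVAIL)
        ClearDatHasLoginEventTriggers();    /* hypothetical helper */

    /* otherwise another session holds the lock: skip the reset and let
     * the first connection after that transaction ends clear the flag */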

--
Regards,
Alexander Korotkov




Re: Wait events for delayed checkpoints

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 01:32:29PM -0400, Robert Haas wrote:
> IPC seems right to me. Yeah, a timeout is being used, but as you say,
> that's an implementation detail.
> 
> +1 for the idea, too.

Agreed that timeout makes little sense in this context, and IPC looks
correct.

+    pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
     do
     {
         pg_usleep(10000L);    /* wait for 10 msec */
     } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
                                           DELAY_CHKPT_START));
+    pgstat_report_wait_end();

HaveVirtualXIDsDelayingChkpt() does immediately a LWLockAcquire()
which would itself report a wait event for ProcArrayLock, overwriting
this new one, no?
--
Michael




Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-12 Thread Noah Misch
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 11, 2023 at 11:38 AM Noah Misch  wrote:
> > Interesting.  So, >99% of interval-type indexes, even ones WITH
> > (deduplicate_items=off), will get amcheck failures.  The <1% of exceptions
> > might include indexes having allequalimage=off due to an additional column,
> > e.g. a two-column (interval, numeric) index.  If interval indexes are common
> > enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are 
> > relatively
> > rare, that could argue for giving amcheck a special case.  Specifically,
> > downgrade its "metapage incorrectly indicates that deduplication is safe" 
> > from
> > ERROR to WARNING for interval_ops only.
> 
> I am not aware of any user actually running "deduplicate_items = off"
> in production, for any index. It was added purely as a defensive thing
> -- not because I anticipated any real need to disable deduplication.
> Deduplication was optimized for being enabled by default.

Sure.  Low-importance background information: deduplicate_items=off got on my
radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off)
would clear allequalimage.  If it had, we could have advised people to use
ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck
--heapallindexed".  ALTER INDEX doesn't do that, ruling out that idea.

> > Without that special case (i.e. with
> > the v1 patch), the release notes should probably resemble, "After updating,
> > run REINDEX on all indexes having an interval-type column."
> 
> +1
> 
> > There's little
> > point in recommending pg_amcheck if >99% will fail.  I'm inclined to bet 
> > that
> > interval-type indexes are rare, so I lean against adding the amcheck special
> > case.  It's not a strong preference.  Other opinions?

> exactly one case like that post-fix (interval_ops is at least the only
> affected core code opfamily), so why not point that out directly with
> a HINT? A HINT could go a long way towards putting the problem in
> context, without really adding a special case, and without any real
> question of users being misled.

Works for me.  Added.
Author: Noah Misch 
Commit: Noah Misch 

Dissociate btequalimage() from interval_ops, ending its deduplication.

Under interval_ops, some equal values are distinguishable.  One such
pair is '24:00:00' and '1 day'.  With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function.  This can cause incorrect results from index-only scans.
Users should REINDEX any btree indexes having interval-type columns.
After updating, pg_amcheck will report an error for almost all such
indexes.  This fix makes interval_ops simply omit the support function,
like numeric_ops does.  Back-pack to v13, where btequalimage() first
appeared.  In back branches, for the benefit of old catalog content,
btequalimage() code will return false for type "interval".  Going
forward, back-branch initdb will include the catalog change.

Reviewed by Peter Geoghegan.

Discussion: https://postgr.es/m/20231011013317.22.nmi...@google.com

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d8..3e07a3e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -31,6 +31,7 @@
 #include "access/xact.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_opfamily_d.h"
 #include "commands/tablecmds.h"
 #include "common/pg_prng.h"
 #include "lib/bloomfilter.h"
@@ -338,10 +339,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 				 errmsg("index \"%s\" metapage has equalimage field set on unsupported nbtree version",
 						RelationGetRelationName(indrel))));
 	if (allequalimage && !_bt_allequalimage(indrel, false))
+	{
+		bool		has_interval_ops = false;
+
+		for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(indrel); i++)
+			if (indrel->rd_opfamily[i] == INTERVAL_BTREE_FAM_OID)
+				has_interval_ops = true;
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("index \"%s\" metapage incorrectly indicates that deduplication is safe",
-						RelationGetRelationName(indrel))));
+						RelationGetRelationName(indrel)),
+				 has_interval_ops
+				 ? errhint("This is known of \"interval\" indexes last built on a version predating 2023-11.")
+				 : 0));
+

Re: A new strategy for pull-up correlated ANY_SUBLINK

2023-10-12 Thread Alena Rybakina

On 12.10.2023 10:52, Andy Fan wrote:
>> Unfortunately, I found a request when sublink did not pull-up, as in the
>> examples above. I couldn't quite figure out why.
>
> I'm not sure what you mean with the "above", I guess it should be the
> "below"?

Yes, you are right)




explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a
        on b.x=a.x and
           b.t in (select max(a0.t)
                   from a a0
                   where a0.x = b.x and
                         a0.t = b.t);

...

   SubPlan 2


Here the sublink can't be pulled up because of its reference to
the LHS of the left join.  The original logic is that no matter whether
'b.t in ..' returns true or false, the rows in the LHS will be returned.
If we pull it up to the LHS, some rows in the LHS would be filtered out,
which breaks the original semantics.


Thanks for the explanation, it became more clear to me here.



I thought it would be:

explain (analyze, costs off, buffers)
select b.x, b.x, a.y
from b
    left join a on
        b.x=a.x and
        b.t = (select max(a0.t)
               from a a0
               where a0.x = b.x and
                     a0.t <= b.t);

QUERY PLAN

-
 Hash Right Join (actual time=1.181..67.927 rows=1000 loops=1)
   Hash Cond: (a.x = b.x)
   Join Filter: (b.t = (SubPlan 2))
   Buffers: shared hit=3546
   ->  Seq Scan on a (actual time=0.022..17.109 rows=10 loops=1)
 Buffers: shared hit=541
   ->  Hash (actual time=1.065..1.068 rows=1000 loops=1)
 Buckets: 4096  Batches: 1  Memory Usage: 72kB
 Buffers: shared hit=5
 ->  Seq Scan on b (actual time=0.049..0.401 rows=1000
loops=1)
   Buffers: shared hit=5
   SubPlan 2
 ->  Result (actual time=0.025..0.025 rows=1 loops=1000)
   Buffers: shared hit=3000
   InitPlan 1 (returns $2)
 ->  Limit (actual time=0.024..0.024 rows=1 loops=1000)
   Buffers: shared hit=3000
   ->  Index Only Scan Backward using a_t_x_idx on
a a0 (actual time=0.023..0.023 rows=1 loops=1000)
 Index Cond: ((t IS NOT NULL) AND (t <=
b.t) AND (x = b.x))
 Heap Fetches: 1000
 Buffers: shared hit=3000
 Planning Time: 0.689 ms
 Execution Time: 68.220 ms
(23 rows)

If you noticed, it became possible after replacing the "in"
operator with "=".

I didn't notice much difference between the 'in' and '=', maybe I
missed something?


It seems to me that the expressions "=" and "IN" are equivalent here
because the aggregated subquery returns only one value, and the result of
the "IN" operation can be seen as the intersection of the elements on the
left and right.  Here we have some set on the left, among which at most
one element on the right will be found.  In general, this expression can
be treated as b=const, so a push down is applied to b and we can filter b
during its scan using the subquery's result.

But I think your explanation is necessary here: this is all possible
because we can pull up the sublink, since filtering is applied on the
right side (the nullable side) and does not break the semantics of the
LHS.  In contrast, I also added two queries where pull-up is impossible,
and it is not done there; if the filtering were applied on the left, it
would be a mistake.


To be honest, I'm not sure if this explanation is needed in the test 
anymore, so I didn't add it.


explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT min(c.hundred) FROM tenk2 C WHERE c.odd = b.odd);
   QUERY PLAN
-
 Nested Loop Left Join
   Join Filter: (SubPlan 2)
   ->  Seq Scan on tenk1 a
   ->  Materialize
 ->  Seq Scan on tenk2 b
   SubPlan 2
 ->  Result
   InitPlan 1 (returns $1)
 ->  Limit
   ->  Index Scan using tenk2_hundred on tenk2 c
 Index Cond: (hundred IS NOT NULL)
 Filter: (odd = b.odd)
(12 rows)

explain (costs off)
SELECT * FROM tenk1 A LEFT JOIN tenk2 B
ON A.hundred in (SELECT count(c.hundred) FROM tenk2 C group by (c.odd));
    QUERY PLAN
---
 Nested Loop Left Join
   Join Filter: (hashed SubPlan 1)
   ->  Seq Scan on tenk1 a
   ->  Materialize
 ->  Seq Scan on tenk2 b
   SubPlan 1
 ->  HashAggregate
   Group Key: c.odd
   ->  Seq Scan on tenk2 c
(9 rows)



I took the liberty of adding this to your patch and added myself
as reviewer, if you

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
> After some more thought, I think we could massage the "pg_control in
> backup_label" method into something that could be back patched, with more
> advanced features (e.g. error on backup_label and pg_control both present on
> initial cluster start) saved for HEAD.

I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.
Backward-compatibility is not much of a concern as long as the backend
is involved.  The real problem here would be on the frontend side and
how much effort we should try to put in maintaining the code of
pg_basebackup compatible with older backends.
--
Michael




Re: interval_ops shall stop using btequalimage (deduplication)

2023-10-12 Thread Peter Geoghegan
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch  wrote:
> > exactly one case like that post-fix (interval_ops is at least the only
> > affected core code opfamily), so why not point that out directly with
> > a HINT? A HINT could go a long way towards putting the problem in
> > context, without really adding a special case, and without any real
> > question of users being misled.
>
> Works for me.  Added.

Looks good. Thanks!

-- 
Peter Geoghegan




Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-10-12 Thread David Rowley
On Wed, 11 Oct 2023 at 08:52, Tom Lane  wrote:
>
> David Rowley  writes:
> > I've attached a slightly more worked on patch that makes maxlen == 0
> > mean read-only.  Unsure if a macro is worthwhile there or not.
>
> A few thoughts:

Thank you for the review.

I spent more time on this and did end up with 2 new init functions as
you mentioned.  One for strictly read-only (initReadOnlyStringInfo),
which cannot be appended to, and as you mentioned, another
(initStringInfoFromString) which can accept a palloc'd buffer which
becomes managed by the stringinfo code. I know these names aren't
exactly as you mentioned. I'm open to adjusting still.

This means I got rid of the read-only conversion code in
enlargeStringInfo().  I didn't do anything to try to handle buffer
enlargement more efficiently in enlargeStringInfo() for the case where
initStringInfoFromString sets maxlen to some non-power-of-2.  The
doubling code seems like it'll work ok without power-of-2 values,
it'll just end up calling repalloc() with non-power-of-2 values.

I did also wonder if resetStringInfo() would have any business
touching the existing buffer in a read-only StringInfo and came to the
conclusion that it wouldn't be very read-only if we allowed
resetStringInfo() to do its thing on it. I added an Assert to fail if
resetStringInfo() receives a read-only StringInfo.
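
For readers following along, the two initializers could look roughly like
this (a sketch based on the description above, not necessarily the
committed form):

    /* wrap an existing buffer; appending is disallowed (maxlen == 0) */
    static inline void
    initReadOnlyStringInfo(StringInfo str, char *data, int len)
    {
        str->data = data;
        str->len = len;
        str->maxlen = 0;        /* read-only marker */
        str->cursor = 0;
    }

    /* take over a palloc'd, NUL-terminated buffer */
    static inline void
    initStringInfoFromString(StringInfo str, char *data, int len)
    {
        Assert(data[len] == '\0');
        str->data = data;
        str->len = len;
        str->maxlen = len + 1;
        str->cursor = 0;
    }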

Also, since it's still being discussed, I left out the adjustment to
LogicalParallelApplyLoop().  That also allows the tests to pass
without the failing Assert that was checking for the NUL terminator.

David
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index d52c8963eb..ce9d5b4059 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -879,6 +879,7 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 	/* Read the data */
 	for (i = 0; i < natts; i++)
 	{
+		char	   *buff;
 		char		kind;
 		int			len;
 		StringInfo	value = &tuple->colvalues[i];
@@ -899,19 +900,16 @@ logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple)
 				len = pq_getmsgint(in, 4);	/* read length */
 
 				/* and data */
-				value->data = palloc(len + 1);
-				pq_copymsgbytes(in, value->data, len);
+				buff = palloc(len + 1);
+				pq_copymsgbytes(in, buff, len);
 
 				/*
 				 * Not strictly necessary for LOGICALREP_COLUMN_BINARY, but
 				 * per StringInfo practice.
 				 */
-				value->data[len] = '\0';
+				buff[len] = '\0';
 
-				/* make StringInfo fully valid */
-				value->len = len;
-				value->cursor = 0;
-				value->maxlen = len;
+				initStringInfoFromString(value, buff, len);
 				break;
 			default:
 				elog(ERROR, "unrecognized data representation type '%c'", kind);
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 597947410f..b574188d70 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3582,10 +3582,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
 				/* Ensure we are reading the data into our memory context. */
 				MemoryContextSwitchTo(ApplyMessageContext);
 
-				s.data = buf;
-				s.len = len;
-				s.cursor = 0;
-				s.maxlen = -1;
+				initReadOnlyStringInfo(&s, buf, len);
 
 				c = pq_getmsgbyte(&s);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f3c9f1f9ba..94b963d6e6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1816,23 +1816,19 @@ exec_bind_message(StringInfo input_message)
 
 			if (!isNull)
 			{
-				const char *pvalue = pq_getmsgbytes(input_message, plength);
+				char	   *pvalue;
 
 				/*
-				 * Rather than copying data around, we just set up a phony
-				 * StringInfo pointing to the correct portion of the message
-				 * buffer.  We assume we can scribble on th

Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 02:00:00PM +0300, Alexander Lakhin wrote:
> So to fail on the test, skink should perform at least twice slower than
> usual, and maybe it's an extraordinary condition indeed, but on the other
> hand, maybe increase checkpoint_timeout as already done in several tests
> (015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...).
> 
> [1] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11
> [2] 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11

Thanks for the investigation.  Increasing the checkpoint timeout is
not a perfect science, but at least it would work until a machine shows
up that is slower than the new limit, so I would be OK with your
suggestion to raise the bar a bit more and prevent the race created by
these extra time-triggered checkpoints.
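
In TAP-test terms that would be something like the following (a sketch;
the exact value is a judgment call):

    # keep time-triggered checkpoints out of the test window on slow animals
    $node->append_conf('postgresql.conf', 'checkpoint_timeout = 1h');
    $node->restart;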
--
Michael




Re: LLVM 16 (opaque pointers)

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-11 21:59:50 +1300, Thomas Munro wrote:
> +#else
> + LLVMPassBuilderOptionsRef options;
> + LLVMErrorRef err;
> + int compile_optlevel;
> + char   *passes;
> +
> + if (context->base.flags & PGJIT_OPT3)
> + compile_optlevel = 3;
> + else
> + compile_optlevel = 0;
> +
> + passes = psprintf("default<O%d>,mem2reg,function(no-op-function),no-op-module",
> +   compile_optlevel);

I don't think the "function(no-op-function),no-op-module" bit does something
particularly useful?

I also don't think we should add the mem2reg pass outside of -O0 - running it
after a real optimization pipeline doesn't seem useful and might even make the
code worse? mem2reg is included in default (and obviously also in O3).
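
So presumably the pipeline string would reduce to something like this (a
sketch, assuming the angle-bracket level syntax of LLVM's new pass
builder, which the list archive seems to have eaten above):

    if (context->base.flags & PGJIT_OPT3)
        passes = pstrdup("default<O3>");
    else
        passes = pstrdup("default<O0>,mem2reg");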


Thanks for working on this stuff!


I'm working on setting up buildfarm animals for LLVM 16 and 17, each once
with a normal and once with an assertion-enabled LLVM build.

Greetings,

Andres Freund




Re: Test 026_overwrite_contrecord fails on very slow machines (under Valgrind)

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 14:00:00 +0300, Alexander Lakhin wrote:
> So to fail on the test, skink should perform at least twice slower than
> usual

The machine skink is hosted on runs numerous buildfarm animals (24 I think
right now, about to be 28). While it has plenty of resources (16 cores/32
threads, 128GB RAM), test runtime is still pretty variable depending on what
other tests are running at the same time...

Greetings,

Andres Freund




Re: Add support for AT LOCAL

2023-10-12 Thread Vik Fearing

On 10/10/23 05:34, Michael Paquier wrote:
> I am attaching a v5 that addresses the documentation bits, could you
> look at the business with date.c?

Here is a v6 which hopefully addresses all of your concerns.
--
Vik Fearing
From 042ce9b581ca3b17afbf229d209ca59addb6c9a2 Mon Sep 17 00:00:00 2001
From: Vik Fearing 
Date: Wed, 4 Oct 2023 15:46:38 +0100
Subject: [PATCH v6] Add support for AT LOCAL

When converting a timestamp to/from with/without time zone, the SQL
Standard specifies an AT LOCAL variant of AT TIME ZONE which uses the
session's time zone.
---
 doc/src/sgml/func.sgml| 103 +-
 src/backend/parser/gram.y |   7 ++
 src/backend/utils/adt/date.c  |  14 +++
 src/backend/utils/adt/ruleutils.c |  10 +++
 src/backend/utils/adt/timestamp.c |  20 +
 src/include/catalog/pg_proc.dat   |   9 ++
 src/test/regress/expected/timestamptz.out |  47 ++
 src/test/regress/expected/timetz.out  |  39 
 src/test/regress/sql/timestamptz.sql  |  21 +
 src/test/regress/sql/timetz.sql   |  17 
 10 files changed, 284 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f1ad64c3d6..ce62cb37b5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10604,42 +10604,46 @@ SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-0
 that date_bin can truncate to an arbitrary interval.

 

 The stride interval must be greater than zero and
 cannot contain units of month or larger.

   
 
   
-   AT TIME ZONE
+   AT TIME ZONE and AT LOCAL
 

 time zone
 conversion

 

 AT TIME ZONE

 
+   
+AT LOCAL
+   
+

 The AT TIME ZONE operator converts time
 stamp without time zone to/from
 time stamp with time zone, and
 time with time zone values to different time
 zones.  shows its
 variants.

 
 
- AT TIME ZONE Variants
+ AT TIME ZONE and AT LOCAL Variants
  
   

 
  Operator
 
 
  Description
 
 
@@ -10658,93 +10662,186 @@ SELECT date_bin('15 minutes', TIMESTAMP '2020-02-11 15:44:17', TIMESTAMP '2001-0
  Converts given time stamp without time zone to
  time stamp with time zone, assuming the given
  value is in the named time zone.
 
 
  timestamp '2001-02-16 20:38:40' at time zone 'America/Denver'
  2001-02-17 03:38:40+00
 

 
+   
+
+ timestamp without time zone AT LOCAL
+ timestamp with time zone
+
+
+ Converts given time stamp without time zone to
+ time stamp with the session's
+ TimeZone value as time zone.
+
+
+ timestamp '2001-02-16 20:38:40' at local
+ 2001-02-17 03:38:40+00
+
+   
+

 
  timestamp with time zone AT TIME ZONE zone
  timestamp without time zone
 
 
  Converts given time stamp with time zone to
  time stamp without time zone, as the time would
  appear in that zone.
 
 
  timestamp with time zone '2001-02-16 20:38:40-05' at time zone 'America/Denver'
  2001-02-16 18:38:40
 

 
+   
+
+ timestamp with time zone AT LOCAL
+ timestamp without time zone
+
+
+ Converts given time stamp with time zone to
+ time stamp without time zone, as the time would
+ appear with the session's TimeZone value as time zone.
+
+
+ timestamp with time zone '2001-02-16 20:38:40-05' at local
+ 2001-02-16 18:38:40
+
+   
+

 
  time with time zone AT TIME ZONE zone
  time with time zone
 
 
  Converts given time with time zone to a new time
  zone.  Since no date is supplied, this uses the currently active UTC
  offset for the named destination zone.
 
 
  time with time zone '05:34:17-05' at time zone 'UTC'
  10:34:17+00
 

+
+   
+
+ time with time zone AT LOCAL
+ time with time zone
+
+
+ Converts given time with time zone to a new time
+ zone.  Since no date is supplied, this uses the currently active UTC
+ offset for the session's TimeZone value.
+
+
+ Assuming the session's TimeZone is set to UTC:
+
+
+ time with time zone '05:34:17-05' at local
+ 10:34:17+00
+
+   
   
  
 
 

 In these expressions, the desired time zone zone can be
 specified either as a text value (e.g., 'America/Los_Angeles')
 or as an interval (e.g., INTERVAL '-08:00').
 In the text case, a ti

Re: CHECK Constraint Deferrable

2023-10-12 Thread Vik Fearing

On 10/10/23 15:12, Robert Haas wrote:
> On Mon, Oct 9, 2023 at 5:07 PM David G. Johnston
>  wrote:
>>> 2. I don't think it's a good idea for the same patch to try to solve
>>> two problems unless they are so closely related that solving one
>>> without solving the other is not sensible.
>>
>> A NOT NULL constraint apparently is just a special case of a check
>> constraint which seems closely related enough to match your definition.
>
> Yes, that might be true. I suppose I'd like to hear from the patch
> author(s) about that. I'm somewhat coming around to your idea that
> maybe both should be covered together, but I'm not the one writing the
> patch.


Álvaro Herrera has put (and is still putting) immense effort into 
turning NOT NULL into a CHECK constraint.


Honestly, I don't see why the two patches need to be combined.
--
Vik Fearing





Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 09:20:36PM +1300, David Rowley wrote:
> It would be interesting to know what's to blame here and if you can
> attribute it to a certain commit.

+1.
--
Michael




Re: SQL:2011 application time

2023-10-12 Thread Vik Fearing

On 10/11/23 05:47, Paul Jungwirth wrote:
>> +SELECT pg_get_indexdef(conindid, 0, true) FROM pg_constraint WHERE conname = 'temporal_rng_pk';
>> +    pg_get_indexdef
>> +---
>> + CREATE UNIQUE INDEX temporal_rng_pk ON temporal_rng USING gist (id, valid_at)
>>
>> Shouldn't this somehow show the operator classes for the columns?  We
>> are using different operator classes for the id and valid_at columns,
>> aren't we?
>
> We only print the operator classes if they are not the default, so they
> don't appear here.
>
> I do suspect something more is desirable though. For exclusion
> constraints we replace everything before the columns with just "EXCLUDE
> USING gist". I could embed WITHOUT OVERLAPS but it's not valid syntax in
> CREATE INDEX. Let me know if you have any ideas.

Why not?  The standard does not mention indexes (although some
discussions last week might change that) so we can change the syntax for
it as we wish.  Doing so would also allow us to use ALTER TABLE ...
USING INDEX for such things.

--
Vik Fearing





Re: Tab completion for AT TIME ZONE

2023-10-12 Thread Vik Fearing

On 10/12/23 10:27, Dagfinn Ilmari Mannsåker wrote:
> Michael Paquier  writes:
>> On Fri, Apr 14, 2023 at 12:05:25PM +0200, Jim Jones wrote:
>>> The patch applies cleanly and it does what it is proposing. - and it's
>>> IMHO a very nice addition.
>>>
>>> I've marked the CF entry as "Ready for Committer".
>>
>> +/* ... AT TIME ZONE ... */
>> +	else if (TailMatches("AT"))
>> +		COMPLETE_WITH("TIME ZONE");
>> +	else if (TailMatches("AT", "TIME"))
>> +		COMPLETE_WITH("ZONE");
>> +	else if (TailMatches("AT", "TIME", "ZONE"))
>> +		COMPLETE_WITH_TIMEZONE_NAME();
>>
>> This style will force the completion of timezone values even if "AT" is
>> the first word of a query.  Shouldn't this be more selective by making
>> sure that we are at least in the context of a SELECT query?
>
> It's valid anywhere an expression is, which is a lot more places than
> just SELECT queries.  Off the top of my head I can think of WITH,
> INSERT, UPDATE, VALUES, CALL, CREATE TABLE, CREATE INDEX.
>
> As I mentioned upthread, the only place in the grammar where the word AT
> occurs is in AT TIME ZONE, so there's no ambiguity.  Also, it doesn't
> complete time zone names after AT, it completes the literal words TIME
> ZONE, and you have to then hit tab again to get a list of time zones.
> If we (or the SQL committee) were to invent more operators that start
> with the word AT, we can add those to the first if clause above and
> complete with the appropriate values after each one separately.


Speaking of this...

The SQL committee already has another operator starting with AT which is 
AT LOCAL.  I am implementing it in 
https://commitfest.postgresql.org/45/4343/ where I humbly admit that I 
did not think of psql tab completion at all.


These two patches are co-dependent and whichever goes in first the other 
will need to be adjusted accordingly.

--
Vik Fearing





Re: On login trigger: take three

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 6:54 PM Alexander Korotkov  wrote:
> On Thu, Oct 12, 2023 at 8:35 PM Robert Haas  wrote:

> > Doesn't that mean that if you create the first login trigger in a
> > database and leave the transaction open, nobody can connect to that
> > database until the transaction ends?
>
> It doesn't mean that, because when trying to reset the flag v44 does a
> conditional lock.  So, if another transaction is holding the lock we
> will just skip resetting the flag.  So, the flag will be cleared on
> the first connection after that transaction ends.

But in the scenario I am describing the flag is being set, not reset.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Wait events for delayed checkpoints

2023-10-12 Thread Robert Haas
On Thu, Oct 12, 2023 at 7:09 PM Michael Paquier  wrote:
> On Thu, Oct 12, 2023 at 01:32:29PM -0400, Robert Haas wrote:
> > IPC seems right to me. Yeah, a timeout is being used, but as you say,
> > that's an implementation detail.
> >
> > +1 for the idea, too.
>
> Agreed that timeout makes little sense in this context, and IPC looks
> correct.
>
> +    pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
>      do
>      {
>          pg_usleep(10000L);    /* wait for 10 msec */
>      } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
>                                            DELAY_CHKPT_START));
> +    pgstat_report_wait_end();
>
> HaveVirtualXIDsDelayingChkpt() does immediately a LWLockAcquire()
> which would itself report a wait event for ProcArrayLock, overwriting
> this new one, no?

Ah, right: the wait event should be set and cleared around pg_usleep,
not the whole loop.
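
I.e. something like this (a sketch):

    do
    {
        pgstat_report_wait_start(WAIT_EVENT_CHECKPOINT_DELAY_START);
        pg_usleep(10000L);      /* wait for 10 msec */
        pgstat_report_wait_end();
    } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
                                          DELAY_CHKPT_START));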

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PostgreSQL domains and NOT NULL constraint

2023-10-12 Thread Vik Fearing

On 10/12/23 15:54, Tom Lane wrote:
> Erki Eessaar  writes:
>> PostgreSQL's CREATE DOMAIN documentation (section Notes) describes a way
>> how one can add NULLs to a column that has a domain with the NOT NULL
>> constraint.
>> https://www.postgresql.org/docs/current/sql-createdomain.html
>> To me it seems very strange and amounts to a bug because it defeats the
>> purpose of domains (to be reusable assets) and constraints (to avoid any
>> bypassing of these).
>
> I doubt we'd consider doing anything about that.  The whole business
> of domains with NOT NULL constraints is arguably a defect of the SQL
> standard, because there are multiple ways to produce a value that
> is NULL and yet must be considered to be of the domain type.
> The subselect-with-no-output case that you show isn't even the most
> common one; I'd say that outer joins where there are domain columns
> on the nullable side are the biggest problem.
>
> There's been some discussion of treating the output of such a join,
> subselect, etc as being of the domain's base type not the domain
> proper.  That'd solve this particular issue since then we'd decide
> we have to cast the base type back up to the domain type (and hence
> check its constraints) before inserting the row.  But that choice
> just moves the surprise factor somewhere else, in that queries that
> used to produce one data type now produce another one.  There are
> applications that this would break.  Moreover, I do not think there's
> any justification for it in the SQL spec.



I do not believe this is a defect of the SQL standard at all. 
SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a 
domain is to constrain the set of valid values that can be stored in a 
column of a base table by various operations."


It seems very clear to me that *storing* a value in a base table must
respect the domain's constraints, even if *operations* on those values
might not respect all of the domain's constraints.


Whether or not it is practical to implement that is a different story, 
but allowing the null value to be stored in a column of a base table 
whose domain specifies NOT NULL is frankly a bug.
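
The loophole under discussion reduces to a few lines (per the Notes
section of the CREATE DOMAIN page):

    create domain dnn as integer not null;
    create table t (d dnn);

    -- the empty scalar subquery yields a NULL that is already considered
    -- to be of type dnn, so no cast (and no constraint check) happens
    insert into t (d) values ((select d from t where false));

    select d is null from t;    -- returns true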

--
Vik Fearing





Re: PostgreSQL domains and NOT NULL constraint

2023-10-12 Thread Tom Lane
Vik Fearing  writes:
> On 10/12/23 15:54, Tom Lane wrote:
>> There's been some discussion of treating the output of such a join,
>> subselect, etc as being of the domain's base type not the domain
>> proper.  That'd solve this particular issue since then we'd decide
>> we have to cast the base type back up to the domain type (and hence
>> check its constraints) before inserting the row.  But that choice
>> just moves the surprise factor somewhere else, in that queries that
>> used to produce one data type now produce another one.  There are
>> applications that this would break.  Moreover, I do not think there's
>> any justification for it in the SQL spec.

> I do not believe this is a defect of the SQL standard at all. 
> SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a 
> domain is to constrain the set of valid values that can be stored in a 
> column of a base table by various operations."

So I wonder what is the standard's interpretation of

regression=# create domain dpos as integer not null check (value > 0);
CREATE DOMAIN
regression=# create table t1 (x int, d dpos);
CREATE TABLE
regression=# create view v1 as select ty.d from t1 tx left join t1 ty using (x);
CREATE VIEW
regression=# \d+ v1
View "public.v1"
 Column | Type | Collation | Nullable | Default | Storage | Description 
+--+---+--+-+-+-
 d  | dpos |   |  | | plain   | 
View definition:
 SELECT ty.d
   FROM t1 tx
 LEFT JOIN t1 ty USING (x);

If we are incorrect in ascribing the type "dpos" to v1.d, where
in the spec contradicts that?  (Or in other words, 4.14 might lay
out some goals for the feature, but that's just empty words if
it's not supported by accurate details in other places.)

regards, tom lane




Re: Some performance degradation in REL_16 vs REL_15

2023-10-12 Thread Andres Freund
Hi,

On 2023-10-12 11:00:22 +0300, Anton A. Melnikov wrote:
> Found that simple test pgbench -c20 -T20 -j8 gives approximately
> for REL_15_STABLE at 5143f76:  336+-1 TPS
> and
> for REL_16_STABLE at 4ac7635f: 324+-1 TPS
> 
> The performance drop is approximately 3.5%, while the corrected standard
> deviation is only 0.3%.
> See the raw_data.txt attached.

Could you provide a bit more details about how you ran the benchmark?  The
reason I am asking is that ~330 TPS is pretty slow for -c20.  Even on spinning
rust and using the default settings, I get considerably higher results.

Oh - I do get results closer to yours if I use pgbench scale 1, causing a lot
of row level contention. What scale did you use?

Greetings,

Andres Freund




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2023-10-12 Thread Nathan Bossart
On Tue, Sep 26, 2023 at 08:13:45AM +0200, Daniel Gustafsson wrote:
>> On 26 Sep 2023, at 00:20, Nathan Bossart  wrote:
>> 
>> On Thu, Sep 21, 2023 at 11:18:00AM +0900, bt23nguyent wrote:
>>> -basic_archive_configured(ArchiveModuleState *state)
>>> +basic_archive_configured(ArchiveModuleState *state, char **logdetail)
>> 
>> Could we do something more like GUC_check_errdetail() instead to maintain
>> backward compatibility with v16?
> 
> We'd still need something exported to call into which isn't in 16, so it
> wouldn't be more than optically backwards compatible since a module written 
> for
> 17 won't compile for 16, or am I missing something?

I only mean that a module written for v16 could continue to be used in v17
without any changes.  You are right that a module that uses this new
functionality wouldn't compile for v16.  But IMHO the interface is nicer,
too, since module authors wouldn't need to worry about allocating the space
for the string or formatting the message.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PostgreSQL domains and NOT NULL constraint

2023-10-12 Thread Vik Fearing

On 10/13/23 02:44, Tom Lane wrote:

> Vik Fearing  writes:
>> On 10/12/23 15:54, Tom Lane wrote:
>>> There's been some discussion of treating the output of such a join,
>>> subselect, etc as being of the domain's base type not the domain
>>> proper.  That'd solve this particular issue since then we'd decide
>>> we have to cast the base type back up to the domain type (and hence
>>> check its constraints) before inserting the row.  But that choice
>>> just moves the surprise factor somewhere else, in that queries that
>>> used to produce one data type now produce another one.  There are
>>> applications that this would break.  Moreover, I do not think there's
>>> any justification for it in the SQL spec.
>
>> I do not believe this is a defect of the SQL standard at all.
>> SQL:2023-2 Section 4.14 "Domains" clearly states "The purpose of a
>> domain is to constrain the set of valid values that can be stored in a
>> column of a base table by various operations."
>
> So I wonder what is the standard's interpretation of
>
> regression=# create domain dpos as integer not null check (value > 0);
> CREATE DOMAIN
> regression=# create table t1 (x int, d dpos);
> CREATE TABLE
> regression=# create view v1 as select ty.d from t1 tx left join t1 ty using (x);
> CREATE VIEW
> regression=# \d+ v1
>                             View "public.v1"
>  Column | Type | Collation | Nullable | Default | Storage | Description
> --------+------+-----------+----------+---------+---------+-------------
>  d      | dpos |           |          |         | plain   |
> View definition:
>  SELECT ty.d
>    FROM t1 tx
>      LEFT JOIN t1 ty USING (x);
>
> If we are incorrect in ascribing the type "dpos" to v1.d, where
> in the spec contradicts that?  (Or in other words, 4.14 might lay
> out some goals for the feature, but that's just empty words if
> it's not supported by accurate details in other places.)

Objection, Your Honor: Relevance.

Regardless of what the spec may or may not say about v1.d, it still 
remains that nulls should not be allowed in a *base table* if the domain 
says nulls are not allowed.  Not mentioned in this thread, but the 
constraints are also applied when CASTing to the domain.
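
For instance, with the dpos domain quoted above, a cast enforces both
constraints (roughly; error text as produced by a recent release):

regression=# select cast(null as dpos);
ERROR:  domain dpos does not allow null values
regression=# select cast(-1 as dpos);
ERROR:  value for domain dpos violates check constraint "dpos_check"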


Now, to answer your straw man, this might be helpful:

SQL:2023-2 Section 11.4, <column definition>, Syntax Rule 9: "If the 
descriptor of D includes any domain constraint descriptors, then T shall 
be a persistent base table."  Your v1 is not that, and therefore 
arguably illegal.


As you know, I am more than happy to (try to) amend the spec where 
needed, but Erki's complaint of a null value being allowed in a base 
table is clearly a bug in our implementation regardless of what we do 
with views.

--
Vik Fearing





Re: Removing unneeded self joins

2023-10-12 Thread Andrei Lepikhov

On 12/10/2023 18:32, Alexander Korotkov wrote:

> On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov wrote:
>> We have almost the results we wanted to have. But in the last explain
>> you can see that nothing happened with the OR clause. We should use the
>> expression mutator instead of the walker to handle such clauses. But it
>> doesn't process the RestrictInfo node ... I'm inclined to put a solution
>> of this issue off for a while.
>
> OK.  I think it's not worth eliminating IS NULL quals with this
> complexity (at least at this stage of the work).

Yeah. I think it would be meaningful if we also replaced nested x IS 
NOT NULL quals with nothing. But that requires using a mutator instead 
of the walker, and may be done more carefully next time.

> I made improvements over the code.  Mostly new comments, grammar
> corrections of existing comments and small refactoring.

Great!

> Also, I found that the suggestion from David Rowley [1] to qsort the
> array of relations to find duplicates faster is still unaddressed.
> I've implemented it.  That helps to avoid quadratic complexity with a
> large number of relations.

I see. The thread is too long so far; thanks for the catch.

> Also I've incorporated improvements from Alena Rybakina except one for
> skipping SJ removal when no SJ quals are found.  It's not yet clear to
> me if this check fixes some cases. But at least the optimization got
> skipped in some useful cases (as you can see in the regression tests).

Agreed. I don't much like it either. But I also suggest skipping some 
unnecessary assertions proposed in that patch:

Assert(toKeep->relid != -1); - quite strange. Why -1? Why not all 
negative numbers, at least?

Assert(is_opclause(orinfo->clause)); - above we already skip clauses 
with rinfo->mergeopfamilies == NIL, and each mergejoinable clause has 
already been checked with is_opclause.

All these changes (see in the attachment) are optional.

--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index f0746f35a3..7b8dc7a2b7 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -1710,8 +1710,6 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark,
 	List	   *binfo_candidates = NIL;
 	ReplaceVarnoContext ctx = {.from = toRemove->relid,.to = toKeep->relid};
 
-	Assert(toKeep->relid != -1);
-
 	/*
 	 * Replace index of removing table with the keeping one. The technique of
 	 * removing/distributing restrictinfo is used here to attach just appeared
@@ -2017,8 +2015,6 @@ match_unique_clauses(PlannerInfo *root, RelOptInfo *outer, List *uclauses,
 			/* Don't consider clauses which aren't similar to 'F(X)=G(Y)' */
 			continue;
 
-		Assert(is_opclause(orinfo->clause));
-
 		oclause = bms_is_empty(orinfo->left_relids) ?
 			get_rightop(orinfo->clause) : get_leftop(orinfo->clause);
 		c2 = (bms_is_empty(orinfo->left_relids) ?
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2ff4881fdf..96ebd6eed3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -367,7 +367,6 @@ CatalogId
 CatalogIdMapEntry
 CatalogIndexState
 ChangeVarNodes_context
-ReplaceVarnoContext
 CheckPoint
 CheckPointStmt
 CheckpointStatsData
@@ -2341,6 +2340,7 @@ ReorderBufferUpdateProgressTxnCB
 ReorderTuple
 RepOriginId
 ReparameterizeForeignPathByChild_function
+ReplaceVarnoContext
 ReplaceVarsFromTargetList_context
 ReplaceVarsNoMatchOption
 ReplicaIdentityStmt
@@ -2474,6 +2474,7 @@ SeenRelsEntry
 SelectLimit
 SelectStmt
 Selectivity
+SelfJoinCandidate
 SemTPadded
 SemiAntiJoinFactors
 SeqScan


Re: [dynahash] do not refill the hashkey after hash_search

2023-10-12 Thread Nathan Bossart
On Thu, Sep 14, 2023 at 04:28:26PM +0800, Junwang Zhao wrote:
> Add a v2 with some change to fix warnings about unused-parameter.
> 
> I will add this to Commit Fest.

This looks reasonable to me.  I've marked the commitfest entry as
ready-for-committer.  I will plan on committing it in a couple of days
unless John has additional feedback or would like to do the honors.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PGDOCS - add more links in the pub/sub reference pages

2023-10-12 Thread Peter Smith
On Thu, Oct 12, 2023 at 3:44 PM Amit Kapila  wrote:
>
> On Mon, Oct 9, 2023 at 12:15 PM Peter Smith  wrote:
> >
> > On Mon, Oct 9, 2023 at 3:32 PM Amit Kapila  wrote:
> > >
> >
> > In v1, I used the same pattern as on the CREATE SUBSCRIPTION page,
> > which doesn't look like those...
> >
>
> Yeah, I think it would have been better if we used params in the
> CREATE SUBSCRIPTION page as well. I don't know if it is a good idea to
> do now with this patch but it makes sense to be consistent. What do
> you think?
>

OK, I have given those changes as separate patches:
- 0002 (changes the CREATE PUBLICATION parameter ids)
- 0003 (changes CREATE SUBSCRIPTION parameter ids)

> > ~~~
> >
> > The "Parameters" section describes some things that really are parameters:
> >
> > e.g.
> > "sql-altersubscription-name"
> > "sql-altersubscription-new-owner"
> > "sql-altersubscription-new-name">
> >
> > I agree, emphasising that those ones are parameters is better. Changed
> > like this in v2.
> >
> > "sql-altersubscription-params-name"
> > "sql-altersubscription-params-new-owner"
> > "sql-altersubscription-params-new-name">
> >
> > ~
> >
> > But, the "Parameters" section also describes other SQL syntax clauses
> > which are not really parameters at all.
> >
> > e.g.
> > "sql-altersubscription-refresh-publication"
> > "sql-altersubscription-enable"
> > "sql-altersubscription-disable"
> >
> > So I felt those ones are more intuitive left as they are  -- e.g.,
> > instead of having ids/linkends like:
> >
> > "sql-altersubscription-params-refresh-publication"
> > "sql-altersubscription-params-enable"
> > "sql-altersubscription-params-disable"
> >
>
> I checked alter_role.sgml which has similar mixed usage and it is
> using 'params' consistently in all cases. So, I would suggest
> following a similar style here.
>

As you wish. Done that way in patch 0001.
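
For anyone picking this up from the archives, the id convention under
discussion looks roughly like this in the SGML (an illustrative sketch;
see the attached patches for the real markup):

<varlistentry id="sql-altersubscription-params-name">
 <term><replaceable class="parameter">name</replaceable></term>
 ...
</varlistentry>

and cross-references then point at linkend="sql-altersubscription-params-name".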

~~

PSA the v5 patches.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v5-0001-Add-more-pub-sub-links.patch
Description: Binary data


v5-0002-Change-ids-for-CREATE-PUBLICATION-parameters.patch
Description: Binary data


v5-0003-Change-ids-for-CREATE-SUBSCRIPTION-parameters.patch
Description: Binary data


Re: Wait events for delayed checkpoints

2023-10-12 Thread Thomas Munro
On Fri, Oct 13, 2023 at 2:19 PM Robert Haas  wrote:
> On Thu, Oct 12, 2023 at 7:09 PM Michael Paquier  wrote:
> > HaveVirtualXIDsDelayingChkpt() does immediately a LWLockAcquire()
> > which would itself report a wait event for ProcArrayLock, overwriting
> > this new one, no?
>
> Ah, right: the wait event should be set and cleared around pg_usleep,
> not the whole loop.

Duh.  Yeah.  Pushed like that.  Thanks both.




Re: pg_upgrade's interaction with pg_resetwal seems confusing

2023-10-12 Thread Amit Kapila
On Fri, Oct 13, 2023 at 12:00 AM Robert Haas  wrote:
>
> On Thu, Oct 12, 2023 at 7:17 AM Amit Kapila  wrote:
> > Now, as mentioned in the first paragraph, it seems we anyway don't
> > need to reset the WAL at the end when setting the next OID for the new
> > cluster with the -o option. If that is true, then I think even without
> > slots work it will be helpful to have such an option in pg_resetwal.
> >
> > Thoughts?
>
> I wonder if we should instead provide a way to reset the OID counter
> with a function call inside the database, gated by IsBinaryUpgrade.
>

I think the challenge in doing so would be that when the server is
running, a concurrent checkpoint can also update the OID counter value
in the control file. See below code:

CreateCheckPoint()
{
...
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
...
UpdateControlFile()
...
}

Now, we can try to pass some startup options like checkpoint_timeout
with a large value to ensure that a checkpoint won't interfere, but I'm
not sure that would be bulletproof. Instead, how about allowing
pg_upgrade to update the control file of the new cluster (with the
required value of OID) following the same method as pg_resetwal does
in RewriteControlFile()?

> Having something like pg_resetwal --but-dont-actually-reset-the-wal
> seems both self-contradictory and vulnerable to abuse that we might be
> better off not inviting.
>

Fair point.

-- 
With Regards,
Amit Kapila.




Re: Add support for AT LOCAL

2023-10-12 Thread Michael Paquier
On Fri, Oct 13, 2023 at 02:20:59AM +0200, Vik Fearing wrote:
> On 10/10/23 05:34, Michael Paquier wrote:
> > I am attaching a v5 that addresses the documentation bits, could you
> > look at the business with date.c?
> 
> Here is a v6

Thanks for the new version.

> which hopefully addresses all of your concerns.

Mostly ;)

The first thing I did was to extract the doc bits about timezone(zone,
time) for AT TIME ZONE from v6 and apply it independently.

I have then looked at the rest and it looked mostly OK to me,
including the extra description you have added for the fifth example
in the docs.  I have tweaked a few things: the regression tests to
make the views a bit more appealing to the eye, an indentation to not
have koel complain and did a catalog bump.  Then applied it.
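
For the archives, the committed behaviour makes these two queries
equivalent, roughly (the output depends on the session's TimeZone
setting):

SELECT TIMESTAMP WITH TIME ZONE '2023-10-13 00:00:00+00' AT LOCAL;
SELECT TIMESTAMP WITH TIME ZONE '2023-10-13 00:00:00+00' AT TIME ZONE current_setting('TimeZone');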
--
Michael


signature.asc
Description: PGP signature

