Re: Unable to start replica after failover

2022-09-06 Thread Alexander Kukushkin
On Tue, Sep 6, 2022, 08:46 Lahnov, Igor  wrote:

> Do you think it would be possible to add a check to the restore command
> that a partial or full file already exists?
>
> Or is disabling the restore command a possible solution in this case?
>

My opinion hasn't changed: pg_probackup is doing a weird thing. It shouldn't
restore the partial file in place of the normal one when the normal file
doesn't exist.

Regards,
--
Alexander Kukushkin



RE: Unable to start replica after failover

2022-09-06 Thread Lahnov, Igor
As far as I understand from the logs, the last leader does not yet know about
the new timeline and is trying to download the full log from the previous
timeline. It seems there should be a conflict, since the partial file already
exists locally when restoring in this case, but that does not happen, and the
partial file from the new leader is loaded as a full log.
Do you think it would be possible to add a check to the restore command that a
partial or full file already exists?
Or is disabling the restore command a possible solution in this case?
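Something like the following is what I have in mind for the restore_command
check (an untested sketch; the actual archive fetch command is just a
placeholder):

   # only fetch from the archive if neither the full nor the partial
   # segment is already present locally
   restore_command = 'test ! -f "pg_wal/%f" && test ! -f "pg_wal/%f.partial" && <archive fetch command> "%f" "%p"'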


Re: Postgres SQL unable to handle Null values for Text datatype

2022-09-06 Thread Mladen Gogala

On 9/6/22 02:10, Karthik K L V wrote:
We are migrating from Oracle 12C to Aurora Postgres 13 and running
into query failures when the bind value of a Text datatype resolves to null.


Oracle is actually in the wrong here. Nothing should be equal to null, 
ever. There is also different behavior with unique indexes:


   [mgogala@umajor ~]$ docker start psql14-5
   psql14-5
   [mgogala@umajor ~]$ psql -U scott
   Password for user scott:
   psql (14.5)
   Type "help" for help.

   scott=> create table test1 (key1 integer,key2 integer, data
   varchar(10));
   CREATE TABLE
   scott=> alter table test1 add constraint test1_uq unique(key1,key2);
   ALTER TABLE
   scott=> insert into test1 values(1,null,'a');
   INSERT 0 1
   scott=> insert into test1 values(1,null,'b');
   INSERT 0 1
   scott=> select * from test1;
    key1 | key2 | data
   ------+------+------
       1 |      | a
       1 |      | b
   (2 rows)

The same thing would not work with Oracle. However, please note that,
according to the SQL standard, NULL is not equal to anything, so those 2
rows are actually not a unique constraint violation. To enforce uniqueness
the same way as Oracle does, you actually need 2 indexes, or you can use
the "coalesce" function.
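For illustration, one way to do that (an untested sketch, using the test1
table from above before any data is loaded):

   create unique index test1_uq_nn on test1(key1, key2) where key2 is not null;
   create unique index test1_uq_null on test1(key1) where key2 is null;

   -- or, with coalesce, assuming -1 can never be a real key2 value:
   create unique index test1_uq_c on test1(key1, coalesce(key2, -1));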


--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com


Re: Postgres SQL unable to handle Null values for Text datatype

2022-09-06 Thread Christophe Pettus



> On Sep 5, 2022, at 23:10, Karthik K L V  wrote:
> The above query fails with the below exception when the value of ?1 resolves 
> to null.
> org.postgresql.util.PSQLException: ERROR: operator does not exist: character 
> varying = bytea
>   Hint: No operator matches the given name and argument types. You might need 
> to add explicit type casts.
>   Position: 64

This is a Hibernate issue.  If you use setParameter with a NULL without 
specifying the type, it assumes bytea (with the PostgreSQL JDBC driver, at 
least).  You'll need to use the three-parameter form of setParameter() that 
specifies a type, if the value is going to be NULL.  You can also use the 
set methods on SQLQuery, since the type is specified by the particular 
method there.
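You can reproduce the same failure directly in psql by forcing the types
(a minimal illustration, not the original query):

   select 'abc'::varchar = null::bytea;
   -- ERROR:  operator does not exist: character varying = bytea

   select 'abc'::varchar = null::varchar;
   -- no error; the comparison simply yields NULL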

That being said, PostgreSQL's handling of NULL string values is different from
Oracle's, and this is an area where code changes are often required.



Re: Determine if a user and database are available

2022-09-06 Thread Jeffrey Walton
On Fri, Sep 2, 2022 at 5:43 PM Christophe Pettus  wrote:
>
> > On Sep 2, 2022, at 14:22, Jeffrey Walton  wrote:
> > Given the NOTES in the man page, how do we determine if a user and
> > database are present using the shell? Is there another utility we
> > should be using?
>
> pg_isready literally only checks that the server can be reached over the 
> connection path (network or sockets), not that any login credentials work.  
> You can use psql to do that, though:
>
> psql 
> ... will return an error if the connection information can't be used to 
> successfully log in.

Now available as a direct replacement for pg_isready:
https://github.com/noloader/pg_check_conn
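For anyone who just wants the psql-based check Christophe described, a
minimal shell sketch (the connection parameters are hypothetical):

   psql "host=db.example.com dbname=mydb user=myuser connect_timeout=5" \
        -qAt -c 'select 1' >/dev/null 2>&1 && echo ok || echo unavailable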

Jeff




log_min_messages = warning

2022-09-06 Thread Dirschel, Steve
We recently upgraded from postgres 12.8 to 14.3.  We are running Aurora 
Postgres on AWS.

We have procedures that will make calls to RAISE NOTICE to write out messages 
if you interactively call the procedure through psql.  These procedures are 
getting called by pg_cron.

Since upgrading these RAISE NOTICE messages are getting written to the postgres 
log file:

For example:

13:45:00.720882-05
CONTEXT: PL/pgSQL function perf.snap_stats() line 242 at RAISE
2022-09-06 18:45:01 UTC::@:[21535]:LOG: cron job 2: NOTICE: Snapping 
dba_hist_system_event at 2022-09-06 13:45:00.725818-05

log_min_messages is set to warning, so my understanding is that it should only
log messages at warning, error, log, fatal, or panic.  Any idea how to
troubleshoot why these are getting written to the log file?

Thanks


Feature proposal: immutable/sealed partitions (and maybe tables, too)

2022-09-06 Thread Levi Aul
My company runs some large OLAP data warehouses with append-only,
time-partitioned datasets. Our workloads involve aggregations and joins,
and query the data in ways not amenable to constraint-exclusion; and we
serve a high concurrent number of these queries at once from a single DB.

In other words, our workload is inherently one that acquires "way too many
locks." Our largest performance bottleneck, according to
pg_wait_sampling, is the LockManager itself. Despite most of our queries
spending only milliseconds actually executing, they often spend seconds
during planning waiting to acquire hundreds of access-shared locks.
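(As a rough illustration of how this shows up: our numbers come from
pg_wait_sampling, but repeatedly sampling the core pg_stat_activity view
tells the same story — a sketch:)

   select wait_event_type, wait_event, count(*)
   from pg_stat_activity
   where state = 'active'
   group by 1, 2
   order by 3 desc;
   -- under this kind of contention, LWLock / LockManager dominates the list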

Given that our datasets are append-only, all our partitions for each table
save for the one "active" one (the one for the current time period) are
effectively immutable. No DML-triggered writes will occur to these. I think
this is pretty common in data-warehouse use-cases of PG.

If PG could avoid the need to acquire the locks for these
effectively-immutable partitions, then the remaining number of tables would
be low enough to fit into the per-backend LWLock slots set, and so avoid
LockManager contention. I believe this could be a large optimization not
just for our use-case, but in a number of other high-concurrency OLAP
use-cases.

My proposal for how this "lock elision under large numbers of immutable
partitions" could be accomplished:

1. Add some DDL statement to mark partitions as sealed/unsealed. (ALTER
TABLE ... SEAL PARTITION foo)
2. When query-planning DML against a partition or a partitioned table,
treat a sealed partition as if it had an always-false check constraint.
3. Define a "locking group" abstraction, where many entities can register
themselves under the same lock, such that access to all members of the
locking group requires only acquiring the single locking-group lock. All
sealed partitions of the same table would share a locking group.

Under such a setup, querying a time-based partitioned table with one active
(unsealed) partition would only ever require acquiring, at most, two locks
— the one for the active partition, and the one for the sealed-partitions
locking group.

The trade-off for this is that acquiring an exclusive-access lock on the
sealed-partitions locking-group for a table becomes much more expensive
than it would have been to acquire for a single partition. But this isn't a
problem in practice, because hot-path operations that take an
exclusive-access lock (DML writes) are disallowed against sealed
partitions. The only time the lock-group would need to be exclusive-access
acquired, would be to change its membership — an administrative DDL
operation.

Besides being useful operationally, such a mechanism would also be helpful
on the business-logic level, as you can rely on partition sealing to turn
accidental insertions of new data into any but the active partition(s) into
a constraint violation. (Currently, to achieve this, separate triggers need
to be maintained on each sealed partition.)
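(For reference, the kind of per-partition guard trigger we maintain today
looks roughly like this; the partition name is hypothetical and this is a
simplified sketch:)

   create or replace function reject_sealed_dml() returns trigger
   language plpgsql as $$
   begin
     -- refuse any write to a partition we consider sealed
     raise exception 'partition % is sealed', tg_table_name;
   end;
   $$;

   create trigger seal_guard
     before insert or update or delete on events_2021_q1
     for each row execute function reject_sealed_dml();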

And, with knowledge of the administrative intent for a table to be
immutable, further operational optimizations could be performed. A few off
the top of my head:

1. Running CLUSTER or VACUUM (FULL, FREEZE) after the partition is marked
as immutable could rewrite the table using an implicit "heap_immutable"
access method (still reported as "heap"), which would drop the xmin
column (as everything in a sealed table is guaranteed to be
always-visible), and thus remove the table from consideration for
xid-wraparound-protection rewriting. Such partitions would then require a
rewrite back to "heap" if unsealed.

2. Alternatively, such storage-rewriting DDL statements could switch the
table — and its indices — over to using an entirely different
access-methods, which would store the data+indices in "perfect" packed
forms, to maximize read performance while also minimizing disk usage.

3. ANALYZE could have an (otherwise-impractical) "EXACT" argument, to
populate statistics with exact aggregate values, requiring reading all rows
rather than sampling rows. This could pre-bake table-level aggregates for
most columns, like having a single, table-sized BRIN block-range.

If this concept of "marking as sealed" were extended to tables rather than
only partitions, then further work could be done related to optimization of
bulk loads — e.g. having CREATE TABLE AS ... SEALED not generate WAL
segments for the table as it is populated, but rather treat the table as
UNLOGGED during population, and then, after creation, take the entire
finalized/sealed table's backing files and either pass them directly to
archive_command / send them directly to WAL receivers; or split+stream them
into retrospective WAL segments (each segment containing a single "put this
16MB of data into this file at this position" op), and send those.


Re: log_min_messages = warning

2022-09-06 Thread Adrian Klaver

On 9/6/22 12:07, Dirschel, Steve wrote:
> We recently upgraded from postgres 12.8 to 14.3.  We are running Aurora
> Postgres on AWS.
>
> We have procedures that will make calls to RAISE NOTICE to write out
> messages if you interactively call the procedure through psql.  These
> procedures are getting called by pg_cron.
>
> Since upgrading these RAISE NOTICE messages are getting written to the
> postgres log file:
>
> For example:
>
> 13:45:00.720882-05
>
> CONTEXT: PL/pgSQL function perf.snap_stats() line 242 at RAISE
>
> 2022-09-06 18:45:01 UTC::@:[21535]:LOG: cron job 2: NOTICE: Snapping
> dba_hist_system_event at 2022-09-06 13:45:00.725818-05
>
> log_min_messages is set to warning, so my understanding is that it should
> only log messages at warning, error, log, fatal, or panic.  Any idea how
> to troubleshoot why these are getting written to the log file?



1) Has log_min_messages been commented out?

2) Was the setting changed from something else?

3) If 2) was the server reloaded/restarted to catch the change?

4) Are you sure you are looking at the correct setting?

Try:

select setting, source, sourcefile, pending_restart from pg_settings 
where name = 'log_min_messages';


to see if the value is coming from something like an include file.






--
Adrian Klaver
adrian.kla...@aklaver.com




RE: [EXT] Re: log_min_messages = warning

2022-09-06 Thread Dirschel, Steve
>On 9/6/22 12:07, Dirschel, Steve wrote:
>> We recently upgraded from postgres 12.8 to 14.3.  We are running 
>> Aurora Postgres on AWS.
>> 
>> We have procedures that will make calls to RAISE NOTICE to write out 
>> messages if you interactively call the procedure through psql.  These 
>> procedures are getting called by pg_cron.
>> 
>> Since upgrading these RAISE NOTICE messages are getting written to the 
>> postgres log file:
>> 
>> For example:
>> 
>> 13:45:00.720882-05
>> 
>> CONTEXT: PL/pgSQL function perf.snap_stats() line 242 at RAISE
>> 
>> 2022-09-06 18:45:01 UTC::@:[21535]:LOG: cron job 2: NOTICE: Snapping 
>> dba_hist_system_event at 2022-09-06 13:45:00.725818-05
>> 
>> log_min_messages is set to warning, so my understanding is that it should
>> only log messages at warning, error, log, fatal, or panic.  Any idea
>> how to troubleshoot why these are getting written to the log file?
>
>
>1) Has log_min_messages been commented out?
>
>2) Was the setting changed from something else?
>
>3) If 2) was the server reloaded/restarted to catch the change?
>
>4) Are you sure you are looking at the correct setting?
>
>Try:
>
>select setting, source, sourcefile, pending_restart from pg_settings where 
>name = 'log_min_messages';
>
>to see if the value is coming from something like an include file.
>
>> 
>> Thanks
>> 
>>
>
>-- 
>Adrian Klaver
>adrian.kla...@aklaver.com


shgroup02s=> select setting, source, sourcefile, pending_restart from 
pg_settings where name = 'log_min_messages';
 setting | source  | sourcefile | pending_restart
---------+---------+------------+-----------------
 warning | default |            | f
(1 row)

Regards
Steve


Re: [EXT] Re: log_min_messages = warning

2022-09-06 Thread Tom Lane
"Dirschel, Steve"  writes:
>>> We recently upgraded from postgres 12.8 to 14.3.  We are running 
>>> Aurora Postgres on AWS.

Aurora Postgres != Postgres.  This doesn't seem like something they'd
change, but we can't rule that out entirely.

>> Try:
>> select setting, source, sourcefile, pending_restart from pg_settings where 
>> name = 'log_min_messages';

> shgroup02s=> select setting, source, sourcefile, pending_restart from 
> pg_settings where name = 'log_min_messages';
>  setting | source  | sourcefile | pending_restart
> ---------+---------+------------+-----------------
>  warning | default |            | f
> (1 row)

OK, so log_min_messages has the expected value in your interactive
session, and yet evidently not in the sessions running these RAISE NOTICE
commands.  Maybe you have done ALTER USER SET or ALTER DATABASE SET
commands that'd change it in those sessions?  You could look in
the pg_db_role_setting catalog to answer that definitively.
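For example, a query along these lines will show which databases and roles
have per-session overrides (left joins because 0 means "all"):

   select d.datname, r.rolname, s.setconfig
   from pg_db_role_setting s
     left join pg_database d on d.oid = s.setdatabase
     left join pg_roles r on r.oid = s.setrole;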

regards, tom lane




Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

2022-09-06 Thread Ron

By "SEALED", do you mean "READ ONLY"?

On 9/6/22 14:39, Levi Aul wrote:
> My company runs some large OLAP data warehouses with append-only,
> time-partitioned datasets. Our workloads involve aggregations and joins,
> and query the data in ways not amenable to constraint-exclusion; and we
> serve a high concurrent number of these queries at once from a single DB.
> [...]


--
Angular moment

RE: [EXT] Re: log_min_messages = warning

2022-09-06 Thread Dirschel, Steve
>>>> We recently upgraded from postgres 12.8 to 14.3.  We are running
>>>> Aurora Postgres on AWS.

>Aurora Postgres != Postgres.  This doesn't seem like something they'd change, 
>but we can't rule that out entirely.

>>> Try:
>>> select setting, source, sourcefile, pending_restart from pg_settings 
>>> where name = 'log_min_messages';

>> shgroup02s=> select setting, source, sourcefile, pending_restart from
>> pg_settings where name = 'log_min_messages';
>>  setting | source  | sourcefile | pending_restart
>> ---------+---------+------------+-----------------
>>  warning | default |            | f
>> (1 row)

>OK, so log_min_messages has the expected value in your interactive session, 
>and yet evidently not in the sessions running these RAISE NOTICE commands.  
>Maybe you have done ALTER USER SET or ALTER DATABASE SET commands that'd 
>change it in those sessions?  You could look in the pg_db_role_setting
>catalog to answer that definitively.

>   regards, tom lane

We have opened a case with AWS on this issue since it is Aurora, but I decided
to reach out to the community as well.

We have pg_cron (which is kicking off the procedures that are executing the
code that is causing these NOTICE messages to get written to the log file) set
up with cron.database_name = shgroup02s.  That database is oid 16436.

shgroup02s=> select * from pg_db_role_setting;
 setdatabase | setrole | setconfig
-------------+---------+------------------------------------------------------
       16401 |       0 | {auto_explain.log_min_duration=-1}
       16401 |      10 | {log_min_messages=panic}
       16436 |       0 | {TimeZone=America/Chicago}
           0 |      10 | {TimeZone=utc,log_statement=all,log_min_error_statement=debug5,log_min_messages=panic,exit_on_error=0,statement_timeout=0,role=rdsadmin,auto_explain.log_min_duration=-1,temp_file_limit=-1,"search_path=pg_catalog, public",pg_hint_plan.enable_hint=off,default_transaction_read_only=off}
(4 rows)

If I log in to the shgroup02s database as the postgres user (using psql) and
interactively call one of the procedures that cron calls, the RAISE NOTICE
messages do NOT get written to the log.  The messages are displayed on my
screen, but I don't see them getting written to the log.  I see pg_cron has a
setting, cron.log_min_messages, which is set to warning, so it doesn't seem to
come from that.

Regards
Steve


Re: [EXT] Re: log_min_messages = warning

2022-09-06 Thread Tom Lane
"Dirschel, Steve"  writes:
>  setdatabase | setrole | setconfig
> -------------+---------+------------------------------------------------------
>        16401 |       0 | {auto_explain.log_min_duration=-1}
>        16401 |      10 | {log_min_messages=panic}
>        16436 |       0 | {TimeZone=America/Chicago}
>            0 |      10 | {TimeZone=utc,log_statement=all,log_min_error_statement=debug5,log_min_messages=panic,exit_on_error=0,statement_timeout=0,role=rdsadmin,auto_explain.log_min_duration=-1,temp_file_limit=-1,"search_path=pg_catalog, public",pg_hint_plan.enable_hint=off,default_transaction_read_only=off}
> (4 rows)

> If I log in to the shgroup02s database as the postgres user (using psql) and
> interactively call one of the procedures that cron calls, the RAISE
> NOTICE messages do NOT get written to the log.  The messages are displayed
> on my screen, but I don't see them getting written to the log.

You've evidently done "ALTER USER postgres SET log_min_messages=panic",
so the lack of any messages under that userid is unsurprising.  But
we're not a lot closer to being sure why it's different for the
procedures' normal execution environment.

At this point it's hard to avoid the conclusion that those procedures'
session is changing the value itself.  (This is scary, because it implies
that you're running those as superuser, which seems like a bad idea.)
You might have to enable log_statement = all to verify that.
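For example, temporarily adding something like this near the top of one of
those procedures would show what the cron session actually has in effect
(WARNING so it gets logged regardless of the setting):

   raise warning 'log_min_messages in this session = %',
       current_setting('log_min_messages');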

regards, tom lane




Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

2022-09-06 Thread David Rowley
On Wed, 7 Sept 2022 at 07:40, Levi Aul  wrote:
> In other words, our workload is inherently one that acquires "way too many 
> locks." Our largest performance bottleneck, according to pg_wait_sampling, is 
> the LockManager itself. Despite most of our queries spending only 
> milliseconds actually executing, they often spend seconds during planning 
> waiting to acquire hundreds of access-shared locks.

It would be good to have a better understanding of the problem you're
facing.  There have been many changes to table partitioning since
PostgreSQL 10 and many of those changes affect the number of
partitions which are locked for certain classes of queries.

It would be good to know the following:

1. Which version of PostgreSQL are you having this problem with, and;
2. Example of queries you're having this problem with.

If you share that information we may be able to inform you about
features/performance improvements in newer versions which help with
the problem you're facing.

You mention "constraint-exclusion", that's no longer how we perform
partition pruning and hasn't been since (if I remember correctly)
PostgreSQL 11. Perhaps you're using PG10?

David




Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

2022-09-06 Thread Levi Aul
We're using Postgres 14.5. I meant partition pruning.

To be clear, this isn't a bug report. There is no bug—everything is working
exactly as it should. The partitions are not being pruned because the
workload consists of OLAP aggregations that fetch a small number of rows
spread across all partitions in the set, relying for speed on an index that
isn't prefixed with the partitioning key (nor can it be.)

I'll try to avoid the particulars of the business domain (it gets clumsy
because there's a lot of jargon collisions, with tables named things like
"transactions"), but you can think of it abstractly as follows: we have a
table holding a CQRS event-stream; we are trying to discover all events
related to a particular person, where events "related to a person" are
related either directly (by an indexed field of the event) or indirectly
(by a foreign-key reference to the event from a row in a second partitioned
table — let's call it "event attributes" — where the person is an indexed
field of the event-attribute.) We construct/linearize/uniquify a
time-ordered rowset of all such related events; and then we
reduce/aggregate some value fields from those events. This query requires
index scans of all N partitions of events + all N partitions of
event_attributes.

This workload is performing exactly how you'd expect it to perform (i.e.
badly) given Postgres's current operational pragmatics around partition
locking. The only way it could possibly perform better, is if Postgres
didn't have to acquire N shared-access locks in order to index-scan N
partitions. And the only way that could work, is if Postgres could make
some assumption about the locking behavior of the partitions. Thus my
feature proposal.

To be clear, I'm more interested in discussing the feature proposal than in
solving the immediate problem. The immediate problem has an obvious, if
painful, solution: merging the historical partitions into a single huge
historical partition per table. The feature proposal, meanwhile, has the
potential to solve many of our current business-level problems if the
"further optimizations" I mentioned can be made.

Also, to be clear, I'm interested in implementing the feature I've proposed
myself. I've read the relevant parts of the Postgres codebase and feel
confident I can make the required changes. I would just like to have a
design discussion around the shape the feature should take, with Postgres
stakeholders, before I go to all that effort to build something they might
not accept upstream.

On Tue, Sep 6, 2022 at 5:53 PM David Rowley  wrote:

> On Wed, 7 Sept 2022 at 07:40, Levi Aul  wrote:
> > In other words, our workload is inherently one that acquires "way too
> many locks." Our largest performance bottleneck, according to
> pg_wait_sampling, is the LockManager itself. Despite most of our queries
> spending only milliseconds actually executing, they often spend seconds
> during planning waiting to acquire hundreds of access-shared locks.
>
> It would be good to have a better understanding of the problem you're
> facing.  There have been many changes to table partitioning since
> PostgreSQL 10 and many of those changes affect the number of
> partitions which are locked for certain classes of queries.
>
> It would be good to know the following:
>
> 1. Which version of PostgreSQL are you having this problem with, and;
> 2. Example of queries you're having this problem with.
>
> If you share that information we may be able to inform you about
> features/performance improvements in newer versions which help with
> the problem you're facing.
>
> You mention "constraint-exclusion", that's no longer how we perform
> partition pruning and hasn't been since (if I remember correctly)
> PostgreSQL 11. Perhaps you're using PG10?
>
> David
>


-- 

Levi Aul, CTO, Covalent 


Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

2022-09-06 Thread David Rowley
On Wed, 7 Sept 2022 at 13:33, Levi Aul  wrote:
> To be clear, this isn't a bug report. There is no bug—everything is working 
> exactly as it should. The partitions are not being pruned because the 
> workload consists of OLAP aggregations that fetch a small number of rows 
> spread across all partitions in the set, relying for speed on an index that 
> isn't prefixed with the partitioning key (nor can it be.)

The -hackers mailing list would probably be a better place to discuss
design ideas for new features.  -general is more for general help with
using the software, not hacking on it.

The main reason individual partitions need to be locked is that
they can still be referenced by queries directly, as if they were just
a normal table. To get around that we'd either need to have the
locking groups, as you describe, or remove the ability to access the
partitions directly rather than through the top-level partitioned table.  The
ship has probably sailed on the latter one, but it probably could be
done as an opt-in feature if the former was too difficult or
impractical.

FWIW, I'm not quite seeing why you need "sealed" partitions for the
group locking idea.  I understand the other parts you mentioned about
conversion to a table AM which is more optimized for non-transactional
workloads, but that seems like a different problem that you're mixing
in, and it adds complexity to the whole thing. If that's true, then it
might be better not to mix that in and confuse / complicate your
explanation of the problem and proposed solution.

I'd suggest posting to -hackers and stating that your queries can't
make use of partition pruning, that currently all partitions are
being locked, and that you believe this is a bottleneck.  Include some
examples of perf output to show how large the locking overhead is. Extra
points for hacking up some crude code that skips obtaining the partition
locks, to show what the performance could be if we didn't lock all the
partitions. That'll help show you have a worthy cause. FWIW, I'm
surprised that executor startup / shutdown for a plan which accesses a
large number of partitions is not drowning out the locking overheads.
As far as I knew, this problem was only visible when run-time
partition pruning removed the large majority of the Append/MergeAppend
subnodes and made executor startup/shutdown significantly faster.

David