pg_stat_progress_vacuum comes up empty ...?

2019-07-17 Thread Michael Harris
Hello,

We have a database cluster which recently got very close to XID Wraparound. To get it back under control I've been running a lot of aggressive manual vacuums.

However, I have noticed a few anomalies. When I try to check the status of vacuum commands:

qtodb_pmxtr=# select * from  pg_stat_progress_vacuum;
 pid | datid | datname | relid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
-----+-------+---------+-------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
(0 rows)

Yet there definitely are plenty running:

qtodb_pmxtr=# select pid, state, current_timestamp-query_start as 
duration,query from pg_stat_activity where datname='qtodb_pmxtr' and 
query~'VACUUM' ;
  pid  | state  |    duration     |                                   query
-------+--------+-----------------+------------------------------------------------------------------------
 40615 | active | 13:46:35.081203 | autovacuum: VACUUM qn.mtrds_cnestmeas_oo_18032
 40617 | active | 00:46:35.270232 | autovacuum: VACUUM qn.mtrds_cantu100_oo_18046
 40622 | active | 00:00:04.55167  | autovacuum: VACUUM qn.mtrds_cbeekops_on_17684 (to prevent wraparound)
 25685 | active | 00:00:04.568989 | VACUUM FREEZE pg_toast.pg_toast_228072029;
 25686 | active | 00:00:02.716111 | VACUUM FREEZE pg_toast.pg_toast_228072943;
 25687 | active | 00:00:03.788131 | VACUUM FREEZE pg_toast.pg_toast_228069006;
 25688 | active | 00:00:02.531885 | VACUUM FREEZE pg_toast.pg_toast_228067023;
 25689 | active | 00:00:02.098389 | VACUUM FREEZE pg_toast.pg_toast_228071980;
 25690 | active | 00:00:00.621036 | VACUUM FREEZE pg_toast.pg_toast_228071852;
 25691 | active | 00:00:11.424717 | VACUUM FREEZE pg_toast.pg_toast_228069597;
 25692 | active | 00:00:03.359416 | VACUUM FREEZE pg_toast.pg_toast_228073892;
 25693 | active | 00:00:04.569248 | VACUUM FREEZE pg_toast.pg_toast_228068022;
 25694 | active | 00:00:20.151786 | VACUUM FREEZE pg_toast.pg_toast_228068878;
 25695 | active | 00:00:00.517688 | VACUUM FREEZE pg_toast.pg_toast_228068478;
 25696 | active | 00:00:23.746402 | VACUUM FREEZE pg_toast.pg_toast_228067431;
 25697 | active | 00:00:10.759025 | VACUUM FREEZE pg_toast.pg_toast_228072997;
 25698 | active | 00:00:14.281798 | VACUUM FREEZE pg_toast.pg_toast_228074613;
 25699 | active | 00:00:05.631052 | VACUUM FREEZE pg_toast.pg_toast_228074247;
 25700 | active | 00:00:00.056749 | VACUUM FREEZE pg_toast.pg_toast_228071681;
 28008 | active | 00:00:00        | select pid, state, current_timestamp-query_start as duration,query from pg_stat_activity where datname='qtodb_pmxtr' and query~'VACUUM' ;
(20 rows)

Why don't any of these (manual OR auto) show up in the pg_stat_progress_vacuum?

Another concern: the normal autovacuums seem to be stalling. The table 
qn.mtrds_cnestmeas_oo_18032 should surely not take more than 13 hours to
vacuum, since it is only 160KB in size ...!

qtodb_pmxtr=# select pg_size_pretty(pg_relation_size('qn.mtrds_cnestmeas_oo_18032'::regclass));
 pg_size_pretty
----------------
 160 kB
(1 row)

We have autovacuum_cost_delay set to 0.

I also don't understand why only one autovacuum worker is working on the
wraparound issue, as there are thousands of tables whose relfrozenxid age exceeds
autovacuum_freeze_max_age. I would have thought it would be prioritizing those.
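
For reference, a query along these lines lists the affected tables (a sketch, assuming
the default autovacuum_freeze_max_age of 200 million - substitute the configured value):

    -- Tables (including matviews and TOAST tables) whose relfrozenxid age exceeds the threshold
    SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
      AND age(c.relfrozenxid) > 200000000
    ORDER BY 2 DESC;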

I'm worried that something is wrong with autovacuum on this database, which might be
responsible for it getting into this state to begin with. Other similar databases we
have, running the same application and with similar configuration, are managing to
keep up with the xid freezing nicely.

The database was on 9.6, but was recently upgraded to 11.4.

Any advice welcome!

Cheers
Mike.






Hot Standby Replica Recovery Problem

2018-12-13 Thread Michael Harris
Hello Experts,

We've been having a strange problem with one of our databases.

A summary of the setup follows:

 - We are running PostgreSQL 9.6.9 on CentOS 7.
 - We are using PostgreSQL native streaming replication
 - There is one master and one hot standby
 - The master is archiving its WAL files with this archive command:

   archive_command = 'test ! -f %p || (test ! -f /var/backup2/dbs1a/pg_xlog/%f.gz && pigz -c %p > /var/backup2/dbs1a/pg_xlog/%f.gz)'   # command to use to archive a logfile segment

   /var/backup2 is a shared gfs2 filesystem.

 - Recently we had to do some maintenance that involved rebuilding the database servers.
   At that point dbs1b was the master, so we promoted dbs1a, did the work on dbs1b,
   and then began establishing a new replica on dbs1b using pg_basebackup

 - After pg_basebackup completed we set up recovery.conf:

restore_command='gzip -dc /var/backup2/dbs1a/pg_xlog/%f.gz > %p'
standby_mode='on'
primary_conninfo='host=dbs1a user=xx password=x'

   and then started postgresql.

At first it seemed to be starting up as per normal. It began REDO-ing WAL files from
the archive. It reached a certain point and then failed with this:

[2018-12-14 11:07:39 AEDT]  2050@  LOG:  restored log file 
"00030002C4D70028" from archive
[2018-12-14 11:07:39 AEDT]  29240@  LOG:  fetching timeline history file for 
timeline 3 from primary server
[2018-12-14 11:07:39 AEDT]  29240@  FATAL:  could not receive timeline history 
file from the primary server: ERROR:  could not open file 
"pg_xlog/0003.history": No such file or directory

These messages repeated every few seconds until we stopped the postmaster.

I was surprised because we have done this procedure many times on this & other
databases, but anyway Dr Google suggested this:
https://superuser.com/questions/1127360/cant-find-the-timeline-history-file-to-get-the-replication-working.
After reading that and other linked articles I decided to create a 'dummy' history file.

This time the startup sequence failed like this:

[2018-12-14 16:26:46 AEDT]  16575@  LOG:  restored log file 
"00030002C4D70028" from archive
[2018-12-14 16:26:47 AEDT]  17669@  LOG:  fetching timeline history file for 
timeline 3 from primary server
[2018-12-14 16:26:47 AEDT]  17669@  LOG:  started streaming WAL from primary at 
2C4D7/2800 on timeline 3
[2018-12-14 16:26:47 AEDT]  17669@  FATAL:  could not receive data from WAL 
stream: ERROR:  requested WAL segment 00030002C4D70028 has already been 
removed

Again I was surprised - why wasn't it getting this file from the archives? It is
present and does not seem to be truncated or corrupted (as far as I can tell).

Anyway in an attempt to force it to get the file from the archives, I removed the
primary_conninfo. This time it just repeatedly fetched the file 00030002C4D70028
again and again, never progressing to the next one.

Finally I turned up the debugging output, to see this:

[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  executing restore command "gzip -dc 
/var/backup2/dbs1a/pg_xlog/00030002C4D70028.gz > pg_xlog/RECOVERYXLOG"
[2018-12-14 17:58:25 AEDT]  12365@  LOG:  restored log file 
"00030002C4D70028" from archive
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  got WAL segment from archive
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  switched WAL source from archive to 
stream after failure
[2018-12-14 17:58:25 AEDT]  12365@  DEBUG:  switched WAL source from stream to 
archive after failure

It seems like something is going wrong with the WAL file replay here, although I
cannot see what??

I'm writing this email after the second attempt at replication: the first attempt
ended much the same. I assumed I had done something wrong so I re-started from the
pg_basebackup. Now I am concerned that something is wrong with the master, or maybe
with our WAL archiving process.

Any tips would be gratefully received!

Regards
Mike




RE: Hot Standby Replica Recovery Problem

2018-12-24 Thread Michael Harris
Hello All

We got to the bottom of this in the end - it was due to a misconfiguration that 
resulted in random corruption of WAL archives.

Part of the problem was the use of 'pigz' (multi-threaded gzip tool) for compression
of the WAL files, which in our case led to a file that decompresses without error
into a file of the correct size but in which two thirds of the content was all '0's.
Needless to say I have stopped using pigz!

There were a couple of learnings from the whole exercise: first, during recovery,
corrupted WAL files do not result in error messages describing the corruption in the
log. In fact there is code in xlog.c (XLogPageRead) which validates the WAL file
header, and if it finds any errors it discards that copy of the segment and tries
again. From the log file all you can see is repeated attempts to fetch the WAL file
from alternating sources.

The other wrinkle was the missing timeline history file. Even after fixing the 
WAL corruption, and starting over with a fresh base backup, it still hiccupped 
about 0003.history being mysteriously absent from the master. The following 
sequence was repeatedly present in the log:

[2018-12-24 21:41:26 AEDT]  16246@  LOG:  restored log file 
"00030002CE5500ED" from archive
[2018-12-24 21:41:31 AEDT]  16246@  LOG:  restored log file 
"00030002CE5500EE" from archive
[2018-12-24 21:41:31 AEDT]  16246@  LOG:  restored log file 
"00030002CE5500EF" from archive
[2018-12-24 21:41:32 AEDT]  16246@  LOG:  restored log file 
"00030002CE5500F0" from archive
[2018-12-24 21:41:33 AEDT]  16246@  LOG:  restored log file 
"00030002CE5500F1" from archive

gzip: /var/backup2/dbs1a/pg_xlog/00030002CE5500F2.gz: unexpected end of 
file
[2018-12-24 21:41:33 AEDT]  4287@  LOG:  fetching timeline history file for 
timeline 3 from primary server
[2018-12-24 21:41:33 AEDT]  4287@  FATAL:  could not receive timeline history 
file from the primary server: ERROR:  could not open file 
"pg_xlog/0003.history": No such file or directory

It seems like it reached the end of the WAL archive, then tried to switch to 
streaming, but then that failed because the master did not have a copy of this 
history file. Then it switched back to the WAL archive, and the cycle continued.

In the end I created a fake history file and put it in the master's pg_xlog 
dir. After that the switch to WAL streaming was successful and the standby was 
at last up to speed.
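
For anyone repeating this: it is worth confirming the primary's live timeline first,
so that the fabricated history file matches it. A minimal check, run on the master
(this function exists from 9.6 onwards):

    SELECT timeline_id FROM pg_control_checkpoint();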

Anyway, all is well that ends well!

Cheers
Mike


ANALYZE on partitioned tables vs on individual partitions

2024-08-06 Thread Michael Harris
Hello Experts,

Our application has a database with a large number of partitioned tables
used to store time series data.

It is partitioned by time: new data is populated into the current
partition, old partitions eventually get dropped, and the partitions in
between are largely static.

I had not realized until recently that the partitioned tables themselves
are not analyzed by the autovacuum processes. I am now wondering if we
should be manually analyzing those.

The problem is they are quite enormous and take a long time to analyze - I
tested one yesterday and it took ~30mins, and we have hundreds of them. It
might not be practical to regularly analyze them.

My questions are:

  1. What is the consequence of not having good statistics at the partitioned
table level, if you do have good statistics at the partition level?

  2. Which planning operations rely on partitioned table level statistics?

My reason for asking is I'm trying to understand the tradeoff between the
IO cost associated with frequently analyzing as opposed to possibly
non-optimal plans being chosen.

Thanks in advance, and sorry if this has all been covered before. I could
not find anything definitive in the docs, and while I did find a few
references by googling, none really explained the answers to those
questions too well.

Regards
Mike


Re: ANALYZE on partitioned tables vs on individual partitions

2024-08-07 Thread Michael Harris
Many thanks David for the comprehensive response.

> I think the complaint was about no autovacuum on the partitioned
> table, not the partitions.

Yes, exactly.

One other piece of information: these tables contain a lot of columns, of which
only 4 are normally used for WHERE clauses or joins. The table I was experimenting
with has 150 columns, 156026832 rows and occupies 166GB.

I found that running an ANALYZE specifying only those 4 columns took just
5 minutes, compared to the 30 minutes for the whole table.
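
For reference, the column-restricted run was along these lines (a sketch - the table
and column names are placeholders for our four filter/join columns):

    ANALYZE some_partitioned_table (site_id, metric_id, period_start, rop_id);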

That was a bit of a surprise as I imagined actually reading the table would take
most of the time and would be the same regardless of the number of columns
being analyzed, but I guess that is wrong.

Regards, Mike


On Wed, 7 Aug 2024 at 15:23, David Rowley  wrote:
>
> On Wed, 7 Aug 2024 at 16:44, Christophe Pettus  wrote:
> > Child partitions should be autovacuumed and autoanalyzed just like any 
> > other table; they are not prohibited from autovacuum in any way by default. 
> >  It's probably a good idea to investigate why they are not being picked up 
> > by autovacuum.  If they are created by a bulk load process, it's not a bad 
> > idea to do a VACUUM ANALYZE on them once the bulk load is complete.
>
> I think the complaint was about no autovacuum on the partitioned
> table, not the partitions.  This is expected as we don't track the
> counters (in particular n_mod_since_analyze) shown in
> pg_stat_all_tables at the partitioned table level, so the trigger
> points that normally cause autovacuum to analyze or vacuum a table
> just won't be triggered for a partitioned table.  For VACUUM, that's
> fine as, as you mentioned, no rows are stored. But for analyze, that
> does present a problem.
>
> To name the aspects of planning that rely on statistics of the
> partitioned table, basically anything above the Append or MergeAppend
> which joins the partitioned results together. So that doesn't include
> the scans of each partition and any quals that are pushed down to the
> scan level as those are able to use the partition level statistics.
> However, it does include things like joins, group by, distinct as
> those require n_distinct estimates for the partitioned table. It's not
> all bad though as the row estimates for each individual partition will
> be totalled up through the Append / MergeAppend simply by adding up
> the row estimates for each Append / MergeAppend child plan. So, it's
> really only an estimation problem for any node that comes after a join
> node or a group by node as the output rows for those nodes will depend
> on a good n_distinct estimate for the partitioned table.
>
> Partition-wise joins and aggregates do change things a bit as those
> features do permit moving those operations below the Append / Merge
> Append, in which case the statistics for the individual partition can
> be used.
>
> You could consider manually setting the n_distinct_inherited estimates
> for the columns that you join on or group by in the partitioned table.
> You might find that you're able to choose a suitable value for that if
> you review the documentation for that setting. In particular, please
> review what is mentioned about using negative numbers for that
> setting. You may be able to choose a value that scales correctly with
> the row estimate that doesn't get outdated as you add more rows to the
> partitions. You'll need to determine that based on the data you're
> storing.
>
> David




Re: ANALYZE on partitioned tables vs on individual partitions

2024-08-07 Thread Michael Harris
> You could probably do some sampling of the pg_stat_progress_analyze
> view to figure out what's taking the most time.

I did another run, sampling the pg_stat_progress_analyze view every 30s.
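
The sampling itself was just a repeated query along these lines (a sketch; the view
and its child-table counters are available from PG 13):

    SELECT pid, relid::regclass AS target, phase,
           child_tables_total, child_tables_done
    FROM pg_stat_progress_analyze;
    -- repeated via psql's \watch 30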

For the first 4 minutes it was working on the partitioned table.

After that it began analyzing all the partitions, which took approx 29 minutes.

I think you are correct - an ONLY option for ANALYZE would be a huge benefit. In my
use case, the autovacuum processes are keeping the partitions analyzed so there would
seem to be little benefit to including them in the manual table level ANALYZE.

Regards
Mike

On Wed, 7 Aug 2024 at 18:09, David Rowley  wrote:
>
> On Wed, 7 Aug 2024 at 19:20, Michael Harris  wrote:
> > I found that running an ANALYZE specifying only those 4 columns only took
> > 5 minutes, compared to the 30 minutes for the whole table.
> >
> > That was a bit of a surprise as I imagined actually reading the table would 
> > take
> > most of the time and would be the same regardless of the number of columns
> > being analyzed, but I guess that is wrong.
>
> ANALYZE does do sampling of the data in the table.  It would only read
> all of the rows for fairly small tables.  The docs in [1] mention
> this:
>
> "For large tables, ANALYZE takes a random sample of the table
> contents, rather than examining every row. This allows even very large
> tables to be analyzed in a small amount of time."
>
> I think the reason it's taking so long is not because of it performing
> ANALYZE on the partitioned table which results in gathering statistics
> for the partitioned table which means proportionately (based on the
> size of the partition) sampling rows from each partition, it's more
> likely due to the fact that each partition is also analysed and the
> statistics for each of those is updated. There is no "ANALYZE ONLY"
> command similar to "FROM ONLY" in SELECT queries.
>
> You could probably do some sampling of the pg_stat_progress_analyze
> view to figure out what's taking the most time. If you find that the
> majority of the time is spent analysing the partitions and not the
> partitioned table then maybe we should expand ANALYZE to add the ONLY
> option...
>
> David
>
> [1] https://www.postgresql.org/docs/current/sql-analyze.html
> [2] https://www.postgresql.org/docs/current/progress-reporting.html




ERROR: invalid memory alloc request size when committing transaction

2021-08-11 Thread Michael Harris
Hello Experts

We have a large-ish (16T) database cluster which we are performing the following
sequence on.

  - First we upgrade the whole cluster from pg11 to pg13, using pg_upgrade 
(this succeeds)

  - Next we run a migration script on each database in the cluster. The migration
script converts a large number of tables from inheritance based partitioning to
declarative partitioning. Unfortunately I am not at liberty to share the migration
script, but a sketch of the general shape of the conversion is below.
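
This is not the actual script, and the table, column and partition names are purely
illustrative:

    -- Illustrative only: turn one inheritance parent into a declarative parent
    BEGIN;
    ALTER TABLE measurements RENAME TO measurements_old;
    CREATE TABLE measurements (LIKE measurements_old INCLUDING ALL)
        PARTITION BY RANGE (measured_at);
    -- for each child:
    ALTER TABLE measurements_2021_07 NO INHERIT measurements_old;
    ALTER TABLE measurements ATTACH PARTITION measurements_2021_07
        FOR VALUES FROM ('2021-07-01') TO ('2021-08-01');
    -- ...then drop the old parent (empty in our scheme)...
    DROP TABLE measurements_old;
    COMMIT;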

The second step succeeds for most of the databases in the cluster but fails on one
of them.

The migration is performed inside a transaction, and during the process of committing
the transaction the following error is thrown:

[2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 ERROR:  
invalid memory alloc request size 1073741824
[2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 STATEMENT:  
commit
[2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 WARNING:  
AbortTransaction while in COMMIT state

The transaction is rolled back.

I have looked into the error message - it is very low level from the memory manager
and occurs when a memory allocation of >= 1GB is requested. Most of the hits on google
for this error indicate database corruption, however I am not sure this is the case
for us as we have been able to do a complete pg_dump on the database without errors.

We repeated the migration with all postgres debug logging enabled - however this did
not provide any more detail than above.

Luckily this has occurred while we were testing the procedure on a replica of the
production system, not on the actual production system.

We are using pg 13.0. We are currently re-testing this with huge pages disabled (on a
hunch) and after that we plan to re-test it on 13.3.

Any ideas as to what could be causing this problem, or any suggestions for
troubleshooting steps we could take?

Many thanks in advance,

Cheers
Mike.






RE: ERROR: invalid memory alloc request size when committing transaction

2021-08-11 Thread Michael Harris
Thanks Tom,

> How many is "a large number"?

377k approx.

I thought we had logged all the SQL statements last time we ran this, but we forgot
to set log_statement. We'll do that next time so we can get a figure on exactly how
many statements of each type are in the transaction.
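
Something along these lines is the plan, scoped to the affected database so it needs
only a superuser session and no restart (a sketch):

    ALTER DATABASE db_vrqv1 SET log_statement = 'all';
    -- and afterwards: ALTER DATABASE db_vrqv1 RESET log_statement;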

> Can you adjust your migration script to break it into multiple transactions 
> that process fewer tables per transaction?

We are considering this. It is not ideal, because if conversion fails for any 
reason we would like the database to roll back to the pre-migration state, but 
maybe that's not possible with a database of this size.

I am surprised by the error message though - I thought that if we'd hit some 
limit on a transaction we would get a more specific error.

Cheers
Mike

-Original Message-
From: Tom Lane  
Sent: Thursday, August 12, 2021 1:41 PM
To: Michael Harris 
Cc: 'pgsql-general@lists.postgresql.org' 
Subject: Re: ERROR: invalid memory alloc request size when committing 
transaction

Michael Harris  writes:
>   - Next we run a migration script on each database in the cluster. The 
> migration script
> converts a large number of tables from inheritance based partitioning to 
> declarative 
> partitioning. Unfortunately I am not at liberty to share the migration 
> script.

How many is "a large number"?

> The migration is performed inside a transaction, and during the 
> process of committing the transaction the following error is thrown:
> [2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 
> ERROR:  invalid memory alloc request size 1073741824
> [2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 
> STATEMENT:  commit
> [2021-08-11 11:27:50 CEST] aue_75@218006 218015@[local] db_vrqv1 
> WARNING:  AbortTransaction while in COMMIT state

It's impossible to do more than speculate with this amount of information.
However, a plausible speculation is that you're processing so many objects in 
one transaction as to overrun the limit on how large a WAL commit record, or 
something related to that, can be.  Can you adjust your migration script to 
break it into multiple transactions that process fewer tables per transaction?

regards, tom lane




RE: ERROR: invalid memory alloc request size when committing transaction

2021-08-15 Thread Michael Harris
Thanks Simon

Setting wal_level to minimal has indeed allowed it to work. Thanks for the 
suggestion!
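
For anyone else hitting this, the change amounts to the following (a sketch;
equivalent postgresql.conf edits work too - both parameters need a restart, and
wal_level = minimal also requires max_wal_senders = 0 and is incompatible with WAL
archiving and replication until it is reverted):

    ALTER SYSTEM SET wal_level = 'minimal';
    ALTER SYSTEM SET max_wal_senders = 0;
    -- restart, run the migration, then:
    ALTER SYSTEM RESET wal_level;
    ALTER SYSTEM RESET max_wal_senders;
    -- ...and restart again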

We aren't easily able to change the number of partitions - it consists of a 
large number of partitioned tables each having a reasonable number of 
partitions, but they multiply up to a huge number. Probably it needs to be 
broken into multiple smaller databases, but that's easier said than done.

Thanks again

Cheers
Mike

-Original Message-
From: Simon Riggs  
Sent: Thursday, August 12, 2021 7:19 PM
To: Michael Harris 
Cc: Tom Lane ; pgsql-general@lists.postgresql.org
Subject: Re: ERROR: invalid memory alloc request size when committing 
transaction

On Thu, 12 Aug 2021 at 06:42, Michael Harris  
wrote:
>
> Thanks Tom,
>
> > How many is "a large number"?
>
> 377k approx.

I'm going to guess that it is the invalidation messages for all the DDL that is 
causing the memory allocation error. If you set wal_level = minimal then this 
might work.

The total number indicates you are using too many partitions and should 
probably lower that by a factor of about 100.
If you do that, you probably won't need to set wal_level.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


Undetected Deadlock

2022-01-24 Thread Michael Harris
Hello Experts

I'm hoping you will be able to help me with a tricky issue.

We've recently updated our application to PG 14.1, and in the test instance we
have started to see some alarming undetected deadlocks.

An example of what we have seen is:

(Originally four separate single-row pg_locks results; consolidated here, with the
columns that were all empty - page, tuple, virtualxid, transactionid, classid, objid,
objsubid - omitted.)

 locktype | database |  relation  | virtualtransaction |   pid   |        mode         | granted | fastpath |           waitstart           |      relation
----------+----------+------------+--------------------+---------+---------------------+---------+----------+-------------------------------+---------------------
 relation |   529986 | 1842228045 | 165/1941408        | 2130531 | AccessShareLock     | f       | f        | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
 relation |   529986 | 1842228045 | 75/2193719         | 2128603 | AccessExclusiveLock | t       | f        |                               | st.ctr_table_efr_oa
 relation |   529986 | 1842231489 | 75/2193719         | 2128603 | AccessExclusiveLock | f       | f        | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
 relation |   529986 | 1842231489 | 165/1941408        | 2130531 | AccessShareLock     | t       | f        |                               | st.tpd_oa
(4 rows)

So:
  pid 2130531 waits for an AccessShareLock on relation 1842228045,
blocked by pid 2128603 which holds an AccessExclusiveLock
  pid 2128603 waits for an AccessExclusiveLock on relation 1842231489,
blocked by pid 2130531 which holds an AccessShareLock
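
For the record, the cycle itself is easy to see with a query along these lines (a
sketch using pg_blocking_pids()):

    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           wait_event_type, state, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;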

The queries being executed by these backends are:

   pid   |          query_start          |         state_change          | wait_event_type | wait_event | state  |                            query
---------+-------------------------------+-------------------------------+-----------------+------------+--------+---------------------------------------------------------------
 2128603 | 2022-01-19 00:32:32.924413+01 | 2022-01-19 00:32:32.924413+01 | Lock            | relation   | active | DROP TABLE st.tpd_oa_18929
 2130531 | 2022-01-19 00:32:32.625706+01 | 2022-01-19 00:32:32.625708+01 | Lock            | relation   | active | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868
(2 rows)

Both of these processes had been waiting for nearly 2 hours when I
took the above printouts. My understanding of deadlock detection is
that when a backend is waiting for a lock, it runs a deadlock
detection function once a second so it should have been detected
straight away but somehow it is not.

Are there any deadlock conditions that postgresql is not able to detect?

Note that there were a lot of other processes also waiting on relation
1842231489 - could that be confusing the deadlock detection routine?

I am also confused about the locks which are being taken out by the
DELETE query. Process 2130531 is trying to delete rows from a
partition `st.ctr_table_efr_oa_19010`, and to do so it is trying to
get an AccessShareLock on the parent table `st.ctr_table_efr_oa`. I
don't really understand why that is, and in fact if I try to reproduce
the deadlock manually by executing the relevant SQL in psql, the
DELETE never tries to take out such a lock and no deadlock occurs.

What are the circumstances where deleting rows directly from a partition will cause
it to take a lock on the parent table?

Re: Undetected Deadlock

2022-01-24 Thread Michael Harris
My apologies,

After posting this and looking at how it appears I realised that line
wrapping makes the tables totally illegible.

Here they are again with all unnecessary columns removed and others shortened.

 locktype | database |  relation  |   pid   |        mode         | granted |           waitstart           |      relation
----------+----------+------------+---------+---------------------+---------+-------------------------------+---------------------
 relation |   529986 | 1842228045 | 2130531 | AccessShareLock     | f       | 2022-01-19 00:32:32.626152+01 | st.ctr_table_efr_oa
 relation |   529986 | 1842228045 | 2128603 | AccessExclusiveLock | t       |                               | st.ctr_table_efr_oa
 relation |   529986 | 1842231489 | 2128603 | AccessExclusiveLock | f       | 2022-01-19 00:32:32.924694+01 | st.tpd_oa
 relation |   529986 | 1842231489 | 2130531 | AccessShareLock     | t       |                               | st.tpd_oa


   pid   |    query_start     |    state_change    | wait_event_type | wait_event |                            query
---------+--------------------+--------------------+-----------------+------------+---------------------------------------------------------------
 2128603 | 00:32:32.924413+01 | 00:32:32.924413+01 | Lock            | relation   | DROP TABLE st.tpd_oa_18929
 2130531 | 00:32:32.625706+01 | 00:32:32.625708+01 | Lock            | relation   | DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868

I hope this version is easier to read.

Cheers
Mike

Re: Undetected Deadlock

2022-01-26 Thread Michael Harris
Hi Alvaro

Thanks for the feedback!

> What version were you using previously?

We were previously on 11.4. Another difference is that we were using
inheritance based partitioning before, whereas now we are using
declarative partitioning.

> Maybe the lock is already taken before the DELETE is run; do you have
> any triggers, rules, constraints, or anything?

There are no triggers, rules or constraints on the table involved in
the DELETE (either the partition or the table that the partition is
part of).

Even more confusingly - when I reproduce the SQL statements that
should be triggering the deadlock, it does not happen: the DELETE does
not attempt to take an AccessShareLock on the parent table, so it does
not deadlock.

Is there any state associated with a transaction or a database
connection that would alter the lock(s) that gets taken out for a
DELETE on a partition? Or can other concurrent transactions in other
processes cause more locks to be needed somehow?

> If you have seen this several times already, maybe a way to investigate 
> deeper is an
> exhaustive log capture of everything that these transactions do

So far it has happened at least twice. There were a couple of other
incidents that may well also have been caused by this, but not enough
data was collected at the time to be sure.

A bit more detail: the two processes that are deadlocked here are one
that is ingesting new data, while the other is removing old data by
dropping partitions. Even before we shifted to 14.1 and native
partitioning, we did get deadlocks between these two processes every
so often which we could not really prevent, so we adopted a retry
approach when it does occur. However we never had an undetected
deadlock in the database.

Since going to 14.1 & native partitioning, we are seeing a lot more
frequent deadlocks. It looks like the increase in frequency might be
related to the extra lock taken by the DELETE that I mentioned above.
However most of the time the deadlock is detected and released by
Postgres and the impact is minimal. Of course it is another story if
it is not detected!

I have enabled `log_statement=all`, but the undetected deadlock hasn't
happened again since. I can easily reproduce the deadlock itself, but
not the undetected case.

Thanks again.

Cheers
Mike

On Wed, 26 Jan 2022 at 10:11, Alvaro Herrera  wrote:
>
> On 2022-Jan-25, Michael Harris wrote:
>
> > We've recently updated our application to PG 14.1, and in the test instance 
> > we
> > have started to see some alarming undetected deadlocks.
>
> This is indeed suspicious / worrisome / curious.
>
> What version were you using previously?
>
> I reformatted the result sets:
>
> > An example of what we have seen is:
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | 
> > transactionid | classid | objid | objsubid | virtualtransaction |   pid   | 
> >  mode   | granted | fastpath |   waitstart   |  
> >   relation
> > --+--++--+---++---+-+---+--++-+-+-+--+---+
> >  relation |   529986 | 1842228045 |  |   || 
> >   | |   |  | 165/1941408| 2130531 | 
> > AccessShareLock | f   | f| 2022-01-19 00:32:32.626152+01 | 
> > st.ctr_table_efr_oa
> > (1 row)
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | 
> > transactionid | classid | objid | objsubid | virtualtransaction |   pid   | 
> >mode | granted | fastpath | waitstart | relation
> > --+--++--+---++---+-+---+--++-+-+-+--+---+
> >  relation |   529986 | 1842228045 |  |   || 
> >   | |   |  | 75/2193719 | 2128603 | 
> > AccessExclusiveLock | t   | f|   | st.ctr_table_efr_oa
> > (1 row)
> >
> >  locktype | database |  relation  | page | tuple | virtualxid | 
> > transactionid | classid | objid | objsubid | virtualtransaction |   pid   | 
> >mode | granted | fastpath |   waitstart   | 
> > relation
> > --+--++--+---++---+-+---+--++-+-+-+--+---+---
> >  relation |   529986 | 1842231489 |  |   || 
&g

Re: Undetected Deadlock

2022-01-26 Thread Michael Harris
> I must be missing something. You mentioned dropping a partition, so is there 
> an actual need for the delete? Could you detach concurrently and then drop 
> the table or delete rows if needed?

The DELETE is part of a transaction performing data loading. Our
application allows data to be overwritten. Before using a COPY to
actually store the new data, we do a DELETE to make sure there are no
existing records. We tested a number of sequences and found that the
most efficient is to perform a DELETE first, even though in most cases
there is nothing to delete, and then do the COPY (rather than for
example checking if it exists first and then deleting it).
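
The loading transaction is essentially this shape (a sketch; the file path is
illustrative, the table and ropid are taken from the example above):

    BEGIN;
    DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;
    COPY st.ctr_table_efr_oa_19010 FROM '/data/incoming/rop_44788868.dat' (FORMAT csv);
    COMMIT;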

The DROP is part of a separate transaction that is doing the data
expiry. Note that it is not dropping the same partition; old
partitions get dropped, new data is loaded into newer partitions.

Cheers
Mike




Re: Undetected Deadlock

2022-01-27 Thread Michael Harris
> but I would expect drop concurrently to resolve your issue with the two 
> processes conflicting

I guess you mean ALTER TABLE DETACH PARTITION ... CONCURRENTLY?
DROP TABLE does not seem to have a concurrent option.
Still that does seem like a good option to try, thanks for drawing it
to my attention.
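
If I understand it correctly the expiry would then look something like this (a sketch
using the table names from the earlier deadlock; note that DETACH ... CONCURRENTLY
cannot run inside a transaction block):

    ALTER TABLE st.tpd_oa DETACH PARTITION st.tpd_oa_18929 CONCURRENTLY;
    DROP TABLE st.tpd_oa_18929;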

Cheers
Mike

On Thu, 27 Jan 2022 at 14:20, Michael Lewis  wrote:
>
> There may be a bug so perhaps still pursue reproducing the issue, but I would 
> expect drop concurrently to resolve your issue with the two processes 
> conflicting. Also, perhaps trying"insert, on conflict do update" could be 
> more efficient than the copy but obviously there are too many unknowns and 
> variables for either to be a definite winner.




Re: Undetected Deadlock

2022-01-31 Thread Michael Harris
Hi

The undetected deadlock occurred again today and I was able to collect
some more info.

The presentation was very similar to the case I reported previously:
- One backend trying to do a DROP TABLE on a partition of a partitioned table,
  waiting for an AccessExclusiveLock on that table
- Another backend trying to do a DELETE on records within a partition of another
  table, waiting on an AccessShareLock which is already held by the first backend,
  while already holding a lock on the table that the first backend is waiting for
- A load of other backends also trying to do a DELETE.

I was able to attach gdb to the backends and I discovered a few things.

First, taking a stack trace of the process doing the DELETE, it looks like this:

#0  0x7fc82a6750bb in epoll_wait (epfd=13, events=0x18696f8,
maxevents=maxevents@entry=1, timeout=timeout@entry=-1) at
../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x007d0c32 in WaitEventSetWaitBlock (nevents=1,
occurred_events=<optimized out>, cur_timeout=-1, set=0x1869698) at
latch.c:1450
#2  WaitEventSetWait (set=0x1869698, timeout=<optimized out>,
occurred_events=<optimized out>, nevents=1, wait_event_info=<optimized out>) at latch.c:1396
#3  0x007d0f94 in WaitLatch (latch=<optimized out>,
wakeEvents=wakeEvents@entry=33, timeout=timeout@entry=0,
wait_event_info=50331648) at latch.c:473
#4  0x007eaea7 in ProcSleep
(locallock=locallock@entry=0x24571d0,
lockMethodTable=lockMethodTable@entry=0xdb2360 )
at proc.c:1361
#5  0x007df419 in WaitOnLock
(locallock=locallock@entry=0x24571d0, owner=owner@entry=0x18b80f8) at
lock.c:1858
#6  0x007e052c in LockAcquireExtended
(locktag=locktag@entry=0x7ffced1d0b80, lockmode=lockmode@entry=1,
sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false,
reportMemoryError=reportMemoryError@entry=true,
locallockp=locallockp@entry=0x7ffced1d0b78) at lock.c:1100
#7  0x007ddb23 in LockRelationOid (relid=1842227607,
lockmode=1) at lmgr.c:117
#8  0x0050dba5 in relation_open (relationId=1842227607,
lockmode=lockmode@entry=1) at relation.c:56
#9  0x008fc838 in generate_partition_qual (rel=0x56614e8) at
partcache.c:361
#10 generate_partition_qual (rel=0x56614e8) at partcache.c:336
#11 0x007502bf in set_baserel_partition_constraint
(relation=relation@entry=0x56614e8, rel=<optimized out>,
rel=<optimized out>) at plancat.c:2402
~snip~

The lock is being taken out by the function generate_partition_qual
(in partcache.c) which has this comment:

/*
 * Grab at least an AccessShareLock on the parent table.  Must do this
 * even if the partition has been partially detached, because transactions
 * concurrent with the detach might still be trying to use a partition
 * descriptor that includes it.
 */

This is happening during query planning rather than execution, where
it is trying to fetch the partition bound expression.
It explains why this lock is not always required (the result is cached
in the relcache, and anyway it looks to me like this code path is
called depending on factors such as the row estimates).
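
That also suggests a way to observe it without running the DELETE - a sketch (whether
the parent actually shows up depends on whether the partition qual is already cached,
per the code path above):

    BEGIN;
    EXPLAIN DELETE FROM st.ctr_table_efr_oa_19010 WHERE ropid = 44788868;
    SELECT relation::regclass, mode, granted
    FROM pg_locks
    WHERE pid = pg_backend_pid() AND locktype = 'relation';
    ROLLBACK;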

The second thing was that the DROP process, and all the DELETE
processes, were all waiting inside epoll_wait ultimately called from
ProcSleep as shown above. It is sitting in this section of code
(proc.c line 1361):

        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, 0,    <--- Processes sitting here
                         PG_WAIT_LOCK | locallock->tag.lock.locktag_type);
        ResetLatch(MyLatch);

        /* check for deadlocks first, as that's probably log-worthy */
        if (got_deadlock_timeout)    <-- set breakpoint here
        {
            CheckDeadLock();
            got_deadlock_timeout = false;
        }

I set a breakpoint on the next line after ResetLatch is called, and it
was never reached. I thought that could be due to GDB attaching to the
process and disrupting the alarm signal that should be sent into the
timeout handler, so I manually sent it a SIGALRM.
At that point the breakpoint was triggered - but the flag
got_deadlock_timeout is not set, so the deadlock check is not
executed.
This led me to believe that the deadlock check timeout wasn't set
properly at all.

To see if the deadlock would be detected if CheckDeadlock() got an
opportunity to run, I manually set got_deadlock_timeout, and sure
enough the deadlock was detected and released.

I concluded that the deadlock detection function is able to detect
these deadlocks, but for some reason the 1sec timeout is not being
correctly scheduled so it never runs.
When it is artificially triggered, it does release the deadlock.

After the release, a new deadlock formed between the DROP process and
another one of the waiting DELETE processes (not an entirely
unexpected outcome).
I will leave it in that state for as long as possible in case anyone
can think of any other info that should be gathered. Luckily it is a
test instance.

Sorry for the slightly long and detailed email - let me know if this
should be moved to 

Re: Undetected Deadlock

2022-02-02 Thread Michael Harris
Hi again

Some good news. After some more debugging & reflection, I realized
that the likely cause is one of our own libraries that gets loaded as
part of some custom functions we are using.

Some of these functions trigger fetching of remote resources, for which a timeout is
set using `alarm`. The function unfortunately does not re-establish any pre-existing
interval timers after it is done, which leads to postgresql missing its own expected
alarm signal.

The reason that this was not affecting us on previous postgres
versions was this commit:


https://github.com/postgres/postgres/commit/09cf1d52267644cdbdb734294012cf1228745aaa#diff-b12a7ca3bf9c6a56745844c2670b0b28d2a4237741c395dda318c6cc3664ad4a

After this commit, once an alarm is missed, that backend never sets one again, so no
timeouts of any kind will work. Therefore, the deadlock detector was never being run.
Prior to that, the next time any timeout was set by the backend it would re-establish
its timer.

We will of course fix our own code to prevent this issue, but I am a
little concerned at the above commit as it reduces the robustness of
postgres in this situation. Perhaps I will raise it on the
pgsql-hackers list.

Cheers
Mike

Re: Undetected Deadlock

2022-02-03 Thread Michael Harris
> If Michael's analysis were accurate, I'd agree that there is a robustness
> issue, but I don't think there is.  See timeout.c:220:

Actually that only sets a new timer after the nearest timeout has expired.

The pattern I was seeing went like this:

1. Command occurs during which a signal was not delivered (due to our
custom function). signal_pending is set and never cleared.

2. This command finishes normally, so the deadlock timeout is removed.
As long as there are either no more active timeouts, or the nearest
active timeout is still in the future, then no interval timer is set.

3. A later command starts, and again because the new timeout is in the
future, no new interval timer is set.

4. This transaction gets into a deadlock, but because the interval
timer is not running, ALRM is never received so the process is stuck.

In fact once a backend gets into this state it is permanently stuck
like this because new calls to schedule timeouts are always scheduling
them for a future time.

I was thinking that to improve robustness, we could add a check for
`now < signal_due_at` to schedule_alarm line 300:

    if (signal_pending && now < signal_due_at && nearest_timeout >= signal_due_at)
        return;

That way, even if we think we are due for a signal, if we are already
past the time we were expecting it we go ahead and set a new one
anyway.

Even though the fault in this case was a faulty custom function, there
could be other scenarios in which a signal isn't delivered (that bit
of code Tom highlighted is specifically to cater for that, going by
the comments).

One other thing that struck me when reading this code: the variable
signal_due_at is not declared as volatile sig_atomic_t, even though it
is read from / written to by the signal handler. Maybe that could
cause problems?

One other question: I've fixed our custom function, so that it
correctly restores any interval timers that were running, but of
course if our function takes longer than the remaining time on the
postgres timer the signal will be delivered late. Beyond the fact that
a deadlock or statement timeout will take longer than expected, are
there any other negative consequences? Are any of the timeouts
time-critical such that being delayed by a few seconds would cause a
problem?

Thanks for any feedback.

Regards
Mike




Re: Undetected Deadlock

2022-02-09 Thread Michael Harris
On Mon, 7 Feb 2022 at 09:57, Tom Lane  wrote:
> Do you want to try this and see if it actually adds any robustness with your 
> buggy code?

Sorry for the delayed response, & thanks for the patch.

I wasn't able to test with our actual application because it could
take days for it to actually trigger the problem, so I tested it with
a simulation, which you can find here:

https://github.com/harmic/pg_almloss

With that simulation I could attach gdb to the backend and see that
signal_pending & signal_due_at were being reset in the expected way,
even when a missed interrupt was triggered.

I'm convinced your patch improves robustness under the scenario we saw.

Thanks again!

Cheers
Mike