We have seen this twice. The first time, it was on a 32 bit Fedora Core 3 system during a `postgres stop`. The stop hung because one of the processes was in async_notify and refused to get out. We were unable to do further debugging because the db was in shutdown. We were forced to kill -9 the process.
The second time we saw this, today, was on a 64 bit Opteron Fedora Core 3 system during operation. One of our postgres processes was hung in async_notify causing all further processing by that process to be stopped. All other processes were fine, including ones which used notify. After a variety of debugging and shutdown attempts (described in further detail below), we were forced to kill -9 the process Unfortunately, we do not know how to reproduce this (though we are willing to try if anyone has any ideas). We are obviously looking for a patch, a recovery method more subtle than kill -9, or at least something to do the next time we see this problem to help further debugging. A description of what happened, a dump of pg_locks, some `ps` examples, and some final postgres shutdown messages are included below. Any attention you could spare would be most appreciated. The way the notify gets called is a bit convoluted. Our C program performs an insert in table T. Some triggers run doing all sorts of stuff in the background involving changes to other tables. Under some circumstances, a row is inserted by the trigger into table W (a trigger off of that is executed) and the parent trigger performs a notify. This, while expensive, normally all works. However, at a customer site we saw the postgres process hang out in async_notify and refuse to allow more inserts. As far as we know, there was a listener (a perl process) for that notify (we failed to look at pg_listeners) but we were able to use psql to listen on that value and notify on it as well. We doubt the notify rate was high enough to cause hundreds or even dozens of notifies to be outstanding, but it is certainly possible that a handful were. After struggling with this for a bit (which included sending the hung process a SIGUSR2 because of some emails on this list which seemed to suggest a relationship between the signal and notifies), we attempted to stop the system. We killed off all postgres initiators (including the process doing the insert which caused it to hang in async_notify) but the program remained hung. We initiated a database shutdown (whapped the postmaster with a SIGTERM) but the postgres remained in async_notify. We tried directly sending the hung process a SIGTERM but that did not help. Finally, we were forced to send the process the uncatchable signal SIGKILL, which obviously worked. psql=# select * from pg_locks; relation | database | transaction | pid | mode | granted ----------+----------+-------------+-------+-----------------+--------- | | 565845 | 11603 | ExclusiveLock | t 16759 | 17212 | | 469 | AccessShareLock | t | | 2153119 | 469 | ExclusiveLock | t different psql (pid 1013) psql=# select * from pg_locks; relation | database | transaction | pid | mode | granted ----------+----------+-------------+-------+-----------------+--------- | | 565845 | 11603 | ExclusiveLock | t | | 2154368 | 1013 | ExclusiveLock | t 16759 | 17212 | | 1013 | AccessShareLock | t ---------------------------------------------------------------------- 26535 ? S 0:08 postmaster -i -D postgres/data 26536 ? S 0:47 postgres: stats buffer process 26537 ? S 0:27 postgres: stats collector process 26551 ? Ss 0:00 initlog -c pg_autovacuum -U user1 26599 ? S 0:19 pg_autovacuum -U user1 29684 ? S 16:31 postgres: sysd sysd 10.139.140.58 idle 30881 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 10549 ? S 0:11 postgres: sysd sysd 10.139.140.58 idle 10626 ? S 0:03 postgres: sysd sysd 10.139.140.58 idle 10668 ? S 1:53 postgres: sysd sysd 10.139.140.58 idle 11178 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 11258 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 11603 ? S 1:14 postgres: sysd sysd 10.139.140.58 async_notify 11718 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 32506 pts/3 T 0:00 psql 32553 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 32649 pts/3 T 0:00 psql -U user1 32696 ? S 0:00 postgres: user1 sysd 10.139.140.58 idle 374 ? S 0:00 postgres: sysd sysd 10.139.140.58 idle 389 pts/3 S+ 0:00 grep postgres ---------------------------------------------------------------------- After killing off *all* postgres initiators, including the parent of 11603. ---------------------------------------------------------------------- 26535 ? S 0:08 postmaster -i -D postgres/data 26536 ? S 0:47 postgres: stats buffer process 26537 ? S 0:28 postgres: stats collector process 11603 ? S 1:14 postgres: sysd sysd 10.139.140.58 async_notify ---------------------------------------------------------------------- Apr 25 11:01:02 localhost postgres[26535]: [5-1] LOG: server process (PID 11603) was terminated by signal 9 Apr 25 11:01:02 localhost postgres[26535]: [6-1] LOG: terminating any other active server processes Apr 25 11:01:02 localhost postgres[26535]: [7-1] LOG: all server processes terminated; reinitializing Apr 25 11:01:02 localhost postgres[1947]: [8-1] LOG: database system was interrupted at 2005-04-25 10:56:05 EDT Apr 25 11:01:02 localhost postgres[1947]: [9-1] LOG: checkpoint record is at 0/153152B8 Apr 25 11:01:02 localhost postgres[1947]: [10-1] LOG: redo record is at 0/153152B8; undo record is at 0/0; shutdown FALSE Apr 25 11:01:02 localhost postgres[1947]: [11-1] LOG: next transaction ID: 2156499; next OID: 66294 Apr 25 11:01:02 localhost postgres[1947]: [12-1] LOG: database system was not properly shut down; automatic recovery in progress Apr 25 11:01:02 localhost postgres[1947]: [13-1] LOG: record with zero length at 0/15315300 Apr 25 11:01:02 localhost postgres[1947]: [14-1] LOG: redo is not required Apr 25 11:01:04 localhost postgres[1947]: [15-1] LOG: database system is ready Apr 25 11:01:04 localhost postgres[1949]: [8-1] LOG: shutting down Apr 25 11:01:06 localhost postgres[1949]: [9-1] LOG: database system is shut do ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly