On Wed, Oct 1, 2014 at 11:10 AM, Robert Haas <robertmh...@gmail.com> wrote: > As far as I can tell, it's configured to run everything. I just > checked, though, and found it wedged again. I'm not sure whether it > was the same problem, though; I ended up just killing all of the > postgres processes to fix it. We may be just at the beginning of an > exciting debugging journey.
After much more time spent on this, and some help from two people whose names are both six letters starting with "Andre", I discovered the following: - anole apparently had not tried to build master in a long time - I think since May - because of a leftover "inst" directory instead HEAD's buildroot. - anole still refused to build either 9.4 or master until I removed a couple of leftover socket directories in /tmp. - after that it hung again trying to run the test_shm_mq regression tests, in pretty much the same way as before. The postmaster state looks like this: 0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1 (gdb) bt #0 0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1 #1 0xc00000000045f950:0 in select+0x90 () from /usr/lib/hpux64/libc.so.1 #2 0x4000000000a4e510:0 in ServerLoop () at postmaster.c:1540 #3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178) at postmaster.c:1223 #4 0x40000000008d0a90:0 in main (argc=3, argv=0x87ffffffffffd178) at main.c:227 (gdb) p BackgroundWorkerList $1 = {head = {next = 0x60000000000e8a58}} (gdb) p BackgroundWorkerList.head $2 = {next = 0x60000000000e8a58} (gdb) p ((char *) BackgroundWorkerList.head) - 264 $3 = 0x60000000000e8950 "test_shm_mq" (gdb) p (RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264) $4 = (struct RegisteredBgWorker *) 0x60000000000e8950 (gdb) p ((RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264))[0] $5 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\000 \000\00511985", '\000' <repeats 13 times>, "\005T\000\000\000\000\000\001?\000 \000\00512038\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\005t\000\000\000\000\000\001?\000 \000\00511917", '\000' <repeats 13 times>, "\005?\000\000\000\000\000\001?\000 \000\0051209", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000 \000\00511903", '\000' <repeats 13 times>, "\005?\000\000\000\000\000\001?\000 \000\0051199", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 12369, rw_child_slot = 35, rw_crashed_at = 0, rw_shmem_slot = 6, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8938}} (gdb) p $5.rw_lnode $6 = {next = 0x60000000000e8938} (gdb) p ((RegisteredBgWorker *) (((char *) $5.rw_lnode) - 264))[0] $7 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?x\000 \000\00511931", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?}\000 \000\00511939\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\004?\000\000\000\000\000\001?\000 \000\00511957", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?\000 \000\0051209", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000 \000\00511956", '\000' <repeats 13 times>, "\005\024\000\000\000\000\000\001?\000 \000\0051192", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 5, rw_terminate = 0 '\000', rw_lnode = { next = 0x60000000000e8818}} (gdb) p ((RegisteredBgWorker *) (((char *) $7.rw_lnode) - 264))[0] $8 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?f\000 \000\00512051", '\000' <repeats 13 times>, "\0044\000\000\000\000\000\001?g\000 \000\00512047\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\004T\000\000\000\000\000\001?j\000 \000\00511979", '\000' <repeats 13 times>, "\004d\000\000\000\000\000\001?k\000 \000\0051197", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?m\000 \000\00511873", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?u\000 \000\0051196", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 4, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e86f8}} (gdb) p ((RegisteredBgWorker *) (((char *) $8.rw_lnode) - 264))[0] $9 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\\\000 \000\00512085", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?]\000 \000\00512076\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\003?\000\000\000\000\000\001?_\000 \000\00511885", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?`\000 \000\0051206", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?a\000 \000\00512065", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?c\000 \000\0051188", bgw_main_arg = 695156898, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665468967706, rw_shmem_slot = 3, rw_terminate = 0 '\000', rw_lnode = { next = 0x60000000000e85d8}} (gdb) p ((RegisteredBgWorker *) (((char *) $9.rw_lnode) - 264))[0] $10 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\f\000 \000\00512045", '\000' <repeats 13 times>, "\003\024\000\000\000\000\000\001?\016\000 \000\00512049\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\0034\000\000\000\000\000\001?=\000 \000\00512057", '\000' <repeats 13 times>, "\003D\000\000\000\000\000\001?A\000 \000\0051208", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?B\000 \000\00512092", '\000' <repeats 13 times>, "\003d\000\000\000\000\000\001?C\000 \000\0051192", bgw_main_arg = 507386864, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665463586145, rw_shmem_slot = 2, rw_terminate = 0 '\000', rw_lnode = { next = 0x60000000000e84b8}} (gdb) p ((RegisteredBgWorker *) (((char *) $10.rw_lnode) - 264))[0] $11 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001??\000 \000\00511937", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001??\000 \000\00511941\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\002?\000\000\000\000\000\001?\000\000 \000\00511950", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\002\000 \000\0051195", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\a\000 \000\00511960", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\b\000 \000\0051198", bgw_main_arg = 1747912200, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462962221, rw_shmem_slot = 1, rw_terminate = 0 '\000', rw_lnode = { next = 0x60000000000e8398}} (gdb) p ((RegisteredBgWorker *) (((char *) $11.rw_lnode) - 264))[0] $12 = {rw_worker = { bgw_name = "test_shm_mq", '\000' <repeats 11 times>, " \000\000\000\000\000\000\000 ", '\000' <repeats 23 times>, "\021?\000\000\000\000\000\001??", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\001??\000(\000\t12115_fsm", '\000' <repeats 17 times>, "\022(\000\000\000\000\000\001??\0000\000\020pg_i", bgw_function_name = "test_shm_mq_main", '\000' <repeats 14 times>, "\0228\000\000\000\000\000\001?\003\000 \000\00552236", '\000' <repeats 13 times>, "\022H", bgw_main_arg = 502019512, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462860980, rw_shmem_slot = 0, rw_terminate = 0 '\000', rw_lnode = {next = 0x0}} (gdb) fr 3 #3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178) at postmaster.c:1223 1223 status = ServerLoop(); (gdb) p StartWorkerNeeded $13 = 1 '\001' So apparently commit a39e78b710eb588e102aedd2828611d7bc74714b, a good idea though it certainly was, wasn't the root cause of this issue - because the symptoms now seem identical to what I saw debugging earlier hangs, but I verified that those changes are present here. Further debugging reveals that sigusr1_handler() gets called repeatedly, to start autovacuum workers, and it keeps waking up and starting them. But that doesn't cause the background workers to get started either, because although sigusr1_handler() contains a call to maybe_start_bgworker, it only does that if start_bgworker = true, which only happens if CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true, which on these calls it isn't. And I think this might also be the missing ingredient answering Andres's question from before: why doesn't the 60-second select()-timeout cause the background worker to eventually start even if the SELECT doesn't get interrupted? There seems to be a SIGUSR1 arriving about every 3 seconds, and I bet that's resetting the timer every time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers