Hi Gavin,

On 08/06/2013 01:59 PM, Gavin Jones wrote:
> Hi Goldwyn,
>
> Apologies for the delayed reply.
>
> The hung Apache process / OCFS issue cropped up again, so I thought
> I'd pass along the contents of /proc/<pid>/stack of a few affected
> processes:
>
> gjones@slipapp02:~> sudo cat /proc/27521/stack
> gjones's password:
> [<ffffffff811663b4>] poll_schedule_timeout+0x44/0x60
> [<ffffffff81166d56>] do_select+0x5a6/0x670
> [<ffffffff81166fbe>] core_sys_select+0x19e/0x2d0
> [<ffffffff811671a5>] sys_select+0xb5/0x110
> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [<00007f394bdd5f23>] 0x7f394bdd5f23
> [<ffffffffffffffff>] 0xffffffffffffffff
> gjones@slipapp02:~> sudo cat /proc/27530/stack
> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [<00007f394bdddb77>] 0x7f394bdddb77
> [<ffffffffffffffff>] 0xffffffffffffffff
> gjones@slipapp02:~> sudo cat /proc/27462/stack
> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [<00007f394bdddb77>] 0x7f394bdddb77
> [<ffffffffffffffff>] 0xffffffffffffffff
> gjones@slipapp02:~> sudo cat /proc/27526/stack
> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [<00007f394bdddb77>] 0x7f394bdddb77
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Additionally, in dmesg I see, for example,
>
> [774981.361149] (/usr/sbin/httpd,8266,3):ocfs2_unlink:951 ERROR: status = -2
> [775896.135467] (/usr/sbin/httpd,8435,3):ocfs2_check_dir_for_entry:2119 ERROR: status = -17
> [775896.135474] (/usr/sbin/httpd,8435,3):ocfs2_mknod:459 ERROR: status = -17
> [775896.135477] (/usr/sbin/httpd,8435,3):ocfs2_create:629 ERROR: status = -17
> [788406.624126] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991450, last ping 4491992701, now 4491993952
> [788406.624138] connection1:0: detected conn error (1011)
> [788406.640132] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991451, last ping 4491992702, now 4491993956
> [788406.640142] connection2:0: detected conn error (1011)
> [788406.928134] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991524, last ping 4491992775, now 4491994028
> [788406.928150] connection4:0: detected conn error (1011)
> [788406.944147] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991528, last ping 4491992779, now 4491994032
> [788406.944165] connection5:0: detected conn error (1011)
> [788408.640123] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991954, last ping 4491993205, now 4491994456
> [788408.640134] connection3:0: detected conn error (1011)
> [788409.907968] connection1:0: detected conn error (1020)
> [788409.908280] connection2:0: detected conn error (1020)
> [788409.912683] connection4:0: detected conn error (1020)
> [788409.913152] connection5:0: detected conn error (1020)
> [788411.491818] connection3:0: detected conn error (1020)
>
> that repeats for a bit and then I see
>
> [1952161.012214] INFO: task /usr/sbin/httpd:27491 blocked for more than 480 seconds.
> [1952161.012219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1952161.012221] /usr/sbin/httpd D ffff88081fc52b40 0 27491 27449 0x00000000
> [1952161.012226] ffff88031a85dc50 0000000000000082 ffff880532a92640 ffff88031a85dfd8
> [1952161.012231] ffff88031a85dfd8 ffff88031a85dfd8 ffff8807f791c300 ffff880532a92640
> [1952161.012235] ffffffff8115f3ae ffff8802804bdd98 ffff880532a92640 00000000ffffffff
> [1952161.012239] Call Trace:
> [1952161.012251] [<ffffffff81538fea>] __mutex_lock_slowpath+0xca/0x140
> [1952161.012257] [<ffffffff81538b0a>] mutex_lock+0x1a/0x40
> [1952161.012262] [<ffffffff81160e80>] do_lookup+0x290/0x340
> [1952161.012269] [<ffffffff81161c7f>] path_lookupat+0x10f/0x700
> [1952161.012274] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
> [1952161.012279] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
> [1952161.012283] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
> [1952161.012288] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
> [1952161.012293] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [1952161.012308] [<00007f394bdcfb05>] 0x7f394bdcfb04
> [1952161.012382] INFO: task /usr/sbin/httpd:27560 blocked for more than 480 seconds.
> [1952161.012384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1952161.012385] /usr/sbin/httpd D ffff88081fd52b40 0 27560 27449 0x00000000
> [1952161.012389] ffff880224023c50 0000000000000086 ffff88024326e580 ffff880224023fd8
> [1952161.012393] ffff880224023fd8 ffff880224023fd8 ffff8807f79cc800 ffff88024326e580
> [1952161.012397] ffffffff8115f3ae ffff8802804bdd98 ffff88024326e580 00000000ffffffff
> [1952161.012401] Call Trace:
> [1952161.012406] [<ffffffff81538fea>] __mutex_lock_slowpath+0xca/0x140
> [1952161.012410] [<ffffffff81538b0a>] mutex_lock+0x1a/0x40
> [1952161.012415] [<ffffffff81160e80>] do_lookup+0x290/0x340
> [1952161.012420] [<ffffffff81161c7f>] path_lookupat+0x10f/0x700
> [1952161.012425] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
> [1952161.012430] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
> [1952161.012434] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
> [1952161.012438] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
> [1952161.012442] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [1952161.012448] [<00007f394bdcfb05>] 0x7f394bdcfb04
> [1953601.012280] INFO: task /usr/sbin/httpd:27506 blocked for more than 480 seconds.
> [1953601.012282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1953601.012284] /usr/sbin/httpd D ffff88081fd52b40 0 27506 27449 0x00000000
> [1953601.012287] ffff8804044cf988 0000000000000082 ffff8802e1bfc340 ffff8804044cffd8
> [1953601.012291] ffff8804044cffd8 ffff8804044cffd8 ffff8807f79cc800 ffff8802e1bfc340
> [1953601.012296] 0000000000012b40 ffff8804044cfb58 7fffffffffffffff ffff8802e1bfc340
> [1953601.012300] Call Trace:
> [1953601.012305] [<ffffffff815387a2>] schedule_timeout+0x272/0x2f0
> [1953601.012311] [<ffffffff815396e2>] wait_for_common+0xd2/0x180
> [1953601.012345] [<ffffffffa03780a2>] __ocfs2_cluster_lock.isra.34+0x1f2/0x6d0 [ocfs2]
> [1953601.012395] [<ffffffffa0379578>] ocfs2_inode_lock_full_nested+0x168/0x9d0 [ocfs2]
> [1953601.012436] [<ffffffffa037edd2>] ocfs2_permission+0x32/0xf0 [ocfs2]
> [1953601.012466] [<ffffffff8115f503>] inode_permission+0x73/0x110
> [1953601.012472] [<ffffffff811612cd>] link_path_walk+0x22d/0x850
> [1953601.012477] [<ffffffff81161bc1>] path_lookupat+0x51/0x700
> [1953601.012482] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
> [1953601.012486] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
> [1953601.012490] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
> [1953601.012494] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
> [1953601.012498] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
> [1953601.012504] [<00007f394bdcfb05>] 0x7f394bdcfb04
>
This is the most interesting backtrace. The process is definitely waiting for a DLM lock while holding a mutex taken in the fstat() call. All the others are waiting for this process to finish.

> Your comment about the network got me thinking; I don't see on the
> hosts, via any other means or monitoring, that the network interface is
> going down. However, I was logged into the (EqualLogic) SAN (where the
> OCFS2 volumes reside) console yesterday and noticed some iSCSI session
> messages like:
>
> iSCSI session to target ........ was closed. Load balancing request
> was received on the array.
>
> Is it possible / probable that the iSCSI load balancing on the SAN is
> causing OCFS2 problems? I'm wondering if exceeding the iSCSI ping
> timeout leads to a connection error, which then causes OCFS2 to be
> unsure of cluster state or have some DLM issue, which then leads to
> hung processes. I know it seems like I'm shooting in the dark, but I
> don't have much else to go on.

Do you have iSCSI and ocfs2 communication on the same network interface? (A quick way to check is sketched in the PS at the end of this mail.) If yes, the "connection4:0" messages are from your iSCSI subsystem, and your network does seem to have a problem there (I don't know what exactly, though).

> Still working on permission for the DLM debugging, hope to have it
> turned on and get some messages for you either today or tomorrow.
>
> Thanks,
>
> Gavin W. Jones
> Where 2 Get It, Inc.
>
> On Thu, Jul 18, 2013 at 5:54 PM, Goldwyn Rodrigues <rgold...@suse.de> wrote:
>> On 07/18/2013 11:42 AM, Gavin Jones wrote:
>>>
>>> Hello,
>>>
>>> Sure, I'd be happy to provide such information next time this occurs.
>>>
>>> Can you elaborate, or point me at documentation / procedure regarding
>>> the DLM debug logs and what would be helpful to see? I have read
>>> "Troubleshooting OCFS2" [1] and the section "Debugging File System
>>> Locks" -- is this what you're referring to?
>>
>> No. I was looking for a more proactive approach we use in debugging.
>>
>> # debugfs.ocfs2 -l
>>
>> will provide you a list of debug messages you can turn on/off.
>>
>> In order to turn on DLM_GLUE (the layer between ocfs2 and DLM) specific
>> operations, issue
>>
>> # debugfs.ocfs2 -l DLM_GLUE allow
>>
>> Please note, this generates a lot of messages.
>>
>>> Not sure if this will provide additional context or just muddy the
>>> waters, but I thought to provide some syslog messages from an affected
>>> server the last time this occurred.
>>>
>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704093] o2net: Connection to node slipapp03 (num 2) at 172.16.40.122:7777 has been idle for 30.97 secs, shutting it down.
>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704146] o2net: No longer connected to node slipapp03 (num 2) at 172.16.40.122:7777
>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704279] (kworker/u:1,12787,4):dlm_do_assert_master:1665 ERROR: Error -112 when sending message 502 (key 0xdc8be796) to node 2
>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704295] (kworker/u:5,26056,5):dlm_do_master_request:1332 ERROR: link to 2 went down!
>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704301] (kworker/u:5,26056,5):dlm_get_lock_resource:917 ERROR: status = -112
>>> Jul 14 15:37:25 slipapp07 kernel: [2173618.784153] o2net: No connection established with node 2 after 30.0 seconds, giving up.
>>> <snip>
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.920793] (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -112 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.920833] (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR: A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -112 send CREATE LOCK to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.930562] (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.944998] (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.951511] (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.973848] (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.990216] (kworker/u:2,13894,7):dlm_do_assert_master:1665 ERROR: Error -107 when sending message 502 (key 0xdc8be796) to node 4
>>> Jul 14 15:39:14 slipapp07 kernel: [2173728.024139] (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR: A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -107 send CREATE LOCK to node 4
>>> <snip, many, many more like the above>
>>>
>>> Which I suppose would indicate DLM issues; I have previously tried to
>>> investigate this (via the abovementioned guide) but was unable to make
>>> real headway.
>>
>> No, this means your network is the problem, which in turn is affecting your
>> DLM operations. The problem could be anywhere from hardware to the network
>> device to possibly a bug in the networking code. You may want to check if
>> there are other indications that the network interface is down.
>>
>>> I apologize for the rather basic questions...
>>
>> No problem. I hope this helps resolve your issues.
>>
>>> Thanks,
>>>
>>> Gavin W. Jones
>>> Where 2 Get It, Inc.
>>>
>>> [1]: http://docs.oracle.com/cd/E37670_01/E37355/html/ol_tshoot_ocfs2.html
>>
>> <snipped>
>>
>> --
>> Goldwyn
>
>

--
Goldwyn
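
PS: On the shared-interface question above, here is a rough way to check. This is only a sketch and assumes the default open-iscsi tooling and the standard o2cb configuration file (/etc/ocfs2/cluster.conf); adjust the paths if your setup differs.

# iscsiadm -m session -P 1
  (lists each iSCSI session with the target portal address it uses)

# grep ip_address /etc/ocfs2/cluster.conf
  (lists the addresses o2net uses for the ocfs2 cluster interconnect)

# ip route get <portal address from the iscsiadm output>
  (shows which local interface carries the traffic to that portal)

If the iSCSI portals and the o2net addresses resolve to the same interface, an iSCSI load-balancing or path-failover event on the SAN that disturbs that link will also disturb o2net/DLM traffic. That would be consistent with the errors quoted above: -112 is EHOSTDOWN and -107 is ENOTCONN, i.e. the interconnect to the peer node went away.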