I understand that. But that's not what the user experienced in this case.
One node ran into the o2hb timeout (and panic) that caused the o2net
message on the other node.
These are two separate issues. FWIW, I am trying to get the o2net config
backported to the 1.2 tree.
Andy Phillips wrote:
With respect, Sunil,
the problems I observe normally go like this:
- o2net timeout: the socket closes.
Aug 2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
Aug 2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now
- Upper layers realise they have no connection, and panic the box.
Aug 2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:7777
Aug 2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0
Irrespective of that, the o2net message observed comes about due to the
value of O2NET_IDLE_TIMEOUT_SECS, not the o2cb heartbeat.
The code that is probably giving you that error message is the function
o2net_idle_timer, referenced in your error message, in
ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c:
printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been idle "
       "for 10 seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
mlog(ML_NOTICE, "here are some times that might help debug the "
     "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
     "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
     sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
     now.tv_sec, now.tv_usec,
     sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
     sc->sc_tv_advance_start.tv_sec,
     sc->sc_tv_advance_start.tv_usec,
     sc->sc_tv_advance_stop.tv_sec,
     sc->sc_tv_advance_stop.tv_usec,
     sc->sc_msg_key, sc->sc_msg_type,
     sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
     sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);
The original post included only that error message, but the other error
messages usually follow. If I'm wrong, please email me directly and help
sort out my understanding.
Andy
On Mon, 2007-01-22 at 10:38 -0800, Sunil Mushran wrote:
o2net timeout cannot cause the o2hb panic. The two are totally
different. From the outputs, I would guess o2hb is timing out, but
I cannot say for sure until I see the full logs.
Andy Phillips wrote:
It's worth pointing out that the o2net idle timer is triggered by the
network heartbeat, which is 10 seconds in the current 1.2.x series.
O2CB_HEARTBEAT_THRESHOLD has no effect on this; it is a different part
of the code that causes the problem.
see ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h
#define O2NET_IDLE_TIMEOUT_SECS 10
Andy
On Mon, 2007-01-22 at 09:29 -0800, Srinivas Eeda wrote:
The problem appears to be that I/O is taking longer than the effective
O2CB_HEARTBEAT_THRESHOLD allows. Your configured value of 31 does not
seem to be in effect:
Index 6: took 1995 ms to do msleep
Index 17: took 1996 ms to do msleep
Index 22: took 10001 ms to do waiting for read completion.
Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify.
Thanks,
--Srini.
Consulente3 wrote:
Hi all,
My test environment is composed of 2 servers running CentOS 4.4;
the nodes export storage with aoe6-43 + vblade-14.
kernel-2.6.9-42.0.3.EL
ocfs2-tools-1.2.2-1
ocfs2console-1.2.2-1
ocfs2-2.6.9-42.0.3.EL-1.2.3-1
/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)
Device FS Nodes
/dev/etherd/e2.0 ocfs2 ocfs2, becks
/dev/etherd/e3.0 ocfs2 ocfs2, becks
Device FS UUID Label
/dev/etherd/e2.0 ocfs2 b24cc18d-af89-4980-a75e-a87530b1b878 test1
/dev/etherd/e3.0 ocfs2 101a92fd-b83b-4294-8bfc-fbaa069c3239 nfs4
O2CB_HEARTBEAT_THRESHOLD=31
When I run a stress test:
Index 4: took 0 ms to do checking slots
Index 5: took 2 ms to do waiting for write completion
Index 6: took 1995 ms to do msleep
Index 7: took 0 ms to do allocating bios for read
Index 8: took 0 ms to do bio alloc read
Index 9: took 0 ms to do bio add page read
Index 10: took 0 ms to do submit_bio for read
Index 11: took 2 ms to do waiting for read completion
Index 12: took 0 ms to do bio alloc write
Index 13: took 0 ms to do bio add page write
Index 14: took 0 ms to do submit_bio for write
Index 15: took 0 ms to do checking slots
Index 16: took 1 ms to do waiting for write completion
Index 17: took 1996 ms to do msleep
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio alloc read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 10001 ms to do waiting for read completion
(3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing
<6>o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been
idle for 10 seconds, shutting it down
(3,0): o2net_idle_timer:1309 here are some times that might help debug
the situation:
(tmr: 1169487957.71650 now 1169487967.69569 dr 1169487962.88883 adv
1169487957.71671:1159487957.71674
func 83bce37b2:505) 1169487901.984644:1169487901.984676)
The kernel panic always occurs on the same node, and the other node
keeps responding.
thanks!
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users