I understand that. But that's not what the user experienced in this case.
One node ran into the o2hb timeout (and panic) that caused the o2net
message on the other node.

These are two separate issues. FWIW, I am trying to get the o2net config
backported to the 1.2 tree.

Andy Phillips wrote:
With respect, Sunil,
 the observed problems I see normally go like this:

- o2net timeout - socket closes.
Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
times that might help debug the situation: (tmr 1154545576.798263 now

- Upper layers realise they have no connection, and panic the box.
Aug 2 19:06:27 fred kernel: o2net: no longer connected to node barney
(num 0) at 172.16.6.10:7777
Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
fencing this node because it is connected to
a half-quorum of 1 out of 2 nodes which doesn't include the lowest
active node 0

Irrespective of that, the o2net message observed comes about due to the
value of O2NET_HEARTBEAT_TIMEOUT, not the o2cb heartbeat.
The code that is probably giving you that error message is:

The function o2net_idle_timer, which is referenced in your error
message, is in ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c

	printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been "
	       "idle for 10 seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
	mlog(ML_NOTICE, "here are some times that might help debug the "
	     "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
	     "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
	     sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
	     now.tv_sec, now.tv_usec,
	     sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
	     sc->sc_tv_advance_start.tv_sec, sc->sc_tv_advance_start.tv_usec,
	     sc->sc_tv_advance_stop.tv_sec, sc->sc_tv_advance_stop.tv_usec,
	     sc->sc_msg_key, sc->sc_msg_type,
	     sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
	     sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);

The original post only included that error message, but the other error
messages usually follow. If I'm wrong, please email me directly and help
sort out my understanding.
Andy

On Mon, 2007-01-22 at 10:38 -0800, Sunil Mushran wrote:
o2net timeout cannot cause the o2hb panic. The two are totally
different. From the outputs, I would guess o2hb is timing out, but
I cannot say for sure until I see the full logs.

Andy Phillips wrote:
It's worth pointing out that the o2net idle timer triggers on the network heartbeat, which is 10 seconds in the current 1.2.x series.


O2CB_HEARTBEAT_THRESHOLD has no effect on this; the problem is caused by
a different part of the code.

see ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h
#define O2NET_IDLE_TIMEOUT_SECS         10

Andy


On Mon, 2007-01-22 at 09:29 -0800, Srinivas Eeda wrote:
The problem appears to be that IO is taking more time than the
effective O2CB_HEARTBEAT_THRESHOLD. Your configured value "31" doesn't
seem to be taking effect:

Index 6: took 1995 ms to do msleep
Index 17: took 1996 ms to do msleep
Index 22: took 10001 ms to do waiting for read completion.

Can you please cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold and verify?
Thanks,
--Srini.




Consulente3 wrote:
Hi all,
my test environment is composed of 2 servers with CentOS 4.4;
the storage is exported with aoe6-43 + vblade-14

kernel-2.6.9-42.0.3.EL
ocfs2-tools-1.2.2-1
ocfs2console-1.2.2-1
ocfs2-2.6.9-42.0.3.EL-1.2.3-1

/dev/etherd/e2.0 on /ocfs2 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/etherd/e3.0 on /ocfs2_nfs type ocfs2 (rw,_netdev,heartbeat=local)

Device                FS     Nodes
/dev/etherd/e2.0      ocfs2  ocfs2, becks
/dev/etherd/e3.0      ocfs2  ocfs2, becks

Device                FS     UUID                                  Label
/dev/etherd/e2.0      ocfs2  b24cc18d-af89-4980-a75e-a87530b1b878  test1
/dev/etherd/e3.0      ocfs2  101a92fd-b83b-4294-8bfc-fbaa069c3239  nfs4

O2CB_HEARTBEAT_THRESHOLD=31

when I try to run a stress test:

Index 4: took 0 ms to do checking slots
Index 5: took 2 ms to do waiting for write completion
Index 6: took 1995 ms to do msleep
Index 7: took 0 ms to do allocating bios for read
Index 8: took 0 ms to do bio alloc read
Index 9: took 0 ms to do bio add page read
Index 10: took 0 ms to do submit_bio for read
Index 11: took 2 ms to do waiting for read completion
Index 12: took 0 ms to do bio alloc write
Index 13: took 0 ms to do bio add page write
Index 14: took 0 ms to do submit_bio for write
Index 15: took 0 ms to do checking slots
Index 16: took 1 ms to do waiting for write completion
Index 17: took 1996 ms to do msleep
Index 18: took 0 ms to do allocating bios for read
Index 19: took 0 ms to do bio alloc read
Index 20: took 0 ms to do bio add page read
Index 21: took 0 ms to do submit_bio for read
Index 22: took 10001 ms to do waiting for read completion
(3,0):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing


<6>o2net: connection to node ocfs2 (num 2) at 10.1.7.107:777 has been
idle for 10 seconds, shutting it down
(3,0):o2net_idle_timer:1309 here are some times that might help debug
the situation:
(tmr 1169487957.71650 now 1169487967.69569 dr 1169487962.88883 adv
1169487957.71671:1159487957.71674
func (83bce37b2:505) 1169487901.984644:1169487901.984676)

the kernel panic always occurs on the same node, and the other node
keeps responding

thanks!
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users