Hi, first of all thank you very much for providing the patches to me so fast!
On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:
> Attached is a pair of patches which applied more cleanly. Basically it
> includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
> be good for you to have one way or the other. Fair warning though - I don't
> really have the ability to test 2.6.18 fixes right now, so you're going to
> have to be a bit of a beta tester ;) That said, they look pretty clean to me
> so I have a relatively high confidence that they should work.

I applied both patches as well as all the patches found at
http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/

In addition, I applied the current stable OpenVZ patch for 2.6.18 and a small
patch I wrote myself for IPVS (both were already applied to the unstable
kernel I used last). All the patches applied cleanly and the kernel is up and
running now. It has been stable for a few hours so far, but I can only tell
you more about stability tomorrow.

> I'm not sure why it would be always one node and not the other. We'd
> probably need more detailing information about what's going on to figure
> that out. Maybe some combination of user application + cluster stack
> conspires to put a larger messaging load on it?
>
> Are there any other ocfs2 messages in your logs for that node?

All I found is that, shortly before the crash, it sometimes says

dlm_send_remote_convert_request:395 ERROR: status = -112

instead of

dlm_send_remote_convert_request:395 ERROR: status = -107

I also found some other messages, but they are rather old.
So I am no longer sure whether they appeared during normal operation or while
I was testing some other configuration:

kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (6693,3):dlm_send_proxy_ast_msg:457 ERROR: status = -112
kernel: (6693,3):dlm_flush_asts:589 ERROR: status = -112

and

kernel: o2net: no longer connected to node webhost2 (num 1) at 10.2.0.71:7777
kernel: (27088,2):dlm_do_master_request:1331 ERROR: link to 1 went down!
kernel: (27088,2):dlm_get_lock_resource:915 ERROR: status = -112

You also asked about my cluster setup:

The base is a DRBD 8.0.4 device in primary/primary mode. It is formatted with
OCFS2 as a single partition. This partition holds the private areas of the
OpenVZ virtual environments (VPS). The VPS mostly run webservers, but also
some other network services. Between the two cluster nodes I run an
ultramonkey heartbeat that manages an IPVS load balancer for the webservers,
which live inside the VPS on both cluster nodes on the OCFS2 filesystem.

The machine that crashes is always the one that is the hot standby for IPVS.
I will test whether this changes if I make the other node the hot standby.

> If the two patches here work for you, I'll probably just add them to that
> directory for others to use.

So far your patches work well for me, but whether they really solve my
stability problem I can only tell you tomorrow, when I hopefully see that
both nodes survived the night ;-)

Thanks very much for your expert help,

- Rainer

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
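
A note for readers decoding the log lines in this message: the "status" values printed by the dlm functions look like negated Linux errno numbers, which is the usual kernel convention. Below is a minimal sketch under that assumption. The table uses the numbering and short descriptions from Linux's generic errno header, hard-coded so the result does not depend on the machine running the script. The helper name decode_status is hypothetical and not part of OCFS2.

```python
# Linux errno values (generic numbering), listed only for the codes that
# appear in this thread. Descriptions follow the kernel header comments.
LINUX_ERRNO = {
    11: ("EAGAIN", "Try again"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status: int) -> str:
    """Translate a negative kernel status such as -112 into its errno name."""
    name, text = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return f"status = {status}: {name} ({text})"


# Decode the two status values reported in the logs above.
for status in (-107, -112):
    print(decode_status(status))
```

Read this way, both -107 and -112 point at a lost network connection between the nodes, which would fit the accompanying "o2net: no longer connected to node" messages.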
