Cyril Brulebois <[email protected]> (15/01/2009):
> Errm, now that I'm rebooting on a loopy fashion, it looks like those
> patches don't cure the problem totally, so I guess I'm back to
> debugging.

OK, my patches aren't actually fixing the issue, trash them. :)

I've found [1] by Άλκης Γεωργόπουλος, which definitely describes and
fixes my problem. With an automated reboot every 90 seconds, I haven't
been able to reach the timeout.

 1. http://www.zytor.com/pipermail/klibc/2008-June/002319.html

It'd be very nice to see this included.

Also, as noted in [2], I'm reaching the retry code on every boot (it's
easier for me than for Άλκης to reproduce, I believe due to the sync'd
boots), which means that e.g. when 3 boxes out of 4 are stuck, one is
completing the DHCP handshake upon each retry. That is: box1 boots up
while box2 to box4 are waiting. After 10 seconds, box3 completes the
handshake. After another 10 seconds, box4 completes. And finally, after
another 10 seconds, box2 completes.

For our cluster use, we'll probably lower the 10-second delay to one or
two seconds, but it'd be nice to see this other problem fixed too. I'll
try and get back to you with full traces.

The relevant excerpt from [2] describing the problem:
| Output of ipconfig-1.5.10-patched receiving an ARP packet,
| considering it an error and delaying for 10 secs. It didn't drop any
| packets before the error (as the other versions did), the error
| happend before the offer (rare - took me many minutes to reproduce).
| So this ARP error is on all versions. The delay depends on the errors
| received, I've seen all versions needing from 1 sec to some minutes.
| http://users.sch.gr/alkisg/temp/output3.txt

 2. http://www.zytor.com/pipermail/klibc/2008-June/002322.html

Cheers,
-- 
Cyril Brulebois
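
For illustration only: the excerpt from [2] says an incoming ARP packet
is treated as an error and triggers the 10-second delay. A minimal
sketch of the alternative, where unrelated traffic is simply skipped
instead of counted as a failure, could look like the code below. This is
not Άλκης's patch and not klibc code. The names (rx_verdict,
classify_frame) are hypothetical, and it assumes frames arrive with a
full untagged Ethernet II header, which may not match how ipconfig
actually opens its socket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not taken from klibc. */
enum rx_verdict {
    RX_HANDLE,  /* IPv4 frame: hand it to the DHCP/BOOTP parser */
    RX_IGNORE,  /* unrelated traffic (e.g. ARP): keep waiting, no back-off */
    RX_ERROR    /* truncated frame: a genuine receive problem */
};

#define ETH_HDR_LEN    14      /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_ARP  0x0806

/* Decide what to do with one received frame, based on its EtherType only. */
static enum rx_verdict classify_frame(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);

    if (len < ETH_HDR_LEN)
        return RX_ERROR;

    /* EtherType is stored big-endian at offset 12 of the header. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);

    if (ethertype == ETHERTYPE_IPV4)
        return RX_HANDLE;

    /* ARP (ETHERTYPE_ARP) and anything else is not ours: skip it
     * without treating it as a failed DHCP attempt. */
    return RX_IGNORE;
}
```

A receive loop built on this would start the retry delay only on
RX_ERROR or on a real timeout, and would keep polling on RX_IGNORE.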

