On Tue, Jul 02, 2019 at 05:13:43PM +0000, Stuart Henderson wrote:
> On 2019-07-02, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> > Hi misc@!
> >
> > If anyone has got some tips about how to debug two hanging machines we have
> > in our test lab I am eager to learn.
> >
> > The machines run 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> > openup.  Other than that they are rather different, one small Zotac
> > ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> > R230 with Intel Xeon E3-1220.
> >
> > The overall symptoms are that it is possible to switch screens using
> > Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> > prompt.  Alt+Ctrl+Del does not work.  The power button does not work.  I
> > have to long press the power button to force power off.
> >
> > This happens during our nightly tests, which are quite resource intensive.
> >
> > In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> > possibly before the machines become unresponsive, but these entries appear
> > many more times before that point.  And after this "table is full" message
> > there are many syslog entries; on one machine smartd constantly complains
> > about an unreadable (pending) sector and atascsi_passthru_done timeout, and
> > on the other the kernel complains about a probed monitor but no|invalid EDID.
> >
> > So it seems the machine is out of some resource and fails to spawn a login
> > shell.  Any clues to how I can find more details and a remedy?  I suspect a
> > full process table, but wonder how to detect and|or avoid that.
> >
> > I have considered having systat running on a console screen but do not know
> > which systat display that might tell me anything.
> >
> > Best regards
>
> "/bsd: proc: table is full" means that the process table is full, but it
> doesn't tell you what caused this.
>
> The process table size is controlled by kern.maxproc.  It is possible
> that the default is insufficient for your needs, but it's also possible
> that there was a build-up of processes that didn't exit due to another
> problem on the system.
>
> I would leave top(1) running on the system, and also save "ps ax" output
> regularly, then look at that output in the run-up to a failure, to see
> if that gives clues.
>
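
A minimal sketch of the "save ps ax output regularly" suggestion above,
assuming a plain /bin/sh script left running in the background during the
nightly tests.  The names SNAP_DIR, INTERVAL, PS_CMD and snapshot, and the
default directory, are made up for this example; only "ps ax" and
kern.maxproc come from the thread.

```shell
#!/bin/sh
# Sketch: write timestamped "ps ax" snapshots so the run-up to a hang
# can be inspected after the forced power-off.
# SNAP_DIR, INTERVAL and PS_CMD are hypothetical knobs for this example.

SNAP_DIR=${SNAP_DIR:-/var/log/ps-snapshots}   # where snapshots are kept (assumed path)
INTERVAL=${INTERVAL:-60}                      # seconds between snapshots
PS_CMD=${PS_CMD:-ps ax}                       # command whose output is saved

snapshot() {
    ts=$(date +%Y%m%d-%H%M%S)
    out="$SNAP_DIR/ps-$ts.txt"
    mkdir -p "$SNAP_DIR" || return 1
    {
        # Record the configured limit next to the process list; prints
        # "unknown" if the sysctl is not available on this system.
        echo "# $ts kern.maxproc=$(sysctl -n kern.maxproc 2>/dev/null || echo unknown)"
        # Left unquoted on purpose so "ps ax" splits into command and argument.
        $PS_CMD
    } > "$out"
    # Print the file name so callers can see what was written.
    echo "$out"
}

# Loop only when explicitly asked, so the function can also be tried by hand.
if [ "${1:-}" = "loop" ]; then
    while :; do
        snapshot > /dev/null
        sleep "$INTERVAL"
    done
fi
```

Started with the argument "loop", this leaves one file per interval.  After
a hang, the last few files show which processes were piling up, and the
number of lines in each file can be compared with the kern.maxproc value
recorded in its first line.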
Great!  I will do that...

-- 
/ Raimo Niskanen, Erlang/OTP, Ericsson AB