On 2019-07-02, Raimo Niskanen <raimo+open...@erix.ericsson.se> wrote:
> Hi misc@!
>
> If anyone has got some tips about how to debug two hanging machines we have
> in our test lab I am eager to learn.
>
> The machines run 6.5, amd64 and are patched up to 005_libssl using M:Tier's
> openup. Other than that they are rather different, one small Zotac
> ZBox-AD02 with AMD E-350 at 1.6 GHz, and one rack mounted Dell PowerEdge
> R230 with Intel Xeon E3-1220.
>
> The overall symptoms are that it is possible to switch screens using
> Alt+Ctrl+F1..Fn, but when logging in as root the greeting prints but no
> prompt. Alt+Ctrl+Del does not work. The power button does not work. I
> have to long press the power button to force power off.
>
> This happens during our nightly tests, that are quite resource intensive.
>
> In /var/log/messages I find suspicious entries "/bsd: proc: table is full"
> possibly before the machines become unresponsive, but these entries appear
> many more times before that point. And after this "table is full" message
> there are many syslog entries; on one machine smartd constantly complains about
> an unreadable (pending) sector and atascsi_passthru_done timeout, and on
> the other the kernel complains about a probed monitor but no|invalid EDID.
>
> So it seems the machine is out of some resource and fails to spawn a login
> shell. Any clues to how I can find more details and a remedy? I suspect a
> full process table, but wonder how to detect and|or avoid that.
>
> I have considered having systat running on a console screen but do not know
> which systat display that might tell me anything.
>
> Best regards

"/bsd: proc: table is full" means that the process table is full, but it doesn't tell you what caused this. The process table size is controlled by kern.maxproc, it is possible that the default is insufficient for your needs, but it's also possible that there was a build-up of processes that didn't exit due to another problem on the system. I would leave top(1) running on the system, and also save "ps ax" output regularly, then look at that output in the run-up to a failure, to see if that gives clues.