Zach van Rijn wrote:
On Wed, 2022-11-30 at 21:21 -0600, Jacob Bachmeyer wrote:
...
Do you have logs farther back?


Yes; I've attached some going back about ten days. Thank you for
the analysis, by the way. It is an interesting theory. I would
tend to agree with Bruno that memory should be checked before
drawing any conclusions from the logs.

I had tried to find a possible cause other than bad hardware, since you had stated that you did not want to jump to that conclusion.

While I am not intricately familiar with Linux-on-SPARC, the logs you attached appear to contain a large number of oopses, which, combined with...

When I mentioned failed reboots, I meant that when the system
first comes up, the kernel panics like so:

    https://paste.debian.net/plainh/fb20bd17

This happens between 20-80% of the time and requires resetting
over and over again until it boots into userspace.

...those panics during early boot, strongly suggest bad RAM, as Bruno Haible proposed. If the machine actually has OpenFirmware, you could (or so I understand) write a small RAM tester in OF Forth, feed it in at the boot monitor console, and pin the problem down to the bad module. However, I do not know the details of programming that environment, whether later SPARC machines still have those capabilities, or whether OpenFirmware can reach all memory on larger SPARC systems. (If I am reading the log and guessing correctly, though, your problem seems to be in relatively low memory, in an area the kernel allocates for its data early on, before access to higher memory is set up, if that is even an issue.)
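To illustrate the write-and-verify logic such a RAM tester would use, here is a minimal sketch in Python rather than OF Forth. It is not something that could be fed to the boot monitor, and run in userspace it would only exercise whatever pages the kernel hands out. The names `check_region` and `StuckBitMemory` are hypothetical, and the simulated stuck bit merely stands in for a faulty cell.

```python
from typing import List, Tuple

# Classic byte patterns: all zeros, all ones, and the two alternating-bit values.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def check_region(mem, patterns=PATTERNS) -> List[Tuple[int, int, int]]:
    """Write each pattern to every cell, read it back, and return
    (offset, expected, actual) for every mismatch found."""
    failures = []
    for pattern in patterns:
        # Fill the whole region first so the read-back is not just
        # returning a value that was written an instant earlier.
        for offset in range(len(mem)):
            mem[offset] = pattern
        for offset in range(len(mem)):
            actual = mem[offset]
            if actual != pattern:
                failures.append((offset, pattern, actual))
    return failures


class StuckBitMemory:
    """Hypothetical stand-in for faulty RAM: one bit of one cell always reads 0."""

    def __init__(self, size: int, bad_offset: int, bad_bit: int):
        self._cells = bytearray(size)
        self._bad_offset = bad_offset
        self._mask = 0xFF & ~(1 << bad_bit)

    def __len__(self) -> int:
        return len(self._cells)

    def __setitem__(self, offset: int, value: int) -> None:
        self._cells[offset] = value

    def __getitem__(self, offset: int) -> int:
        value = self._cells[offset]
        if offset == self._bad_offset:
            return value & self._mask  # the stuck bit is forced to 0
        return value
```

A healthy `bytearray` yields no failures, while the simulated fault is reported only by the patterns that try to set the stuck bit, which is why testers cycle through several patterns.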

Once it's up it's fine, except for this recent soft lock error.

Most of the oopses I noticed in a quick look at those logs seem to be associated with the process table (which is not an actual contiguous table in Linux), suggesting that there is bad RAM in an area where the kernel allocates its data structures. (Linux's task structures are quite large and thus have a fairly good chance of spanning a faulty RAM cell compared to smaller structures.) The panic you mentioned is the kernel detecting stack corruption, so if you can identify the physical addresses used for that kernel stack and the corresponding module(s), you should be able to pull the faulty module(s).

Can the machine operate with reduced RAM, or does it need every module currently installed to start? If it needs all the modules, you might still be able to shuffle them to move the fault away from the kernel's data area and into user areas. Then you could either write a small program to run at early boot that allocates memory until it gets the bad pages and holds onto them while releasing the rest, or use the Linux "BadRAM" feature/patch, if it is available on SPARC, to tell the kernel not to give those pages to userspace. Either should at least hold long enough for you to get more RAM modules. :-/
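The bookkeeping for such a "hold the bad pages" program could look roughly like the Python sketch below. It covers only the decision step: given entries read from /proc/self/pagemap (64-bit values where bit 63 means "present" and bits 0-54 hold the page frame number), it picks the virtual pages whose physical frames overlap a known-bad range. Allocating, touching, and locking the memory is left out, the function names and the bad ranges are hypothetical, and recent kernels only reveal frame numbers to privileged processes, so this would have to run as root.

```python
from typing import Iterable, List, Optional, Tuple

PFN_MASK = (1 << 55) - 1  # bits 0-54: page frame number
PRESENT = 1 << 63         # bit 63: page is currently present in RAM


def pfn_from_entry(entry: int) -> Optional[int]:
    """Return the page frame number from one pagemap entry,
    or None if the page is not present in RAM."""
    if not entry & PRESENT:
        return None
    return entry & PFN_MASK


def pages_to_hold(
    entries: Iterable[Tuple[int, int]],
    bad_ranges: List[Tuple[int, int]],
    page_size: int,
) -> List[int]:
    """entries: (virtual address, pagemap entry) pairs for allocated pages.
    bad_ranges: half-open (start, end) physical byte ranges known to be faulty.
    Returns the virtual addresses of pages backed by a faulty frame."""
    held = []
    for vaddr, entry in entries:
        pfn = pfn_from_entry(entry)
        if pfn is None:
            continue  # not backed by RAM right now, nothing to hold
        start = pfn * page_size
        end = start + page_size
        # Keep the page if its frame overlaps any bad physical range.
        if any(start < hi and lo < end for lo, hi in bad_ranges):
            held.append(vaddr)
    return held
```

Everything not returned by `pages_to_hold` would be released again, and the loop repeated until every bad range is covered by a held (and locked) page.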

Speculation: Whatever contains the first 1.5GB or so of RAM likely has the fault, since the kernel appears to be claiming about that much for itself during early boot ("Memory: 1547008K/133671000K available (8913K kernel code, 1456K rwdata, 2464K rodata, 672K init, 530K bss, 1355504K reserved, 0K cma-reserved)") and I am guessing that it gets allocated from the low end of the physical address space. Swapping that bank with the top bank is most likely to give the machine better chances to boot (guessing that memory is used working either low-to-high or high-to-low), but will cause user programs to see unstable memory once the kernel starts handing out the faulty pages to userspace. Swapping the first and last banks also hedges the guess that Linux takes its RAM from the low end of the address space against the possibility of Linux starting at the top. If the machine is stabilized by this, you would then be able to use the test program Bruno Haible suggested to find the exact fault. As things stand right now, if the fault is in an area that the kernel reserves for itself (as appears to be the case), I do not think a user space program would be able to find it.
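For anyone wanting to redo the arithmetic, this small Python sketch splits the quoted boot line into its fields and converts the kibibyte counts to GiB. The line is copied from the log above; the function names are hypothetical, and the labels "available" and "total" for the two numbers before the parenthesis are this sketch's reading of the format.

```python
import re

KIB_PER_GIB = 1024 * 1024

# Boot message copied from the log quoted above.
EXAMPLE_LINE = (
    "Memory: 1547008K/133671000K available (8913K kernel code, "
    "1456K rwdata, 2464K rodata, 672K init, 530K bss, "
    "1355504K reserved, 0K cma-reserved)"
)


def parse_memory_line(line: str) -> dict:
    """Return the sizes from a 'Memory: ...' boot line, in KiB, keyed by label."""
    match = re.search(r"Memory:\s*(\d+)K/(\d+)K available", line)
    if match is None:
        raise ValueError("not a 'Memory:' boot line")
    fields = {"available": int(match.group(1)), "total": int(match.group(2))}
    # The parenthesised part is a comma-separated list of '<size>K <label>'.
    inner = line[line.index("(") + 1 : line.rindex(")")]
    for item in inner.split(", "):
        size, label = item.split(" ", 1)
        fields[label] = int(size.rstrip("K"))
    return fields


def to_gib(kib: int) -> float:
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


if __name__ == "__main__":
    for label, kib in parse_memory_line(EXAMPLE_LINE).items():
        print(f"{label:>14}: {to_gib(kib):10.4f} GiB")
```

The first number works out to a little under 1.5 GiB and the "reserved" figure to about 1.3 GiB, which is where the "1.5GB or so" estimate comes from.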


-- Jacob
_______________________________________________
cfarm-users mailing list
cfarm-users@lists.tetaneutral.net
https://lists.tetaneutral.net/listinfo/cfarm-users
