On 09/07/2012 08:02 AM, Jeff Squyres wrote:
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it
always the same node that crashes? And so on.
Another thought on hardware errors... I actually have seen bad RAM cause
spontaneous reboots with no Linux warnings.
Do you have any hardware diagnostics from your server
vendor that you can run?
If you don't have a vendor-provided diagnostic tool,
you or your sys admin could try Advanced Clustering "breakin":
http://www.advancedclustering.com/our-software/view-category.html
Download the ISO version, burn it to a CD, put it in the node's CD drive
(assuming it has one), reboot, and choose breakin from the menu options.
If there is no CD drive, there is an alternative using network boot,
although it is more involved.
I hope it helps,
Gus Correa
A simple way to test your RAM (it's not completely comprehensive, but it does
check for a surprisingly wide array of memory issues) is to do something like
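this (a small C program):
-----
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t i, count, size, increment, errors = 0;
    void *base;
    volatile int *ptr;  /* volatile: keep the compiler from dropping the readback */

    increment = (size_t)1 << 30;  /* 1GB */
    size = increment;
    // Find the biggest amount of memory that you can malloc
    while (increment >= 1024) {
        base = malloc(size);
        if (NULL != base) {
            free(base);
            size += increment;
        } else {
            size -= (size >= increment) ? increment : size;  /* never go below 0 */
            increment /= 2;
        }
    }
    printf("I can malloc %zu bytes\n", size);
    // Malloc that huge chunk of memory
    base = malloc(size);
    if (NULL == base) {
        printf("Could not malloc the final block\n");
        return 1;
    }
    ptr = base;
    count = size / sizeof(int);
    for (i = 0; i < count; ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error at int %zu!\n", i);
            ++errors;
        }
    }
    free(base);
    printf("All done\n");
    return (errors > 0) ? 1 : 0;
}
-----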
Depending on how much memory you have,
that might take a little while to run
(all the memory has to be paged in, etc.).
You might want to add a status output to show progress,
and/or write/read a page at a time for better efficiency, etc.
But you get the idea.
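
For what it's worth, here is a rough sketch of those two refinements
(status output, and writing/reading one chunk at a time). The function
name check_memory, its parameters, and the 0xA5A5A5A5 test pattern are
made up for illustration; none of them come from a real tool. It is
still only a quick sanity check, not a replacement for a proper memory
diagnostic.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (name and parameters are assumptions).
 * Mallocs `bytes` bytes, then writes and reads them back one chunk of
 * `page_bytes` at a time, printing progress about every `report_bytes`
 * (0 disables the progress output).
 * Returns the number of mismatched words, or (size_t)-1 if the arguments
 * are unusable or malloc fails. */
size_t check_memory(size_t bytes, size_t page_bytes, size_t report_bytes)
{
    const unsigned int pattern = 0xA5A5A5A5u;  /* arbitrary test pattern */
    size_t words_per_page = page_bytes / sizeof(unsigned int);
    size_t total_words = bytes / sizeof(unsigned int);
    size_t errors = 0, start, end, i, done;
    size_t next_report = report_bytes;
    volatile unsigned int *mem;  /* volatile so the readback is really done */
    void *base;

    if (0 == words_per_page) {
        return (size_t)-1;
    }
    base = malloc(bytes);
    if (NULL == base) {
        return (size_t)-1;
    }
    mem = base;

    for (start = 0; start < total_words; start += words_per_page) {
        end = start + words_per_page;
        if (end > total_words) {
            end = total_words;
        }
        /* Write the whole chunk first... */
        for (i = start; i < end; ++i) {
            mem[i] = pattern;
        }
        /* ...then read the whole chunk back */
        for (i = start; i < end; ++i) {
            if (mem[i] != pattern) {
                printf("Readback error at word %zu\n", i);
                ++errors;
            }
        }
        /* Status output so you can see it is still making progress */
        done = end * sizeof(unsigned int);
        if (report_bytes > 0 && done >= next_report) {
            printf("Checked %zu of %zu bytes\n", done, bytes);
            next_report = done + report_bytes;
        }
    }
    free(base);
    return errors;
}
```

You would call it from main() with the size found by the malloc search,
e.g. check_memory(size, 4096, (size_t)1 << 30) for 4KB chunks and a
status line roughly every 1GB. The 4096 is just a typical page size, not
something queried from the system.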
Hope that helps.