Dear Jeff,
Of course, I was thinking of running memtester on every node at the same
time and gathering the outputs. However, running memtester on a node with
48GB of memory takes a long time (more than 1-2 hours; I don't remember
exactly, and possibly longer, because I cancelled the run), and I have to
consume resources just for testing. So I was curious whether you know of a
tool or procedure that works much faster. Of course, filling the memory
with an application also works, but I don't know how reliable that is.
Best regards,
George Markomanolis
On 11/26/2012 06:09 PM, Jeff Squyres wrote:
On Nov 26, 2012, at 4:02 AM, George Markomanolis wrote:
Another, more generic question is about discovering nodes with faulty memory.
Is there any way to identify nodes with faulty memory? I found by accident
that one node could not execute an MPI application when it was using more than
12GB of RAM, while a second node with exactly the same hardware could use all
48GB of memory. With 500+ nodes it is difficult to check all of them, and
I am not familiar with any efficient solution. Initially I thought about
memtester, but it takes a lot of time. I know this is not exactly on topic
for this mailing list, but I thought that maybe an OpenMPI user knows
something about it.
You really do want something like a memory tester. MPI applications *might*
beat on your memory to identify errors, but that's really just a side effect of
HPC access patterns. You really want a dedicated memory tester.
If such a memory tester takes a long time, you might want to use mpirun to
launch it on multiple nodes simultaneously to save some time...?
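
For what it's worth, the "launch it on multiple nodes simultaneously" idea
could be sketched roughly as below. This is only a sketch under assumptions,
not a tested procedure: it assumes Open MPI's mpirun with its --hostfile and
-npernode options, memtester installed at the same path on every node, and a
hostfile name (hosts.txt) and test size (40G, to leave headroom on a 48GB
node) that are made up for illustration. The helper names are hypothetical.
Each node prints its hostname and memtester's exit status (0 means no errors
were reported), so the whole cluster takes about as long as one node.

```python
import subprocess


def build_command(hostfile, mem="40G", loops=1):
    """Build an mpirun argv that starts one memtester per node.

    Assumption: Open MPI's mpirun (--hostfile, -npernode). The remote shell
    snippet discards memtester's verbose output and prints one summary line
    per node: "<hostname> <exit status>".
    """
    remote = f'memtester {mem} {loops} >/dev/null 2>&1; echo "$(hostname) $?"'
    return ["mpirun", "--hostfile", hostfile, "-npernode", "1",
            "sh", "-c", remote]


def parse_results(output):
    """Turn "<hostname> <status>" lines into a {hostname: status} dict."""
    results = {}
    for line in output.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "<hostname> <integer status>".
        if len(parts) == 2 and parts[1].isdigit():
            results[parts[0]] = int(parts[1])
    return results


def failing_nodes(results):
    """Return the sorted hostnames whose memtester exit status was nonzero."""
    return sorted(host for host, status in results.items() if status != 0)


def run_cluster_test(hostfile, mem="40G", loops=1):
    """Run memtester on all nodes at once and list the nodes that failed."""
    proc = subprocess.run(build_command(hostfile, mem, loops),
                          capture_output=True, text=True)
    return failing_nodes(parse_results(proc.stdout))
```

Calling run_cluster_test("hosts.txt") would then return the list of suspect
nodes. Note that memtester normally needs root (or a raised memlock limit) to
lock the memory it tests, and a nonzero status can also mean it simply failed
to allocate or lock memory, so a flagged node still deserves a closer look.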