On Nov 26, 2012, at 4:02 AM, George Markomanolis wrote:

> Another, more general question is about discovering nodes with faulty memory.
> Is there any way to identify nodes with faulty memory? I found by accident
> that one node couldn't run an MPI application once it used more than 12GB
> of RAM, while a second node with exactly the same hardware could use all
> 48GB of memory. With 500+ nodes it is difficult to check all of them,
> and I am not familiar with any efficient solution. Initially I thought about
> memtester, but it takes a lot of time. I know this is not exactly on topic
> for this mailing list, but I thought that maybe an Open MPI user knows
> something about it.

You really do want a dedicated memory tester.  MPI applications *might*
beat on your memory hard enough to expose errors, but that's really just a
side effect of HPC access patterns, not something you can rely on.

If such a memory tester takes a long time, you might want to use mpirun to 
launch it on multiple nodes simultaneously to save some time...?
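
Here is a rough sketch of what I mean, written as a small Python wrapper.
It assumes memtester is installed on every node, that you have a hostfile
with one host per line, and that your Open MPI supports -npernode; the
function names, the hostfile path, and the 40G test size are made up for
illustration and are not from your setup.  mpirun can launch non-MPI
executables, so memtester itself needs no changes.

```python
import subprocess


def read_hosts(hostfile_text):
    """Return unique host names from hostfile text, in order.

    Blank lines and '#' comments are skipped; anything after the host
    name (e.g. "slots=8") is ignored.
    """
    hosts = []
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        host = line.split()[0]
        if host not in hosts:
            hosts.append(host)
    return hosts


def build_command(hostfile, n_nodes, mem="40G", loops=1):
    """Build an mpirun command that starts one memtester per node.

    mem and loops are passed straight to memtester; 40G is an assumed
    value that leaves some headroom on a 48GB node.
    """
    return [
        "mpirun",
        "--hostfile", hostfile,
        "-np", str(n_nodes),
        "-npernode", "1",
        "memtester", mem, str(loops),
    ]


def run_memtester(hostfile, mem="40G", loops=1):
    """Launch the testers on all hosts and return mpirun's exit status."""
    with open(hostfile) as f:
        hosts = read_hosts(f.read())
    cmd = build_command(hostfile, len(hosts), mem, loops)
    return subprocess.call(cmd)
```

You would call something like run_memtester("hosts.txt") from a login node.
memtester's output doesn't say which host it ran on, and depending on your
Open MPI version a failing process may cause mpirun to terminate the rest of
the job, so in practice you'd probably want to launch a small script that
prints the hostname, runs memtester, and records the result per node.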

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

