On Nov 26, 2012, at 4:02 AM, George Markomanolis wrote:

> Another more generic question, is about discovering nodes with faulty memory.
> Is there any way to identify nodes with faulty memory? I found accidentally
> that a node with exact the same hardware couldn't execute an MPI application
> when it was using more than 12GB of ram while the second one could use all of
> the 48GB of memory. If I have 500+ nodes is difficult to check all of them
> and I am not familiar with any efficient solution. Initially I thought about
> memtester but it takes a lot of time. I know that this does not apply exactly
> on this mailing list but I thought that maybe an OpenMPI user knows something
> about.

You really do want something like a memory tester. MPI applications *might* beat on your memory to identify errors, but that's really just a side effect of HPC access patterns. You really want a dedicated memory tester.

If such a memory tester takes a long time, you might want to use mpirun to launch it on multiple nodes simultaneously to save some time...?

-- 
Jeff Squyres
jsquy...@cisco.com
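
To make the mpirun suggestion concrete, here is a minimal sketch (not something from the thread itself) that builds an mpirun command line running one memtester per node. The host names, the 40G size, and the helper name build_memtest_command are all hypothetical. It assumes memtester is installed on every node and is allowed to lock that much memory (typically root or a raised memlock limit), and that listing each host once with -np equal to the host count places one process per host, which is worth verifying for your Open MPI version. Each node prints its hostname and memtester's exit code; the wrapper ends with echo so that a failing node should not cause mpirun to tear down the whole job.

```python
import shlex


def build_memtest_command(hosts, mem="40G", loops=1):
    """Return an mpirun argv that runs one memtester per listed host.

    hosts, mem and loops are illustrative parameters, not values from the thread.
    """
    if not hosts:
        raise ValueError("need at least one host")
    # Per-node shell snippet: run memtester, save its exit code, then report
    # "<hostname> memtester_exit=<code>". The final echo makes the wrapper
    # itself exit 0 even when memtester reports a failure.
    per_node = (
        f"memtester {shlex.quote(mem)} {int(loops)} >/dev/null 2>&1; rc=$?; "
        'echo "$(hostname) memtester_exit=$rc"'
    )
    return [
        "mpirun",
        "--host", ",".join(hosts),  # each host listed exactly once
        "-np", str(len(hosts)),     # as many processes as hosts
        "sh", "-c", per_node,
    ]


if __name__ == "__main__":
    hosts = ["node001", "node002", "node003"]  # hypothetical host names
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_memtest_command(hosts)))
```

A nonzero memtester_exit value in the output would point at the node to pull for a closer look. Choose the test size somewhat below the installed RAM so the OS still has room to run.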