Dear all,
First, I would like some advice on how to determine the maximum number
of MPI processes that can run on a single node when oversubscribing.
When I try to run an application with 4096 MPI processes on a 24-core
node with 48 GB of memory, I get the error "Unknown error: 1", even
though less than half of the memory is in use. The same application
runs with 2048 MPI processes in less than one minute. I have checked
the Linux settings for the maximum number of processes, and the limit
is much larger than 4096.
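
To make this question concrete, here is a minimal sketch (Python,
standard library only) of how the per-user and system-wide limits on a
Linux node could be inspected. The function names are my own, and it is
only an assumption that one of these limits, for example the open-file
limit rather than the process limit, is what 4096 processes run into.

```python
import resource


def describe_limits():
    """Return (soft, hard) values of per-user limits that may cap process count."""
    names = {
        "max user processes": resource.RLIMIT_NPROC,
        "max open files": resource.RLIMIT_NOFILE,
    }
    limits = {}
    for label, res in names.items():
        # getrlimit returns a (soft, hard) tuple; RLIM_INFINITY means unlimited.
        limits[label] = resource.getrlimit(res)
    return limits


def read_proc_value(path):
    """Read a single integer from a /proc file, or None if it is unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for label, (soft, hard) in describe_limits().items():
        print(f"{label}: soft={soft} hard={hard}")
    # System-wide ceilings on Linux; these files may be absent elsewhere.
    for path in ("/proc/sys/kernel/pid_max", "/proc/sys/kernel/threads-max"):
        print(f"{path}: {read_proc_value(path)}")
```

Running it inside the same batch environment as the MPI job would
matter, since limits in a login shell can differ from those of the job.
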
A second, more general question concerns finding nodes with faulty
memory. Is there any way to identify such nodes? I discovered by
accident that one node could not run an MPI application once it used
more than 12 GB of RAM, while a second node with exactly the same
hardware could use all 48 GB. With 500+ nodes it is difficult to check
all of them, and I am not aware of any efficient solution. I first
thought of memtester, but it takes a lot of time. I know this is not
exactly on topic for this mailing list, but I thought that an OpenMPI
user might know something about it.
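
To illustrate the kind of quick check I have in mind, here is a rough
sketch in Python of a user-space pattern test that allocates a given
amount of memory, writes a few bit patterns, and verifies them. The
function name, chunk size, and patterns are arbitrary choices of mine.
It is far weaker than memtester: it cannot target physical addresses,
and whether it would catch the failure above 12 GB is unknown.

```python
def quick_memory_check(total_mib, chunk_mib=64, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write and verify byte patterns over roughly total_mib of memory.

    Returns None if every chunk verified, otherwise the index of the
    first chunk whose contents did not match the pattern written to it.
    """
    chunk_bytes = chunk_mib * 1024 * 1024
    n_chunks = max(1, total_mib // chunk_mib)
    # Hold all chunks at once so that distinct pages are touched.
    chunks = [bytearray(chunk_bytes) for _ in range(n_chunks)]
    for pattern in patterns:
        fill = bytes([pattern]) * chunk_bytes
        for index, chunk in enumerate(chunks):
            chunk[:] = fill  # overwrite the whole chunk in place
            if chunk.count(pattern) != chunk_bytes:
                return index
    return None


if __name__ == "__main__":
    # Example: test about 1 GiB; a real run would use most of the node's RAM.
    bad = quick_memory_check(1024)
    print("OK" if bad is None else f"mismatch in chunk {bad}")
```

Something like this could be launched with one process per node, but a
node that simply kills the process when memory runs out would show up
as a crash rather than as a reported mismatch.
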
Best regards,
George Markomanolis