Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
Yevgeny, The ibstat results: CA 'mthca0'     CA type: MT25208 (MT23108 compat mode)     Number of ports: 2     Firmware version: 4.7.600     Hardware version: a0     Node GUID: 0x0005ad0c21e0     System image GUID: 0x0005ad000100d050     Port 1:     State

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
One system is actually an i5-2400 - maybe it's throttling back on 2 cores to save power? The other (i7) shows consistent CPU MHz on all cores. From: Yevgeny Kliteynik To: Randolph Pullen ; OpenMPI Users Sent: Thursday, 6 September 2012 6:03 PM Subject: Re: [OMP

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Andrea Negri
George, I have done some modifications to the code; however, this is the first part of my zmp_list: !ZEUSMP2 CONFIGURATION FILE &GEOMCONF LGEOM = 2, LDIMEN = 2 / &PHYSCONF LRAD = 0, XHYDRO = .TRUE., XFORCE = .TRUE., XMHD

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote: > I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but > the program doesn't crash; a node goes down and the rest of them keep > waiting for a signal (there is an ALLREDUCE in the code). > > Anyway, yesterday some processes died (without

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: > Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is > it always the same node that crashes? And so on. Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux warnings

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Siegmar Gross
Hi, are the following outputs helpful to find the error with a rankfile on Solaris? I wrapped long lines so that they are easier to read. Have you had time to look at the segmentation fault with a rankfile which I reported in my last email (see below)? "tyr" is a two processor single core machine

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Ralph Castain
On Sep 7, 2012, at 5:41 AM, Siegmar Gross wrote: > Hi, > > are the following outputs helpful to find the error with > a rankfile on Solaris? If you can't bind on the new Solaris machine, then the rankfile won't do you any good. It looks like we are getting the incorrect number of cores on th

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/03/2012 04:39 PM, Andrea Negri wrote: max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 s

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/07/2012 08:02 AM, Jeff Squyres wrote: On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on. Another thought on hardware errors... I actually have seen bad RAM cause