Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-09 Thread Andrea Negri
runs on node aa, bound to socket 1, cores 0-2. >>>>>>>> Rank 1 runs on node bb, bound to socket 0, cores 0 and 1. >>>>>>>> Rank 2 runs on node cc, bound to cores 1 and 2. >>>>>>>> >>>>>>>> Does it mean that the process with rank 0 should be bo

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/07/2012 08:02 AM, Jeff Squyres wrote: On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on. Another thought on hardware errors... I actually have seen bad RAM cause

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa
On 09/03/2012 04:39 PM, Andrea Negri wrote: max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 s
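
The quoted ulimit output reports a max locked memory limit of 32 kbytes. As a minimal sketch, assuming a Linux node with Python available, the same limit can be read from inside a process with the standard `resource` module. The helper names `describe_memlock` and `current_memlock` are hypothetical and not part of Open MPI.

```python
import resource


def describe_memlock(soft_limit_bytes):
    """Format a locked-memory soft limit the way `ulimit -l` reports it."""
    if soft_limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports bytes; ulimit -l reports kbytes.
    return f"{soft_limit_bytes // 1024} kbytes"


def current_memlock():
    """Return the locked-memory soft limit of the calling process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_memlock(soft)


if __name__ == "__main__":
    print("max locked memory:", current_memlock())
```

The limit seen in an interactive shell may differ from the one that remotely launched processes inherit, so a check like this is probably most informative when started the same way as the MPI job.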

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote: > Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is > it always the same node that crashes? And so on. Another thought on hardware errors... I actually have seen bad RAM cause spontaneous reboots with no Linux warnings

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote: > I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but > the program doesn't crash; a node goes down and the rest of them keep > waiting for a signal (there is an ALLREDUCE in the code). > > Anyway, yesterday some processes died (without

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Andrea Negri
gt;>>> wonder if there's some kind of bad interaction going on here... >>>>> >>>>> The transparent hugepage is "transparent", which means it is >>>>> automatically applied to all applications unless it is explicitly told >>&g

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-05 Thread George Bosilca
Andrea, As suggested by the previous answers, I guess the size of your problem is too large for the memory available on the nodes. I can run ZeusMP without any issues up to 64 processes, both over Ethernet and Infiniband. I tried the 1.6 and the current trunk, and both perform as expected. Wha

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-05 Thread Andrea Negri
ls >>> /var/log/messages return >>> acpid cron.4 messages.3 secure.4 >>> anaconda.log cups messages.4 spooler >>> anaconda.syslog dmesg mpi_uninstall.log spooler.1 >>>

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-04 Thread David Warren
ohn Hearns) 2. Re: users Digest, Vol 2339, Issue 5 (Andrea Negri) -- Message: 1 Date: Sat, 1 Sep 2012 08:48:56 +0100 From: John Hearns Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
-- >> There are no nodes allocated to this job. >> ------------------ >> >> all the time. >> >> == >> >> I configured with: >> >> ./config

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
-- > > all the time. > > == > > I configured with: > > ./configure --prefix=$HOME/local/... --enable-static --disable-shared > --with-sge > > and adjusted my PATHs accordingly (at least: I hope so). > > -- Reuti > > > -- > > Messag

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Ralph Castain
339, Issue 5 (Andrea Negri) >> >> >> ---------- >> >> Message: 1 >> Date: Sat, 1 Sep 2012 08:48:56 +0100 >> From: John Hearns >> Subject: Re: [OMPI users] some mpi processes "disap

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-03 Thread Andrea Negri
------------- > > Message: 1 > Date: Sat, 1 Sep 2012 08:48:56 +0100 > From: John Hearns > Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster > of servers > To: Open MPI Users > Message-ID: <

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-01 Thread Andrea Negri
I have tried to run with a single process (i.e. the entire grid is contained in one process) and the command free -m on the compute node returns total used free shared buffers cached Mem: 3913 1540 2372 0 49 1234 -

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-01 Thread John Hearns
Apologies, I have not taken the time to read your comprehensive diagnostics! As Gus says, this sounds like a memory problem. My suspicion would be the kernel Out Of Memory (OOM) killer. Log into those nodes (or ask your systems manager to do this). Look closely at /var/log/messages where there wil
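
The log check suggested here can be sketched in Python. This is only an illustration: it assumes the kernel messages land in /var/log/messages as plain text and that OOM-killer reports contain the usual "invoked oom-killer" or "Out of memory: Kill..." wording. The names `find_oom_lines` and `scan_log` are hypothetical.

```python
import re

# Phrases the Linux kernel typically logs when the OOM killer acts.
OOM_PATTERN = re.compile(r"invoked oom-killer|out of memory: kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like kernel OOM-killer reports."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan one log file; reading it usually requires root privileges."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)


if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

Rotated files such as messages.3 or messages.4 can be passed to `scan_log` in the same way if the event happened some days ago.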

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Gus Correa
Hi Andrea, I would guess this is a memory problem. Do you know how much memory each node has? Do you know the memory that each MPI process in the CFD code requires? If the program starts swapping/paging to disk, because of low memory, those interesting things that you described can happen. I wo
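
The two questions above amount to a simple budget check per node. A rough sketch, where the 3913 MB total is taken from the free -m output posted elsewhere in this thread, while the per-process figures and the `reserve_mb` allowance for the operating system are purely hypothetical:

```python
def fits_in_memory(node_mb, per_process_mb, processes_per_node, reserve_mb=512):
    """Return True if all MPI processes on one node fit in physical RAM.

    reserve_mb is an assumed allowance for the OS and file cache.
    """
    needed_mb = per_process_mb * processes_per_node
    return needed_mb <= node_mb - reserve_mb


if __name__ == "__main__":
    # Two processes per dual-core node, hypothetical footprints in MB.
    for footprint in (1500, 1800):
        verdict = "fits" if fits_in_memory(3913, footprint, 2) else "will likely swap"
        print(f"{footprint} MB per process: {verdict}")
```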

[OMPI users] some mpi processes "disappear" on a cluster of servers

2012-08-31 Thread Andrea Negri
Hi, I have been in trouble for a year. I run a pure MPI (no OpenMP) Fortran fluid dynamical code on a cluster of servers, and I see strange behaviour when running the code on multiple nodes. The cluster consists of 16 PCs (1 PC is a node) with a dual core processor. Basically, I'm able to run th