On Thu, Jul 08, 2010 at 11:04:09AM -0700, Avneesh Pant wrote:
> Anton,
> On the node that you saw the failure (u02n065)
> can you verify what the max locked memory limit
> is set to? In a bash shell you can do this with
> ulimit -l. It should be set to at least 128K.
> Also please verify that the available memory on
> the node (/proc/meminfo shows this) is sufficient
> as it may be possible that some zombie
> processes on that node are consuming memory.

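The checks suggested above could be bundled into one small script and run on the suspect node. This is only a sketch: the 128 KB threshold is taken from the advice above, while the function name memlock_ok and the choice of /proc/meminfo fields are illustrative assumptions, not anything prescribed by Open MPI or the ipath driver.

```shell
#!/bin/bash
# Sketch: report the locked-memory limit, memory figures and zombie
# processes on the current node. Names and layout are illustrative.

min_kb=128   # minimum locked-memory limit in KB, per the advice above

# Succeeds if a `ulimit -l` value (a number in KB, or "unlimited")
# is at least min_kb. Hypothetical helper, not part of any MPI tool.
memlock_ok() {
    case "$1" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;          # empty or non-numeric
        *) [ "$1" -ge "$min_kb" ] ;;
    esac
}

limit=$(ulimit -l)
if memlock_ok "$limit"; then
    echo "locked memory limit OK: $limit"
else
    echo "locked memory limit too low: $limit (want >= ${min_kb} KB)"
fi

# Total and free memory as reported by the kernel (Linux only).
if [ -r /proc/meminfo ]; then
    grep -E '^(MemTotal|MemFree):' /proc/meminfo
fi

# List zombie processes (state starting with Z), if ps is available.
if command -v ps >/dev/null 2>&1; then
    ps -eo stat,pid,comm | awk '$1 ~ /^Z/'
fi
```

Note that the limit seen in an interactive ssh session may differ from the one a batch job inherits, so it may be worth running the script from inside a job as well.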
Avneesh, many thanks

bigblue3> ssh u02n065
Last login: Fri Jul 9 12:24:17 2010 from bigblue3.cvos.cluster
u02n065> bash -
bash-3.2$ ulimit -l
unlimited
bash-3.2$

This seems to be an intermittent failure.
I ran this test on 8 nodes once and got:

bigblue3> cat z.sh.o335046
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/cvos/local/apps/torque/current/spool/aux//335046.bluequeue1.cvos.cluster
u02n077.cvos.cluster
u02n072.cvos.cluster
u02n074.cvos.cluster
u02n091.cvos.cluster
u03n061.cvos.cluster
u01n003.cvos.cluster
u01n057.cvos.cluster
u01n080.cvos.cluster
Warning: Permanently added 'u01n003,10.141.1.3' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n057,10.141.1.57' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n072,10.141.2.72' (RSA) to the list of known hosts.
Warning: Permanently added 'u03n061,10.141.3.61' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n080,10.141.1.80' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n074,10.141.2.74' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n091,10.141.2.91' (RSA) to the list of known hosts.
u01n003:5.ipath_userinit: userinit command failed: Cannot allocate memory
u01n003:5.Driver initialization failure on /dev/ipath
MPIRUN.u02n077: 7 ranks have not yet exited 60 seconds after rank 5 (node u01n003) exited without reaching MPI_Finalize().
MPIRUN.u02n077: Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 7 node processes

real 1m15.435s
user 0m0.061s
sys 0m0.151s
Warning: Permanently added 'u02n077.cvos.cluster,10.141.2.77' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n072.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n074.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n091.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u03n061.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n003.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n057.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n080.cvos.cluster' (RSA) to the list of known hosts.
bigblue3>

I ran it again a few minutes later and it worked ok:

bigblue3> cat z.sh.o335165
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/cvos/local/apps/torque/current/spool/aux//335165.bluequeue1.cvos.cluster
u02n072.cvos.cluster
u02n077.cvos.cluster
u02n091.cvos.cluster
u03n061.cvos.cluster
u01n003.cvos.cluster
u02n074.cvos.cluster
u01n057.cvos.cluster
u01n080.cvos.cluster
Warning: Permanently added 'u02n077' (RSA) to the list of known hosts.
Number of tasks= 8 My rank= 0
Number of tasks= 8 My rank= 7
Number of tasks= 8 My rank= 1
Number of tasks= 8 My rank= 3
Number of tasks= 8 My rank= 5
Number of tasks= 8 My rank= 6
Number of tasks= 8 My rank= 2
Number of tasks= 8 My rank= 4

real 0m1.590s
user 0m0.070s
sys 0m0.182s
bigblue3>

I'll ask my sysadmin about this.
As I'm just starting with MPI, I was worried I had messed up
something in my MPI program. This seems ok now.

Many thanks for your help.
anton

>
> Avneesh
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Anton Shterenlikht
> Sent: Thursday, July 08, 2010 9:07 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] ipath_userinit: userinit command failed: Cannot
> allocate memory
>
> I'm trying to use MPI with fortran on Linux 2.6.18-164.6.1.el5 x86_64 I
> compiled this trivial code with mpif90:
>
> program simple
> include 'mpif.h'
>
> integer numtasks, rank, ierr, rc
>
> rc=1
>
> call MPI_INIT(ierr)
> if (ierr .ne. 0) then
>   print *,'Error starting MPI program. Terminating.'
>   call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
> end if
>
> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
> call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
> print *, 'Number of tasks=',numtasks,' My rank=',rank
>
> ! ****** do some work ******
>
> call MPI_FINALIZE(ierr)
>
> end
>
> I run it with mpirun.
>
> When I use 2 cpus or less, all is fine.
>
> When I try to specify more than 2 cpus I get this error:
>
> u02n065:0.ipath_userinit: userinit command failed: Cannot allocate memory
> u02n065:0.Driver initialization failure on /dev/ipath
>
> where u02n065 is the node name.
>
> Please advise
>
> many thanks
> anton
>
>
> --
> Anton Shterenlikht
> Room 2.6, Queen's Building
> Mech Eng Dept
> Bristol University
> University Walk, Bristol BS8 1TR, UK
> Tel: +44 (0)117 331 5944
> Fax: +44 (0)117 929 4423
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423