On Thu, Jul 08, 2010 at 11:04:09AM -0700, Avneesh Pant wrote:
> Anton,
> On the node where you saw the failure (u02n065),
> can you verify what the max locked memory limit
> is set to? In a bash shell you can do this with
> ulimit -l. It should be set to at least 128K.
> Also, please verify that the available memory on
> the node (/proc/meminfo shows this) is sufficient,
> as it may be possible that some zombie
> processes on that node are consuming memory.

Avneesh, many thanks

bigblue3> ssh u02n065
Last login: Fri Jul  9 12:24:17 2010 from bigblue3.cvos.cluster
u02n065> bash -
bash-3.2$ ulimit -l
unlimited
bash-3.2$ 
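
For reference, the two checks suggested above (locked memory limit and
free memory) can be scripted roughly as below. This is only a minimal
sketch: the function name check_node and the variable min_kb are my own
illustration, not part of any tool, and 128 is simply the "at least 128K"
figure quoted above.

```shell
# Hypothetical helper: report the max locked memory limit and the free
# memory of the node it runs on.
min_kb=128    # assumed minimum, taken from the "at least 128K" advice

check_node() {
    local limit free_kb node
    node=$(uname -n)
    limit=$(ulimit -l)    # prints "unlimited" or a number
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    echo "$node: max locked memory = $limit, MemFree = $free_kb kB"
    # Warn only when the limit is numeric and below the assumed minimum.
    if [ "$limit" != "unlimited" ] && [ "$limit" -lt "$min_kb" ]; then
        echo "$node: warning: locked memory limit is below $min_kb"
    fi
}

check_node
```

Running the same checks from inside the batch job on each allocated node
may be more telling than an interactive login, since the limits a job
inherits could differ from those of an ssh session. I have not verified
that here.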


This seems to be an intermittent failure.
I ran this test on 8 nodes once and got:

bigblue3> cat z.sh.o335046 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/cvos/local/apps/torque/current/spool/aux//335046.bluequeue1.cvos.cluster
u02n077.cvos.cluster
u02n072.cvos.cluster
u02n074.cvos.cluster
u02n091.cvos.cluster
u03n061.cvos.cluster
u01n003.cvos.cluster
u01n057.cvos.cluster
u01n080.cvos.cluster
Warning: Permanently added 'u01n003,10.141.1.3' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n057,10.141.1.57' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n072,10.141.2.72' (RSA) to the list of known hosts.
Warning: Permanently added 'u03n061,10.141.3.61' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n080,10.141.1.80' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n074,10.141.2.74' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n091,10.141.2.91' (RSA) to the list of known hosts.
u01n003:5.ipath_userinit: userinit command failed: Cannot allocate memory
u01n003:5.Driver initialization failure on /dev/ipath
MPIRUN.u02n077: 7 ranks have not yet exited 60 seconds after rank 5 (node u01n003) exited without reaching MPI_Finalize().
MPIRUN.u02n077: Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 7 node processes

real    1m15.435s
user    0m0.061s
sys     0m0.151s
Warning: Permanently added 'u02n077.cvos.cluster,10.141.2.77' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n072.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n074.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u02n091.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u03n061.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n003.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n057.cvos.cluster' (RSA) to the list of known hosts.
Warning: Permanently added 'u01n080.cvos.cluster' (RSA) to the list of known hosts.
bigblue3> 


I ran it again a few minutes later and it worked OK:


bigblue3> cat z.sh.o335165 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
/cvos/local/apps/torque/current/spool/aux//335165.bluequeue1.cvos.cluster
u02n072.cvos.cluster
u02n077.cvos.cluster
u02n091.cvos.cluster
u03n061.cvos.cluster
u01n003.cvos.cluster
u02n074.cvos.cluster
u01n057.cvos.cluster
u01n080.cvos.cluster
Warning: Permanently added 'u02n077' (RSA) to the list of known hosts.
 Number of tasks=           8  My rank=           0
 Number of tasks=           8  My rank=           7
 Number of tasks=           8  My rank=           1
 Number of tasks=           8  My rank=           3
 Number of tasks=           8  My rank=           5
 Number of tasks=           8  My rank=           6
 Number of tasks=           8  My rank=           2
 Number of tasks=           8  My rank=           4

real    0m1.590s
user    0m0.070s
sys     0m0.182s
bigblue3> 

I'll ask my sysadmin about this.

As I'm just starting with MPI, I was worried
that I had messed up something in my MPI program.
That seems OK now.

Many thanks for your help.
anton




> 
> Avneesh
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Anton Shterenlikht
> Sent: Thursday, July 08, 2010 9:07 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] ipath_userinit: userinit command failed: Cannot allocate memory
> 
> I'm trying to use MPI with Fortran on Linux 2.6.18-164.6.1.el5 x86_64. I
> compiled this trivial code with mpif90:
> 
>       program simple
>       implicit none
>       include 'mpif.h'
> 
>       integer numtasks, rank, ierr, rc
> 
> !     Error code passed to MPI_ABORT if initialisation fails
>       rc = 1
> 
>       call MPI_INIT(ierr)
>       if (ierr .ne. MPI_SUCCESS) then
>          print *,'Error starting MPI program. Terminating.'
>          call MPI_ABORT(MPI_COMM_WORLD, rc, ierr)
>       end if
> 
> !     Query this process's rank and the total number of processes
>       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>       call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
>       print *, 'Number of tasks=',numtasks,' My rank=',rank
> 
> !    ****** do some work ******
> 
>       call MPI_FINALIZE(ierr)
> 
>       end
> 
> I run it with mpirun.
> 
> When I use 2 cpus or less, all is fine.
> 
> When I try to specify more than 2 cpus I get this error:
> 
> u02n065:0.ipath_userinit: userinit command failed: Cannot allocate memory 
> u02n065:0.Driver initialization failure on /dev/ipath
> 
> where u02n065 is the node name.
> 
> Please advise
> 
> many thanks
> anton
> 
> 
> --
> Anton Shterenlikht
> Room 2.6, Queen's Building
> Mech Eng Dept
> Bristol University
> University Walk, Bristol BS8 1TR, UK
> Tel: +44 (0)117 331 5944
> Fax: +44 (0)117 929 4423
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

-- 
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423
