Seems like something is going wrong with processor binding. Can you run with -mca plm_base_verbose 100? It might shed some light on why it thinks there are not enough slots.
-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
> Nathan,
>
> I have now removed the strip_prefix stuff, which was applied to the other
> versions of OpenMPI.
> I still have the same problem with the msubrun command.
>
> knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
> qsub: waiting for job 7754058.sdb to start
> qsub: job 7754058.sdb ready
>
> knteran@mzlogin01:~> cd test-openmpi/
> knteran@mzlogin01:~/test-openmpi> !mp
> mpicc cpi.c -o cpi
> knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   ./cpi
>
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
>
> I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.
> knteran@mzlogin01:~/test-openmpi> which mpirun
> /home/knteran/openmpi/bin/mpirun
>
> Thanks,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
> On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>
> >Weird. That is the same configuration we have deployed on Cielito and
> >Cielo. Does it work under an msub allocation?
> >
> >BTW, with that configuration you should not set
> >plm_base_strip_prefix_from_node_names to 0. That will confuse orte since
> >the node hostname will not match what was supplied by alps.
> >
> >-Nathan
> >
> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
> >> Nathan,
> >>
> >> (Please forget about the segfault. It was my mistake).
> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program. I used
> >> contrib/platform/lanl/cray_xe6/optimized_lustre and
> >> --enable-mpirun-prefix-by-default for configuration. As I said, it works
> >> fine with aprun, but fails with mpirun/mpiexec.
> >>
> >> knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
> >> --------------------------------------------------------------------------
> >> There are not enough slots available in the system to satisfy the 4 slots
> >> that were requested by the application:
> >>   ./a.out
> >>
> >> Either request fewer slots for your application, or make more slots
> >> available for use.
> >> --------------------------------------------------------------------------
> >>
> >> Thanks,
> >>
> >> -----------------------------------------------------------------------------
> >> Keita Teranishi
> >> Principal Member of Technical Staff
> >> Scalable Modeling and Analysis Systems
> >> Sandia National Laboratories
> >> Livermore, CA 94551
> >> +1 (925) 294-3738
> >>
> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> >>
> >> >Ok, that should have worked. I just double-checked it to be sure.
> >> >
> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
> >> >
> >> >How did you configure Open MPI and what version are you using?
> >> >
> >> >-Nathan
> >> >
> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> >> >> Hi Nathan,
> >> >>
> >> >> I tried the qsub option you suggested:
> >> >>
> >> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> >> >> --------------------------------------------------------------------------
> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> that were requested by the application:
> >> >>   ./cpi
> >> >>
> >> >> Either request fewer slots for your application, or make more slots
> >> >> available for use.
> >> >> --------------------------------------------------------------------------
> >> >>
> >> >> Here is what I got from aprun:
> >> >> aprun -n 32 ./cpi
> >> >> Process 8 of 32 is on nid00011
> >> >> Process 5 of 32 is on nid00011
> >> >> Process 12 of 32 is on nid00011
> >> >> Process 9 of 32 is on nid00011
> >> >> Process 11 of 32 is on nid00011
> >> >> Process 13 of 32 is on nid00011
> >> >> Process 0 of 32 is on nid00011
> >> >> Process 6 of 32 is on nid00011
> >> >> Process 3 of 32 is on nid00011
> >> >> :
> >> >> :
> >> >>
> >> >> Also, I found a strange error at the end of the program (MPI_Finalize?).
> >> >> Can you tell me what is wrong with that?
> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
> >> >>
> >> >> Thanks,
> >> >>
> >> >> -----------------------------------------------------------------------------
> >> >> Keita Teranishi
> >> >> Principal Member of Technical Staff
> >> >> Scalable Modeling and Analysis Systems
> >> >> Sandia National Laboratories
> >> >> Livermore, CA 94551
> >> >> +1 (925) 294-3738
> >> >>
> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> >> >>
> >> >> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >> >> >broken. You can continue to use qsub (though qsub use is strongly
> >> >> >discouraged) if you use the msub options.
> >> >> >
> >> >> >Ex:
> >> >> >
> >> >> >qsub -lnodes=2:ppn=16
> >> >> >
> >> >> >Works.
> >> >> >
> >> >> >-Nathan
> >> >> >
> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> >> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >> >> >> bugs) or a bug in alps. I got an allocation using that command and
> >> >> >> alps only sees 1 node:
> >> >> >>
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >> >> >>
> >> >> >> ======================   ALLOCATED NODES   ======================
> >> >> >>
> >> >> >>  Data for node: 29  Num slots: 16  Max slots: 0
> >> >> >>
> >> >> >> =================================================================
> >> >> >>
> >> >> >> Torque also shows only one node with 16 PPN:
> >> >> >>
> >> >> >> $ env | grep PBS
> >> >> >> ...
> >> >> >> PBS_NUM_PPN=16
> >> >> >>
> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
> >> >> >> login1
> >> >> >>
> >> >> >> Which is wrong! I will have to ask Cray what is going on here. I
> >> >> >> recommend you switch to msub to get an allocation. Moab has fewer
> >> >> >> bugs. I can't even get aprun to work:
> >> >> >>
> >> >> >> $ aprun -n 2 -N 1 hostname
> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >>
> >> >> >> $ aprun -n 32 hostname
> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >>
> >> >> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >> >> >>
> >> >> >> msub -I -lnodes=2:ppn=16
> >> >> >>
> >> >> >> Open MPI should then work correctly.
> >> >> >>
> >> >> >> -Nathan Hjelm
> >> >> >> HPC-5, LANL
> >> >> >>
> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> >> >> >    Hi,
> >> >> >> >    I installed OpenMPI on our small XE6 using the configure options
> >> >> >> >    under the /contrib directory. It appears to be working fine, but
> >> >> >> >    it ignores MCA parameters (set in env vars). So I switched to
> >> >> >> >    mpirun (in OpenMPI), and it can handle MCA parameters somehow.
> >> >> >> >    However, mpirun fails to allocate processes by cores. For
> >> >> >> >    example, when I allocated 32 cores (on 2 nodes) with "qsub
> >> >> >> >    -lmppwidth=32 -lmppnppn=16", mpirun recognized it as 2 slots.
> >> >> >> >    Is it possible for mpirun to handle the multicore nodes of the
> >> >> >> >    XE6 properly, or are there any options to handle MCA parameters
> >> >> >> >    for aprun?
> >> >> >> >    Regards,
> >> >> >> >
> >> >> >> >    -----------------------------------------------------------------------------
> >> >> >> >    Keita Teranishi
> >> >> >> >    Principal Member of Technical Staff
> >> >> >> >    Scalable Modeling and Analysis Systems
> >> >> >> >    Sandia National Laboratories
> >> >> >> >    Livermore, CA 94551
> >> >> >> >    +1 (925) 294-3738
> >> >> >>
> >> >> >> > _______________________________________________
> >> >> >> > users mailing list
> >> >> >> > us...@open-mpi.org
> >> >> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
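
For anyone comparing what mpirun thinks it has against what was requested, the "ALLOCATED NODES" block quoted in this thread can be totalled mechanically. The following is only a rough sketch: the line format is copied from the output quoted above and may differ in other Open MPI versions, and the names parse_allocation and enough_slots are made up for illustration (they are not part of Open MPI).

```python
import re

# Matches lines such as: "Data for node: 29  Num slots: 16  Max slots: 0"
# (format assumed from the allocation dump quoted in this thread).
NODE_LINE = re.compile(
    r"Data for node:\s*(\S+)\s+Num slots:\s*(\d+)\s+Max slots:\s*(\d+)"
)


def parse_allocation(text):
    """Return {node_name: num_slots} for every 'Data for node' line in text."""
    return {m.group(1): int(m.group(2)) for m in NODE_LINE.finditer(text)}


def enough_slots(text, requested):
    """True if the slots listed in text add up to at least `requested`."""
    return sum(parse_allocation(text).values()) >= requested


# Sample taken from the allocation dump quoted in this thread.
SAMPLE = """\
======================   ALLOCATED NODES   ======================

 Data for node: 29  Num slots: 16  Max slots: 0

=================================================================
"""

if __name__ == "__main__":
    slots = parse_allocation(SAMPLE)
    print("nodes:", slots)
    print("total slots:", sum(slots.values()))
    print("enough for -np 32:", enough_slots(SAMPLE, 32))
```

With the sample above, only node 29 with 16 slots is found, so a 32-process launch would not fit, which is consistent with alps seeing a single node in that allocation.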