Ok, that should have worked. I just double-checked it to be sure:

ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
How did you configure Open MPI and what version are you using?

-Nathan

On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> Hi Nathan,
>
> I tried the qsub option you suggested:
>
> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names=0 ./cpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   ./cpi
>
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
>
> Here is what I got from aprun:
>
> aprun -n 32 ./cpi
> Process 8 of 32 is on nid00011
> Process 5 of 32 is on nid00011
> Process 12 of 32 is on nid00011
> Process 9 of 32 is on nid00011
> Process 11 of 32 is on nid00011
> Process 13 of 32 is on nid00011
> Process 0 of 32 is on nid00011
> Process 6 of 32 is on nid00011
> Process 3 of 32 is on nid00011
> :
> :
>
> Also, I found a strange error at the end of the program (MPI_Finalize?).
> Can you tell me what is wrong with that?
>
> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>
> Thanks,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
>
> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>
> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >broken. You can continue to use qsub (though qsub use is strongly
> >discouraged) if you use the msub options.
> >
> >Ex:
> >
> >qsub -lnodes=2:ppn=16
> >
> >Works.
> >
> >-Nathan
> >
> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >> bugs) or a bug in alps.
> >> I got an allocation using that command and alps only sees 1 node:
> >>
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >>
> >> ====================== ALLOCATED NODES ======================
> >>
> >>  Data for node: 29   Num slots: 16   Max slots: 0
> >>
> >> =================================================================
> >>
> >> Torque also shows only one node with 16 PPN:
> >>
> >> $ env | grep PBS
> >> ...
> >> PBS_NUM_PPN=16
> >>
> >> $ cat /var/spool/torque/aux//915289.sdb
> >> login1
> >>
> >> Which is wrong! I will have to ask Cray what is going on here. I
> >> recommend you switch to msub to get an allocation. Moab has fewer
> >> bugs. I can't even get aprun to work:
> >>
> >> $ aprun -n 2 -N 1 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >> $ aprun -n 32 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >>
> >> msub -I -lnodes=2:ppn=16
> >>
> >> Open MPI should then work correctly.
> >>
> >> -Nathan Hjelm
> >> HPC-5, LANL
> >>
> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> > Hi,
> >> >
> >> > I installed Open MPI on our small XE6 using the configure options
> >> > under the /contrib directory. It appears to be working fine, but it
> >> > ignores MCA parameters set in environment variables. So I switched
> >> > to mpirun (from Open MPI), which can handle MCA parameters somehow.
> >> > However, mpirun fails to allocate processes by core. For example, I
> >> > allocated 32 cores (on 2 nodes) with "qsub -lmppwidth=32
> >> > -lmppnppn=16", but mpirun recognizes it as 2 slots. Is it possible
> >> > for mpirun to handle the multicore nodes of the XE6 properly, or are
> >> > there any options to handle MCA parameters for aprun?
> >> >
> >> > Regards,
> >> >
> >> > -----------------------------------------------------------------------------
> >> > Keita Teranishi
> >> > Principal Member of Technical Staff
> >> > Scalable Modeling and Analysis Systems
> >> > Sandia National Laboratories
> >> > Livermore, CA 94551
> >> > +1 (925) 294-3738
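
For what it's worth, on the question of handling MCA parameters with aprun: besides the --mca option to mpirun, Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<param_name>, and that is the form an aprun-launched job can pick up. A rough sketch follows; mpi_show_mca_params is just an example parameter, and it assumes aprun exports the login-node environment to the compute nodes, which it should do by default:

# 1) On the mpirun command line:
mpirun -np 32 --mca mpi_show_mca_params all ./cpi

# 2) As an environment variable; aprun itself has no --mca option,
#    so this is the way to reach an aprun-launched job:
export OMPI_MCA_mpi_show_mca_params=all
aprun -n 32 -N 16 ./cpi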
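
And to double-check the allocation itself before blaming mpirun, something along these lines should confirm what torque actually handed out. This is only a sketch, assuming a torque/Moab setup where PBS_NODEFILE and PBS_NUM_PPN are set inside the job, as in the output above:

# Request an interactive allocation: 2 nodes, 16 processes per node.
msub -I -lnodes=2:ppn=16

# Inside the job, check what was really allocated. The node file
# should list two distinct compute nodes, not the login node.
echo $PBS_NUM_PPN
sort -u $PBS_NODEFILE
sort -u $PBS_NODEFILE | wc -l    # expect 2

# If that looks right, mpirun should see 32 slots:
mpirun -np 32 ./cpi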