Hi Nathan, I tried the qsub option you suggested. Here is what I got from mpirun:
mpirun -np 4 --mca plm_base_strip_prefix_from_node_names=0 ./cpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./cpi

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------

Here is what I got from aprun:

aprun -n 32 ./cpi
Process 8 of 32 is on nid00011
Process 5 of 32 is on nid00011
Process 12 of 32 is on nid00011
Process 9 of 32 is on nid00011
Process 11 of 32 is on nid00011
Process 13 of 32 is on nid00011
Process 0 of 32 is on nid00011
Process 6 of 32 is on nid00011
Process 3 of 32 is on nid00011
:
:

Also, I found a strange error at the end of the program (in MPI_Finalize?).
Can you tell me what is wrong with that?

[nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
[nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
[nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
[nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
[nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
[nid00010:23511] [ 5] ./cpi() [0x400e23]
[nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
[nid00010:23511] [ 7] ./cpi() [0x400b09]

Thanks,
-----------------------------------------------------------------------------
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738
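For reference, MCA parameters can also be exported as environment variables using the OMPI_MCA_ prefix instead of passing --mca on the command line; whether aprun-launched jobs actually honor those variables is exactly the issue raised in the original message at the bottom of this thread. A minimal sketch of the env-var form (the mpirun invocation itself is unchanged):

export OMPI_MCA_plm_base_strip_prefix_from_node_names=0   # same setting as --mca plm_base_strip_prefix_from_node_names 0
mpirun -np 4 ./cpi                                        # mpirun reads any OMPI_MCA_<param> variable from the environment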
On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

>Just talked with our local Cray rep. Sounds like that torque syntax is
>broken. You can continue to use qsub (though qsub use is strongly
>discouraged) if you use the msub options.
>
>Ex:
>
>qsub -lnodes=2:ppn=16
>
>Works.
>
>-Nathan
>
>On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
>> Hmm, this seems like either a bug in qsub (torque is full of serious
>> bugs) or a bug in alps. I got an allocation using that command and alps
>> only sees 1 node:
>>
>> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
>> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
>> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
>> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
>> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
>> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
>> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
>> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
>> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
>> [ct-login1.localdomain:06010] ras:alps:allocate: success
>> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
>> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>>
>> ======================   ALLOCATED NODES   ======================
>>
>>  Data for node: 29      Num slots: 16   Max slots: 0
>>
>> =================================================================
>>
>>
>> Torque also shows only one node with 16 PPN:
>>
>> $ env | grep PBS
>> ...
>> PBS_NUM_PPN=16
>>
>>
>> $ cat /var/spool/torque/aux//915289.sdb
>> login1
>>
>> Which is wrong! I will have to ask Cray what is going on here. I
>> recommend you switch to msub to get an allocation. Moab has fewer
>> bugs. I can't even get aprun to work:
>>
>> $ aprun -n 2 -N 1 hostname
>> apsched: claim exceeds reservation's node-count
>>
>> $ aprun -n 32 hostname
>> apsched: claim exceeds reservation's node-count
>>
>>
>> To get an interactive session with 2 nodes and 16 ppn on each, run:
>>
>> msub -I -lnodes=2:ppn=16
>>
>> Open MPI should then work correctly.
>>
>> -Nathan Hjelm
>> HPC-5, LANL
>>
>> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
>> > Hi,
>> >
>> > I installed Open MPI on our small XE6 using the configure options under
>> > the /contrib directory. It appears to be working fine, but it ignores MCA
>> > parameters (set in environment variables). So I switched to mpirun (from
>> > Open MPI), which can handle MCA parameters somehow. However, mpirun fails
>> > to allocate processes by core. For example, when I allocated 32 cores (on
>> > 2 nodes) with "qsub -lmppwidth=32 -lmppnppn=16", mpirun recognized it as
>> > 2 slots. Is it possible for mpirun to handle the multicore nodes of the
>> > XE6 properly, or is there any option to handle MCA parameters with aprun?
>> >
>> > Regards,
>> > -----------------------------------------------------------------------------
>> > Keita Teranishi
>> > Principal Member of Technical Staff
>> > Scalable Modeling and Analysis Systems
>> > Sandia National Laboratories
>> > Livermore, CA 94551
>> > +1 (925) 294-3738
>>
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
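Building on the msub recommendation above, a batch-mode equivalent of the interactive "msub -I -lnodes=2:ppn=16" request might look like the sketch below; the #PBS directive form, the script name, and the -np count are assumptions for illustration, not something verified on this XE6.

#!/bin/bash
#PBS -l nodes=2:ppn=16      # request 2 nodes with 16 processors per node, matching the interactive example

cd $PBS_O_WORKDIR           # start in the directory the job was submitted from
mpirun -np 32 ./cpi         # 2 nodes x 16 ppn = 32 MPI processes under Open MPI's mpirun

Submitted with something like "msub cpi.pbs" (qsub may also accept it, subject to the torque issue discussed above).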