Ok, that should have worked. I just double-checked it to be sure.

ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$


How did you configure Open MPI and what version are you using?
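
If it helps, something like this captures what I'm after (a rough sketch; the
config.log path is just a placeholder for wherever you built Open MPI):

$ mpirun --version                      # prints "mpirun (Open MPI) x.y.z"
$ ompi_info | head -n 30                # version, prefix, and build details
$ head -n 10 /path/to/build/config.log  # the exact ./configure invocation is
                                        # recorded near the top of config.log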

-Nathan

On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> Hi Nathan,
> 
> I tried the qsub option you suggested, then ran:
> 
> mpirun -np 4  --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
>   ./cpi
> 
> Either request fewer slots for your application, or make more slots
> available for use.
> --------------------------------------------------------------------------
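> 
> (Side note: as far as I understand, that parameter is normally passed with
> the name and value as separate arguments, e.g.
> 
> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi
> 
> or set as an environment variable before the run:
> 
> export OMPI_MCA_plm_base_strip_prefix_from_node_names=0
> mpirun -np 4 ./cpi
> 
> I am not sure whether the extra "=" and space in my command above matter.)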
> 
> 
> Here is what I got from aprun:
> aprun  -n 32 ./cpi
> Process 8 of 32 is on nid00011
> Process 5 of 32 is on nid00011
> Process 12 of 32 is on nid00011
> Process 9 of 32 is on nid00011
> Process 11 of 32 is on nid00011
> Process 13 of 32 is on nid00011
> Process 0 of 32 is on nid00011
> Process 6 of 32 is on nid00011
> Process 3 of 32 is on nid00011
> :
> 
> :
> 
> Also, I found a strange error at the end of the program (in MPI_Finalize?).
> Can you tell me what is wrong with that?
> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> [nid00010:23511] [ 1]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57)
> [0x2aaaaaf38ec7]
> [nid00010:23511] [ 2]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3)
> [0x2aaaaaf3b6c3]
> [nid00010:23511] [ 3]
> /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2)
> [0x2aaaaae717b2]
> [nid00010:23511] [ 4]
> /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333)
> [0x2aaaaad7be23]
> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6)
> [0x2aaaacde7c36]
> [nid00010:23511] [ 7] ./cpi() [0x400b09]
> 
> 
> 
> Thanks,
> 
> -----------------------------------------------------------------------------
> Keita Teranishi
> 
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
> 
> 
> 
> 
> 
> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> 
> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >broken. You can continue
> >to use qsub (though qsub use is strongly discouraged) if you use the msub
> >options.
> >
> >Ex:
> >
> >qsub -lnodes=2:ppn=16
> >
> >Works.
> >
> >-Nathan
> >
> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >>bugs) or a bug
> >> in alps. I got an allocation using that command and alps only sees 1
> >>node:
> >> 
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/sysconfig/alps"
> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/alps.conf"
> >> [ct-login1.localdomain:06010] ras:alps:allocate:
> >>parser_separated_columns
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler
> >>file: "/ufs/alps_shared/appinfo"
> >> [ct-login1.localdomain:06010]
> >>ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing
> >>appinfo file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: file
> >>/ufs/alps_shared/appinfo read
> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3668 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing
> >>NID 29 with 16 slots
> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert
> >>inserting 1 nodes
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >> 
> >> ======================   ALLOCATED NODES   ======================
> >> 
> >>  Data for node: 29 Num slots: 16   Max slots: 0
> >> 
> >> =================================================================
> >> 
> >> 
> >> Torque also shows only one node with 16 PPN:
> >> 
> >> $ env | grep PBS
> >> ...
> >> PBS_NUM_PPN=16
> >> 
> >> 
> >> $ cat /var/spool/torque/aux//915289.sdb
> >> login1
> >> 
> >> Which is wrong! I will have to ask Cray what is going on here. I
> >>recommend you switch to
> >> msub to get an allocation. Moab has fewer bugs. I can't even get aprun
> >>to work:
> >> 
> >> $ aprun -n 2 -N 1 hostname
> >> apsched: claim exceeds reservation's node-count
> >> 
> >> $ aprun -n 32 hostname
> >> apsched: claim exceeds reservation's node-count
> >> 
> >> 
> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >> 
> >> msub -I -lnodes=2:ppn=16
> >> 
> >> Open MPI should then work correctly.
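> >> 
> >> A rough sketch of the whole sequence (./a.out is just a placeholder for
> >> your MPI application):
> >> 
> >> $ msub -I -lnodes=2:ppn=16    # interactive allocation: 2 nodes x 16 ppn
> >> $ mpirun -np 32 ./a.out       # mpirun should now see all 32 slots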
> >> 
> >> -Nathan Hjelm
> >> HPC-5, LANL
> >> 
> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> >    Hi,
> >> >    I installed Open MPI on our small XE6 using the configure options
> >> >    under the /contrib directory.  It appears to be working fine, but it
> >> >    ignores MCA parameters (set in env vars).  So I switched to mpirun
> >> >    (in Open MPI), and it can handle MCA parameters somehow.  However,
> >> >    mpirun fails to allocate processes by core.  For example, I
> >> >    allocated 32 cores (on 2 nodes) with "qsub -lmppwidth=32
> >> >    -lmppnppn=16", but mpirun recognizes them as only 2 slots.  Is it
> >> >    possible for mpirun to handle the multicore nodes of the XE6
> >> >    properly, or are there any options for handling MCA parameters with
> >> >    aprun?
> >> >    Regards,
> >> >    
> >> >    ---------------------------------------------------------------------------
> >> >    Keita Teranishi
> >> >    Principal Member of Technical Staff
> >> >    Scalable Modeling and Analysis Systems
> >> >    Sandia National Laboratories
> >> >    Livermore, CA 94551
> >> >    +1 (925) 294-3738