Well, no hints as to the error there. Looks identical to the output on my XE-6.
How about setting -mca rmaps_base_verbose 100? That should show what is going
on with the mapper.
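
Something along these lines might make it easier to capture the verbose output. This is only a rough sketch: run_verbose, MPIRUN, and DRY_RUN are names made up here for illustration (they are not Open MPI features), and it assumes mpirun is on your PATH inside the allocation unless MPIRUN points elsewhere.

```shell
#!/bin/sh
# Sketch only: run a program under mpirun with one MCA framework's
# verbosity raised to 100, keeping a copy of the output in a log file.
# run_verbose, MPIRUN and DRY_RUN are hypothetical names, not Open MPI features.
#
# Usage: run_verbose <framework> <nprocs> <program> [args...]
#   e.g. run_verbose rmaps 4 ./cpi
run_verbose() {
    framework=$1
    nprocs=$2
    shift 2

    # Build the full command line; MPIRUN may point at a specific install,
    # for example ~/openmpi/bin/mpirun.
    set -- "${MPIRUN:-mpirun}" -np "$nprocs" \
        -mca "${framework}_base_verbose" 100 "$@"

    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command that would be run.
        echo "$*"
    else
        # Run it and save stdout and stderr to <framework>_verbose.log.
        "$@" 2>&1 | tee "${framework}_verbose.log"
    fi
}
```

With DRY_RUN=1 set, "run_verbose rmaps 4 ./cpi" only prints the command line it would run; without it, the mapper output ends up in rmaps_verbose.log, which is easier to attach than a terminal capture. Passing plm instead of rmaps gives the plm_base_verbose run from earlier in the thread.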

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:
> Nathan,
> 
> Please see the attached output, obtained from two cases (-np 2 and -np 4).
> 
> Thanks,
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
> 
> 
> 
> 
> 
> On 11/26/13 1:26 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> 
> >Seems like something is going wrong with processor binding. Can you run with
> >-mca plm_base_verbose 100? That might shed some light on why it thinks there
> >are not enough slots.
> >
> >-Nathan Hjelm
> >Application Readiness, HPC-5, LANL
> >
> >On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
> >> Nathan,
> >> 
> >> I have now removed the strip_prefix setting, which was applied to the other
> >> versions of OpenMPI.
> >> I still have the same problem with msubrun command.
> >> 
> >> knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
> >> qsub: waiting for job 7754058.sdb to start
> >> qsub: job 7754058.sdb ready
> >> 
> >> knteran@mzlogin01:~> cd test-openmpi/
> >> knteran@mzlogin01:~/test-openmpi> !mp
> >> mpicc cpi.c -o cpi
> >> knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
> >> 
> >> --------------------------------------------------------------------------
> >> There are not enough slots available in the system to satisfy the 4 slots
> >> that were requested by the application:
> >>   ./cpi
> >> 
> >> Either request fewer slots for your application, or make more slots available
> >> for use.
> >> --------------------------------------------------------------------------
> >> 
> >> I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.
> >> knteran@mzlogin01:~/test-openmpi> which mpirun
> >> /home/knteran/openmpi/bin/mpirun
> >> 
> >> 
> >> 
> >> 
> >> Thanks,
> >> 
> >> 
> >> -----------------------------------------------------------------------------
> >> Keita Teranishi
> >> Principal Member of Technical Staff
> >> Scalable Modeling and Analysis Systems
> >> Sandia National Laboratories
> >> Livermore, CA 94551
> >> +1 (925) 294-3738
> >> 
> >> 
> >> 
> >> 
> >> 
> >> On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> >> 
> >> >Weird. That is the same configuration we have deployed on Cielito and Cielo.
> >> >Does it work under an msub allocation?
> >> >
> >> >BTW, with that configuration you should not set
> >> >plm_base_strip_prefix_from_node_names to 0. That will confuse orte since the
> >> >node hostname will not match what was supplied by alps.
> >> >
> >> >-Nathan
> >> >
> >> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
> >> >> Nathan,
> >> >> 
> >> >> (Please forget about the segfault. It was my mistake).
> >> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program.  I used
> >> >> contrib/platform/lanl/cray_xe6/optimized_lustre and
> >> >> --enable-mpirun-prefix-by-default for configuration.  As I said, it works
> >> >> fine with aprun, but fails with mpirun/mpiexec.
> >> >> 
> >> >> 
> >> >> knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
> >> >> --------------------------------------------------------------------------
> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> that were requested by the application:
> >> >>   ./a.out
> >> >> 
> >> >> Either request fewer slots for your application, or make more slots available
> >> >> for use.
> >> >> --------------------------------------------------------------------------
> >> >> 
> >> >> Thanks,
> >> >> 
> >> >> 
> >> 
> >> >> -----------------------------------------------------------------------------
> >> >> Keita Teranishi
> >> >> Principal Member of Technical Staff
> >> >> Scalable Modeling and Analysis Systems
> >> >> Sandia National Laboratories
> >> >> Livermore, CA 94551
> >> >> +1 (925) 294-3738
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> 
> >> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> >> >> 
> >> >> >Ok, that should have worked. I just double-checked it to be sure.
> >> >> >
> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
> >> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
> >> >> >
> >> >> >
> >> >> >How did you configure Open MPI and what version are you using?
> >> >> >
> >> >> >-Nathan
> >> >> >
> >> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> >> >> >> Hi Nathan,
> >> >> >> 
> >> >> >> I tried the qsub option you suggested:
> >> >> >> 
> >> >> >> mpirun -np 4  --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> >> >> >> 
> >> >> >> --------------------------------------------------------------------------
> >> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> >> that were requested by the application:
> >> >> >>   ./cpi
> >> >> >> 
> >> >> >> Either request fewer slots for your application, or make more slots available
> >> >> >> for use.
> >> >> >> --------------------------------------------------------------------------
> >> >> >> 
> >> >> >> 
> >> >> >> Here is what I got from aprun:
> >> >> >> aprun  -n 32 ./cpi
> >> >> >> Process 8 of 32 is on nid00011
> >> >> >> Process 5 of 32 is on nid00011
> >> >> >> Process 12 of 32 is on nid00011
> >> >> >> Process 9 of 32 is on nid00011
> >> >> >> Process 11 of 32 is on nid00011
> >> >> >> Process 13 of 32 is on nid00011
> >> >> >> Process 0 of 32 is on nid00011
> >> >> >> Process 6 of 32 is on nid00011
> >> >> >> Process 3 of 32 is on nid00011
> >> >> >> :
> >> >> >> 
> >> >> >> :
> >> >> >> 
> >> >> >> Also, I found a strange error at the end of the program (MPI_Finalize?).
> >> >> >> Can you tell me what is wrong with that?
> >> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> >> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
> >> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
> >> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
> >> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
> >> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> >> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
> >> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> Thanks,
> >> >> >> 
> >> >> >> 
> >> >> 
> >> 
> >> >> >> -----------------------------------------------------------------------------
> >> >> >> Keita Teranishi
> >> >> >> 
> >> >> >> Principal Member of Technical Staff
> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> Sandia National Laboratories
> >> >> >> Livermore, CA 94551
> >> >> >> +1 (925) 294-3738
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> 
> >> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
> >> >> >> 
> >> >> >> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >> >> >> >broken. You can continue to use qsub (though qsub use is strongly
> >> >> >> >discouraged) if you use the msub options.
> >> >> >> >
> >> >> >> >Ex:
> >> >> >> >
> >> >> >> >qsub -lnodes=2:ppn=16
> >> >> >> >
> >> >> >> >Works.
> >> >> >> >
> >> >> >> >-Nathan
> >> >> >> >
> >> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> >> >> >> Hmm, this seems like either a bug in qsub (torque is full of serious bugs)
> >> >> >> >> or a bug in alps. I got an allocation using that command and alps only sees
> >> >> >> >> 1 node:
> >> >> >> >> 
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >> >> >> >> 
> >> >> >> >> ======================   ALLOCATED NODES   ======================
> >> >> >> >> 
> >> >> >> >>  Data for node: 29        Num slots: 16   Max slots: 0
> >> >> >> >> 
> >> >> >> >> =================================================================
> >> >> >> >> 
> >> >> >> >> 
> >> >> >> >> Torque also shows only one node with 16 PPN:
> >> >> >> >> 
> >> >> >> >> $ env | grep PBS
> >> >> >> >> ...
> >> >> >> >> PBS_NUM_PPN=16
> >> >> >> >> 
> >> >> >> >> 
> >> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
> >> >> >> >> login1
> >> >> >> >> 
> >> >> >> >> Which is wrong! I will have to ask Cray what is going on here. I recommend
> >> >> >> >> you switch to msub to get an allocation. Moab has fewer bugs. I can't even
> >> >> >> >> get aprun to work:
> >> >> >> >> 
> >> >> >> >> $ aprun -n 2 -N 1 hostname
> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> 
> >> >> >> >> $ aprun -n 32 hostname
> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> 
> >> >> >> >> 
> >> >> >> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >> >> >> >> 
> >> >> >> >> msub -I -lnodes=2:ppn=16
> >> >> >> >> 
> >> >> >> >> Open MPI should then work correctly.
> >> >> >> >> 
> >> >> >> >> -Nathan Hjelm
> >> >> >> >> HPC-5, LANL
> >> >> >> >> 
> >> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >    Hi,
> >> >> >> >> >    I installed OpenMPI on our small XE6 using the configure options under
> >> >> >> >> >    the /contrib directory.  It appears to be working fine, but it ignores
> >> >> >> >> >    MCA parameters (set in env vars).  So I switched to mpirun (in OpenMPI),
> >> >> >> >> >    and it can handle MCA parameters somehow.  However, mpirun fails to
> >> >> >> >> >    allocate processes by core.  For example, when I allocated 32 cores (on 2
> >> >> >> >> >    nodes) with "qsub -lmppwidth=32 -lmppnppn=16", mpirun recognized it as 2
> >> >> >> >> >    slots.  Is it possible for mpirun to handle the multicore nodes of the
> >> >> >> >> >    XE6 properly, or are there any options to handle MCA parameters for aprun?
> >> >> >> >> >    Regards,
> >> >> >> >> >    -----------------------------------------------------------------------------
> >> >> >> >> >    Keita Teranishi
> >> >> >> >> >    Principal Member of Technical Staff
> >> >> >> >> >    Scalable Modeling and Analysis Systems
> >> >> >> >> >    Sandia National Laboratories
> >> >> >> >> >    Livermore, CA 94551
> >> >> >> >> >    +1 (925) 294-3738
> >> >> >> >> 
> >> >> >> >> > _______________________________________________
> >> >> >> >> > users mailing list
> >> >> >> >> > us...@open-mpi.org
> >> >> >> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
