Here is what we can see:

knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
total 8
drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05 4.0.36.securitypatch.20111221
drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
lrwxrwxrwx 1 root root    6 2013-01-11 15:19 default -> 4.1.40

Thanks,
Keita




On 11/26/13 3:19 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

>Alps reports that the two nodes each have one slot. What PE release
>are you using? A quick way to find out is ls -l /opt/cray/xe-sysroot on
>the external login node (this directory does not exist on the internal
>login nodes.)
>
>-Nathan
>
>On Tue, Nov 26, 2013 at 11:07:36PM +0000, Teranishi, Keita wrote:
>> Nathan,
>> 
>> Here it is.
>> 
>> Keita
>> 
>> 
>> 
>> 
>> 
>> On 11/26/13 3:02 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> 
>> >Ok, that sheds a little more light on the situation. For some reason it
>> >sees 2 nodes, apparently with one slot each. One more set of outputs
>> >would be helpful. Please run with -mca ras_base_verbose 100. That way I
>> >can see what was read from alps.
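>> >
>> >(For example, something along these lines, reusing the cpi test binary
>> >from earlier in the thread; the binary name is only illustrative:
>> >
>> >    mpirun -np 4 -mca ras_base_verbose 100 ./cpi
>> >
>> >and send along whatever it prints.)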
>> >
>> >-Nathan
>> >
>> >On Tue, Nov 26, 2013 at 10:14:11PM +0000, Teranishi, Keita wrote:
>> >> Nathan,
>> >> 
>> >> I am hoping these files would help you.
>> >> 
>> >> Thanks,
>> >> Keita
>> >> 
>> >> 
>> >> 
>> >> On 11/26/13 1:41 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> 
>> >> >Well, no hints as to the error there. Looks identical to the output
>> >> >on my XE-6. How about setting -mca rmaps_base_verbose 100? See what
>> >> >is going on with the mapper.
>> >> >
>> >> >-Nathan Hjelm
>> >> >Application Readiness, HPC-5, LANL
>> >> >
>> >> >On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:
>> >> >> Nathan,
>> >> >> 
>> >> >> Please see the attached obtained from two cases (-np 2 and -np 4).
>> >> >> 
>> >> >> Thanks,
>> >> >> 
>> >> >> --------------------------------------------------------------------------
>> >> >> Keita Teranishi
>> >> >> Principal Member of Technical Staff
>> >> >> Scalable Modeling and Analysis Systems
>> >> >> Sandia National Laboratories
>> >> >> Livermore, CA 94551
>> >> >> +1 (925) 294-3738
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> 
>> >> >> On 11/26/13 1:26 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> 
>> >> >> >Seems like something is going wrong with processor binding. Can
>> >> >> >you run with -mca plm_base_verbose 100? Might shed some light on
>> >> >> >why it thinks there are not enough slots.
>> >> >> >
>> >> >> >-Nathan Hjelm
>> >> >> >Application Readiness, HPC-5, LANL
>> >> >> >
>> >> >> >On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
>> >> >> >> Nathan,
>> >> >> >> 
>> >> >> >> Now I have removed the strip_prefix setting, which I had applied
>> >> >> >> to the other versions of OpenMPI. I still have the same problem
>> >> >> >> when running under an msub allocation.
>> >> >> >> 
>> >> >> >> knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
>> >> >> >> qsub: waiting for job 7754058.sdb to start
>> >> >> >> qsub: job 7754058.sdb ready
>> >> >> >> 
>> >> >> >> knteran@mzlogin01:~> cd test-openmpi/
>> >> >> >> knteran@mzlogin01:~/test-openmpi> !mp
>> >> >> >> mpicc cpi.c -o cpi
>> >> >> >> knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
>> >> >> >> 
>> >> >> >> --------------------------------------------------------------------------
>> >> >> >> There are not enough slots available in the system to satisfy the 4
>> >> >> >> slots that were requested by the application:
>> >> >> >>   ./cpi
>> >> >> >> 
>> >> >> >> Either request fewer slots for your application, or make more slots
>> >> >> >> available for use.
>> >> >> >> --------------------------------------------------------------------------
>> >> >> >> 
>> >> >> >> I set PATH and LD_LIBRARY_PATH to match my own OpenMPI
>> >> >> >> installation.
>> >> >> >> knteran@mzlogin01:~/test-openmpi> which mpirun
>> >> >> >> /home/knteran/openmpi/bin/mpirun
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> Thanks,
>> >> >> >> 
>> >> >> >> 
>> >> >> >> --------------------------------------------------------------------------
>> >> >> >> Keita Teranishi
>> >> >> >> Principal Member of Technical Staff
>> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> Sandia National Laboratories
>> >> >> >> Livermore, CA 94551
>> >> >> >> +1 (925) 294-3738
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> 
>> >> >> >> On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> 
>> >> >> >> >Weird. That is the same configuration we have deployed on
>> >> >> >> >Cielito and Cielo. Does it work under an msub allocation?
>> >> >> >> >
>> >> >> >> >BTW, with that configuration you should not set
>> >> >> >> >plm_base_strip_prefix_from_node_names to 0. That will confuse
>> >> >> >> >orte since the node hostname will not match what was supplied
>> >> >> >> >by alps.
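>> >> >> >> >
>> >> >> >> >(For reference, a sketch of the two usual ways an MCA parameter
>> >> >> >> >like this ends up set; the value 0 here is only illustrative:
>> >> >> >> >
>> >> >> >> >    mpirun --mca plm_base_strip_prefix_from_node_names 0 ...
>> >> >> >> >    export OMPI_MCA_plm_base_strip_prefix_from_node_names=0
>> >> >> >> >
>> >> >> >> >so it is worth checking that neither the command line nor a
>> >> >> >> >lingering OMPI_MCA_ environment variable still carries it.)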
>> >> >> >> >
>> >> >> >> >-Nathan
>> >> >> >> >
>> >> >> >> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> Nathan,
>> >> >> >> >> 
>> >> >> >> >> (Please forget about the segfault. It was my mistake.)
>> >> >> >> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program.
>> >> >> >> >> I used contrib/platform/lanl/cray_xe6/optimized_lustre and
>> >> >> >> >> --enable-mpirun-prefix-by-default for configuration.  As I said,
>> >> >> >> >> it works fine with aprun, but fails with mpirun/mpiexec.
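>> >> >> >> >> 
>> >> >> >> >> (Roughly, the configure invocation was something like the
>> >> >> >> >> following; the --prefix path is only a guess based on where
>> >> >> >> >> mpirun lives:
>> >> >> >> >> 
>> >> >> >> >>   ./configure --with-platform=contrib/platform/lanl/cray_xe6/optimized_lustre \
>> >> >> >> >>       --enable-mpirun-prefix-by-default --prefix=$HOME/openmpi)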
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
>> >> >> >> >> 
>> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> There are not enough slots available in the system to satisfy the 4
>> >> >> >> >> slots that were requested by the application:
>> >> >> >> >>   ./a.out
>> >> >> >> >> 
>> >> >> >> >> Either request fewer slots for your application, or make more slots
>> >> >> >> >> available for use.
>> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> 
>> >> >> >> >> Thanks,
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> Keita Teranishi
>> >> >> >> >> Principal Member of Technical Staff
>> >> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> >> Sandia National Laboratories
>> >> >> >> >> Livermore, CA 94551
>> >> >> >> >> +1 (925) 294-3738
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> 
>> >> >> >> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> >> 
>> >> >> >> >> >Ok, that should have worked. I just double-checked it to be sure.
>> >> >> >> >> >
>> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
>> >> >> >> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
>> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >How did you configure Open MPI and what version are you using?
>> >> >> >> >> >
>> >> >> >> >> >-Nathan
>> >> >> >> >> >
>> >> >> >> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> >> Hi Nathan,
>> >> >> >> >> >> 
>> >> >> >> >> >> I tried the qsub option you suggested:
>> >> >> >> >> >> 
>> >> >> >> >> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
>> >> >> >> >> >> 
>> >> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> >> There are not enough slots available in the system to satisfy the 4
>> >> >> >> >> >> slots that were requested by the application:
>> >> >> >> >> >>   ./cpi
>> >> >> >> >> >> 
>> >> >> >> >> >> Either request fewer slots for your application, or make more slots
>> >> >> >> >> >> available for use.
>> >> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> Here is what I got from aprun:
>> >> >> >> >> >> aprun  -n 32 ./cpi
>> >> >> >> >> >> Process 8 of 32 is on nid00011
>> >> >> >> >> >> Process 5 of 32 is on nid00011
>> >> >> >> >> >> Process 12 of 32 is on nid00011
>> >> >> >> >> >> Process 9 of 32 is on nid00011
>> >> >> >> >> >> Process 11 of 32 is on nid00011
>> >> >> >> >> >> Process 13 of 32 is on nid00011
>> >> >> >> >> >> Process 0 of 32 is on nid00011
>> >> >> >> >> >> Process 6 of 32 is on nid00011
>> >> >> >> >> >> Process 3 of 32 is on nid00011
>> >> >> >> >> >> :
>> >> >> >> >> >> 
>> >> >> >> >> >> :
>> >> >> >> >> >> 
>> >> >> >> >> >> Also, I found a strange error at the end of the program
>> >> >> >> >> >> (MPI_Finalize?). Can you tell me what is wrong with that?
>> >> >> >> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
>> >> >> >> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
>> >> >> >> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
>> >> >> >> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
>> >> >> >> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
>> >> >> >> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
>> >> >> >> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
>> >> >> >> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> Thanks,
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> --------------------------------------------------------------------------
>> >> >> >> >> >> Keita Teranishi
>> >> >> >> >> >> 
>> >> >> >> >> >> Principal Member of Technical Staff
>> >> >> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> >> >> Sandia National Laboratories
>> >> >> >> >> >> Livermore, CA 94551
>> >> >> >> >> >> +1 (925) 294-3738
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> 
>> >> >> >> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> >> >> 
>> >> >> >> >> >> >Just talked with our local Cray rep. Sounds like that torque
>> >> >> >> >> >> >syntax is broken. You can continue to use qsub (though qsub
>> >> >> >> >> >> >use is strongly discouraged) if you use the msub options.
>> >> >> >> >> >> >
>> >> >> >> >> >> >Ex:
>> >> >> >> >> >> >
>> >> >> >> >> >> >qsub -lnodes=2:ppn=16
>> >> >> >> >> >> >
>> >> >> >> >> >> >Works.
>> >> >> >> >> >> >
>> >> >> >> >> >> >-Nathan
>> >> >> >> >> >> >
>> >> >> >> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
>> >> >> >> >> >> >> Hmm, this seems like either a bug in qsub (torque is full
>> >> >> >> >> >> >> of serious bugs) or a bug in alps. I got an allocation
>> >> >> >> >> >> >> using that command and alps only sees 1 node:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
>> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> ======================   ALLOCATED NODES   ======================
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>  Data for node: 29 Num slots: 16   Max slots: 0
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> =================================================================
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Torque also shows only one node with 16 PPN:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ env | grep PBS
>> >> >> >> >> >> >> ...
>> >> >> >> >> >> >> PBS_NUM_PPN=16
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
>> >> >> >> >> >> >> login1
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Which is wrong! I will have to ask Cray what is going on
>> >> >> >> >> >> >> here. I recommend you switch to msub to get an allocation.
>> >> >> >> >> >> >> Moab has fewer bugs. I can't even get aprun to work:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ aprun -n 2 -N 1 hostname
>> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> $ aprun -n 32 hostname
>> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> To get an interactive session with 2 nodes and 16 ppn on
>> >> >> >> >> >> >> each, run:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> msub -I -lnodes=2:ppn=16
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Open MPI should then work correctly.
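>> >> >> >> >> >> >>
>> >> >> >> >> >> >> (A sketch of what the working sequence should look like,
>> >> >> >> >> >> >> using the cpi binary from earlier in the thread only as an
>> >> >> >> >> >> >> illustration:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>   msub -I -lnodes=2:ppn=16
>> >> >> >> >> >> >>   mpirun -np 32 ./cpi
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> with mpirun seeing all 32 slots across the 2 nodes.)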
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> -Nathan Hjelm
>> >> >> >> >> >> >> HPC-5, LANL
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> >> >> >    Hi,
>> >> >> >> >> >> >> >    I installed OpenMPI on our small XE6 using the configure
>> >> >> >> >> >> >> >    options under the /contrib directory.  It appears to be
>> >> >> >> >> >> >> >    working fine, but it ignores MCA parameters (set in env
>> >> >> >> >> >> >> >    vars).  So I switched to mpirun (in OpenMPI), and it can
>> >> >> >> >> >> >> >    handle MCA parameters somehow.  However, mpirun fails to
>> >> >> >> >> >> >> >    allocate processes by cores.  For example, I allocated
>> >> >> >> >> >> >> >    32 cores (on 2 nodes) by "qsub -lmppwidth=32
>> >> >> >> >> >> >> >    -lmppnppn=16", but mpirun recognizes it as only 2 slots.
>> >> >> >> >> >> >> >    Is it possible for mpirun to handle the multicore nodes
>> >> >> >> >> >> >> >    of the XE6 properly, or are there any options to handle
>> >> >> >> >> >> >> >    MCA parameters for aprun?
>> >> >> >> >> >> >> >    Regards,
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >    --------------------------------------------------------------------------
>> >> >> >> >> >> >> >    Keita Teranishi
>> >> >> >> >> >> >> >    Principal Member of Technical Staff
>> >> >> >> >> >> >> >    Scalable Modeling and Analysis Systems
>> >> >> >> >> >> >> >    Sandia National Laboratories
>> >> >> >> >> >> >> >    Livermore, CA 94551
>> >> >> >> >> >> >> >    +1 (925) 294-3738
>> >> >> >> >> >> >>