Nathan,

I have now removed the strip_prefix setting, which I had applied to the
other versions of OpenMPI.
I still have the same problem with mpirun under an msub allocation.
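
For reference, the setting maps to an environment variable of the form
below (Open MPI reads any MCA parameter from an OMPI_MCA_-prefixed
variable), so removing it meant unsetting:

export OMPI_MCA_plm_base_strip_prefix_from_node_names=0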

knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
qsub: waiting for job 7754058.sdb to start
qsub: job 7754058.sdb ready

knteran@mzlogin01:~> cd test-openmpi/
knteran@mzlogin01:~/test-openmpi> !mp
mpicc cpi.c -o cpi
knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./cpi

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------

I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.
knteran@mzlogin01:~/test-openmpi> which mpirun
/home/knteran/openmpi/bin/mpirun
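
If it helps, I can rerun with the allocation display and RAS verbosity
turned up; assuming these standard mpirun options, something like:

mpirun -np 4 --display-allocation --mca ras_base_verbose 100 ./cpi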




Thanks,

-----------------------------------------------------------------------------
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738





On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

>Weird. That is the same configuration we have deployed on Cielito and
>Cielo. Does it work under an msub allocation?
>
>BTW, with that configuration you should not set
>plm_base_strip_prefix_from_node_names to 0. That will confuse orte since
>the node hostname will not match what was supplied by alps.
>
>-Nathan
>
>On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
>> Nathan,
>> 
>> (Please forget about the segfault; it was my mistake.)
>> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program.  I used
>> contrib/platform/lanl/cray_xe6/optimized_lustre and
>> --enable-mpirun-prefix-by-default for configuration.  As I said, it
>> works fine with aprun, but fails with mpirun/mpiexec.
>> 
>> 
>> knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
>> 
>> --------------------------------------------------------------------------
>> There are not enough slots available in the system to satisfy the 4 slots
>> that were requested by the application:
>>   ./a.out
>> 
>> Either request fewer slots for your application, or make more slots
>> available for use.
>> 
>> --------------------------------------------------------------------------
>> 
>> Thanks,
>> 
>> 
>> ---------------------------------------------------------------------------
>> Keita Teranishi
>> Principal Member of Technical Staff
>> Scalable Modeling and Analysis Systems
>> Sandia National Laboratories
>> Livermore, CA 94551
>> +1 (925) 294-3738
>> 
>> 
>> 
>> 
>> 
>> On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> 
>> >Ok, that should have worked. I just double-checked it to be sure.
>> >
>> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
>> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
>> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
>> >
>> >
>> >How did you configure Open MPI and what version are you using?
>> >
>> >-Nathan
>> >
>> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
>> >> Hi Nathan,
>> >> 
>> >> I tried the qsub option you suggested, then ran:
>> >> 
>> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi
>> >> 
>> >> --------------------------------------------------------------------------
>> >> There are not enough slots available in the system to satisfy the 4 slots
>> >> that were requested by the application:
>> >>   ./cpi
>> >> 
>> >> Either request fewer slots for your application, or make more slots
>> >> available for use.
>> >> 
>> >> --------------------------------------------------------------------------
>> >> 
>> >> 
>> >> Here is what I got from aprun:
>> >> aprun  -n 32 ./cpi
>> >> Process 8 of 32 is on nid00011
>> >> Process 5 of 32 is on nid00011
>> >> Process 12 of 32 is on nid00011
>> >> Process 9 of 32 is on nid00011
>> >> Process 11 of 32 is on nid00011
>> >> Process 13 of 32 is on nid00011
>> >> Process 0 of 32 is on nid00011
>> >> Process 6 of 32 is on nid00011
>> >> Process 3 of 32 is on nid00011
>> >> :
>> >> 
>> >> :
>> >> 
>> >> Also, I found a strange error at the end of the program
>> >> (in MPI_Finalize?).  Can you tell me what is wrong with it?
>> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
>> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
>> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
>> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
>> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
>> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
>> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
>> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>> >> 
>> >> 
>> >> 
>> >> Thanks,
>> >> 
>> >> 
>> >> ---------------------------------------------------------------------------
>> >> Keita Teranishi
>> >> 
>> >> Principal Member of Technical Staff
>> >> Scalable Modeling and Analysis Systems
>> >> Sandia National Laboratories
>> >> Livermore, CA 94551
>> >> +1 (925) 294-3738
>> >> 
>> >> 
>> >> 
>> >> 
>> >> 
>> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> 
>> >> >Just talked with our local Cray rep. Sounds like that torque syntax
>> >> >is broken. You can continue to use qsub (though qsub use is strongly
>> >> >discouraged) if you use the msub options.
>> >> >
>> >> >Ex:
>> >> >
>> >> >qsub -lnodes=2:ppn=16
>> >> >
>> >> >Works.
>> >> >
>> >> >-Nathan
>> >> >
>> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
>> >> >> Hmm, this seems like either a bug in qsub (torque is full of
>> >> >> serious bugs) or a bug in alps. I got an allocation using that
>> >> >> command and alps only sees 1 node:
>> >> >> 
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
>> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
>> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
>> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
>> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
>> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>> >> >> 
>> >> >> ======================   ALLOCATED NODES   ======================
>> >> >> 
>> >> >>  Data for node: 29     Num slots: 16   Max slots: 0
>> >> >> 
>> >> >> =================================================================
>> >> >> 
>> >> >> 
>> >> >> Torque also shows only one node with 16 PPN:
>> >> >> 
>> >> >> $ env | grep PBS
>> >> >> ...
>> >> >> PBS_NUM_PPN=16
>> >> >> 
>> >> >> 
>> >> >> $ cat /var/spool/torque/aux//915289.sdb
>> >> >> login1
>> >> >> 
>> >> >> Which is wrong! I will have to ask Cray what is going on here. I
>> >> >> recommend you switch to msub to get an allocation. Moab has fewer
>> >> >> bugs. I can't even get aprun to work:
>> >> >> 
>> >> >> $ aprun -n 2 -N 1 hostname
>> >> >> apsched: claim exceeds reservation's node-count
>> >> >> 
>> >> >> $ aprun -n 32 hostname
>> >> >> apsched: claim exceeds reservation's node-count
>> >> >> 
>> >> >> 
>> >> >> To get an interactive session with 2 nodes and 16 ppn each, run:
>> >> >> 
>> >> >> msub -I -lnodes=2:ppn=16
>> >> >> 
>> >> >> Open MPI should then work correctly.
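>> >> >> 
>> >> >> A quick way to confirm what ORTE sees from the allocation (hostname
>> >> >> here is just a stand-in for any executable):
>> >> >> 
>> >> >> mpirun --display-allocation hostname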
>> >> >> 
>> >> >> -Nathan Hjelm
>> >> >> HPC-5, LANL
>> >> >> 
>> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
>> >> >> >    Hi,
>> >> >> >    I installed OpenMPI on our small XE6 using the configure
>> >> >> >    options under the /contrib directory.  It appears to be working
>> >> >> >    fine, but it ignores MCA parameters (set in env vars).  So I
>> >> >> >    switched to mpirun (in OpenMPI), and it can handle MCA
>> >> >> >    parameters somehow.  However, mpirun fails to allocate
>> >> >> >    processes by cores.  For example, when I allocated 32 cores (on
>> >> >> >    2 nodes) with "qsub -lmppwidth=32 -lmppnppn=16", mpirun
>> >> >> >    recognized it as only 2 slots.  Is it possible for mpirun to
>> >> >> >    handle the multicore nodes of the XE6 properly, or is there any
>> >> >> >    option to pass MCA parameters to aprun?  (An example of the
>> >> >> >    env-var form I used is below.)
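>> >> >> >    For instance, I export them like this before launching (the
>> >> >> >    parameter name here is just illustrative; any MCA parameter can
>> >> >> >    be set through an OMPI_MCA_-prefixed variable):
>> >> >> >    
>> >> >> >    export OMPI_MCA_mpi_show_mca_params=all
>> >> >> >    aprun -n 32 ./a.out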
>> >> >> >    Regards,
>> >> >> >    
>> >> 
>> >> >> >    -----------------------------------------------------------------
>> >> >> >    Keita Teranishi
>> >> >> >    Principal Member of Technical Staff
>> >> >> >    Scalable Modeling and Analysis Systems
>> >> >> >    Sandia National Laboratories
>> >> >> >    Livermore, CA 94551
>> >> >> >    +1 (925) 294-3738
>> >> >> 
>> >> >> 
>> >> >
>> >> >
>> >> >
>> >> >
>> >> 
>> 
