Hmm, this looks like either a bug in qsub (Torque is full of serious bugs) or a bug in ALPS. I got an allocation using that command and ALPS only sees 1 node:
[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
[ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
[ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
[ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
[ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
[ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
[ct-login1.localdomain:06010] ras:alps:allocate: success
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29

======================   ALLOCATED NODES   ======================
 Data for node: 29   Num slots: 16   Max slots: 0
=================================================================

Torque also shows only one node with 16 PPN:

$ env | grep PBS
...
PBS_NUM_PPN=16

$ cat /var/spool/torque/aux//915289.sdb
login1

Which is wrong! I will have to ask Cray what is going on here.

I recommend you switch to msub to get an allocation; Moab has fewer bugs. I can't even get aprun to work:

$ aprun -n 2 -N 1 hostname
apsched: claim exceeds reservation's node-count

$ aprun -n 32 hostname
apsched: claim exceeds reservation's node-count

To get an interactive session with 2 nodes and 16 ppn on each, run:

msub -I -lnodes=2:ppn=16

Open MPI should then work correctly.

-Nathan Hjelm
HPC-5, LANL

On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> Hi,
>
> I installed Open MPI on our small XE6 using the configure options under
> the /contrib directory. It appears to be working fine, but it ignores MCA
> parameters (set in environment variables), so I switched to mpirun (in
> Open MPI), which can handle MCA parameters somehow. However, mpirun fails
> to allocate processes by core. For example, when I allocated 32 cores (on
> 2 nodes) with "qsub -lmppwidth=32 -lmppnppn=16", mpirun recognizes it as
> 2 slots. Is it possible for mpirun to handle the multicore nodes of the
> XE6 properly, or is there an option to handle MCA parameters for aprun?
>
> Regards,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
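For reference, a minimal sketch of the suggested workaround, assuming an interactive Moab allocation as above; the application name (./my_app) and the btl value are placeholders, while the OMPI_MCA_* environment-variable form and the --mca option are the standard ways to set Open MPI MCA parameters:

$ msub -I -lnodes=2:ppn=16                        # interactive allocation: 2 nodes, 16 ppn each
$ export OMPI_MCA_btl=self,sm,ugni                # example only: set an MCA parameter via environment variable
$ mpirun -np 32 ./my_app                          # Open MPI's mpirun should now see 2 nodes x 16 slots
$ mpirun --mca btl self,sm,ugni -np 32 ./my_app   # equivalent: pass the MCA parameter on the command line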