Hmm, this looks like either a bug in qsub (Torque is full of serious bugs) or a bug in ALPS. I got an allocation using that command and ALPS only sees 1 node:
[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
[ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
[ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
[ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
[ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
[ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
[ct-login1.localdomain:06010] ras:alps:allocate: success
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29

======================   ALLOCATED NODES   ======================
 Data for node: 29   Num slots: 16   Max slots: 0
=================================================================

Torque also shows only one node with 16 PPN:

$ env | grep PBS
...
PBS_NUM_PPN=16

$ cat /var/spool/torque/aux//915289.sdb
login1

Which is wrong! I will have to ask Cray what is going on here.

I recommend you switch to msub to get an allocation; Moab has fewer bugs. I can't even get aprun to work:

$ aprun -n 2 -N 1 hostname
apsched: claim exceeds reservation's node-count

$ aprun -n 32 hostname
apsched: claim exceeds reservation's node-count

To get an interactive session with 2 nodes and 16 ppn on each, run:

msub -I -lnodes=2:ppn=16

Open MPI should then work correctly.

-Nathan Hjelm
HPC-5, LANL

On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> Hi,
>
> I installed Open MPI on our small XE6 using the configure options under
> the /contrib directory. It appears to be working fine, but it ignores MCA
> parameters (set in environment variables), so I switched to mpirun (in
> Open MPI), which can handle MCA parameters somehow. However, mpirun fails
> to allocate processes by core. For example, when I allocated 32 cores (on
> 2 nodes) with "qsub -lmppwidth=32 -lmppnppn=16", mpirun recognizes it as
> 2 slots. Is it possible for mpirun to handle the multicore nodes of the
> XE6 properly, or is there an option to handle MCA parameters for aprun?
>
> Regards,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
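For reference, a minimal sketch of the suggested workaround, assuming an interactive Moab allocation as above; the application name (./my_app) and the btl value are placeholders, while the OMPI_MCA_* environment-variable form and the --mca option are the standard ways to set Open MPI MCA parameters:

$ msub -I -lnodes=2:ppn=16                        # interactive allocation: 2 nodes, 16 ppn each
$ export OMPI_MCA_btl=self,sm,ugni                # example only: set an MCA parameter via environment variable
$ mpirun -np 32 ./my_app                          # Open MPI's mpirun should now see 2 nodes x 16 slots
$ mpirun --mca btl self,sm,ugni -np 32 ./my_app   # equivalent: pass the MCA parameter on the command line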