FWIW: that has been fixed with the current head of the 1.7 branch (it will be in the 1.7.4 release).

On Mon, Dec 2, 2013 at 2:28 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

Ack, forgot about that. There is a bug in 1.7.3 that breaks one of LANL's
default settings. Just change the line in
contrib/platform/lanl/cray_xe6/optimized-common

from:

    enable_orte_static_ports=no

to:

    enable_orte_static_ports=yes

That should work.

-Nathan
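
A minimal sketch of applying that one-line change and rebuilding, assuming
the openmpi-1.7.3 source tree and the platform file named in Keita's
messages; the prefix and make flags below are only illustrative, not part of
Nathan's instructions:

    # from the top of the openmpi-1.7.3 source tree
    sed -i 's/enable_orte_static_ports=no/enable_orte_static_ports=yes/' \
        contrib/platform/lanl/cray_xe6/optimized-common

    # reconfigure with the same LANL platform file and options as before
    ./configure --with-platform=contrib/platform/lanl/cray_xe6/optimized_lustre \
                --enable-mpirun-prefix-by-default --prefix=$HOME/openmpi
    make -j 8 && make install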

On Wed, Nov 27, 2013 at 08:05:48PM +0000, Teranishi, Keita wrote:

Nathan,

I got a compile-time error (see below). I use a script from
contrib/platform/lanl/cray_xe6 with gcc-4.7.2. Is there any problem in my
environment?

Thanks,
Keita

    CC       oob_tcp.lo
    oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
    oob_tcp.c:358:5: warning: data definition has no type or storage class [enabled by default]
    oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
    oob_tcp.c:358:5: error: conflicting types for 'mca_oob_tcp_ipv4_dynamic_ports'
    oob_tcp.c:140:14: note: previous definition of 'mca_oob_tcp_ipv4_dynamic_ports' was here
    oob_tcp.c:358:38: warning: initialization makes integer from pointer without a cast [enabled by default]
    oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
    oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
    oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
    oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
    oob_tcp.c:385:30: error: expected declaration specifiers or '...' before string constant
    oob_tcp.c:385:48: error: expected declaration specifiers or '...' before 'disable_family_values'
    oob_tcp.c:385:71: error: expected declaration specifiers or '...' before '&' token
    oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
    oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
    oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
    oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
    oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
    make[2]: *** [oob_tcp.lo] Error 1
    make[2]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'

On 11/26/13 3:54 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Alright, everything is identical to Cielito, but it looks like you are
getting bad data from alps.

I think we changed some of the alps parsing for 1.7.3. Can you give that
version a try and let me know if it resolves your issue? If not, I can add
better debugging to the ras/alps module.

-Nathan

On Tue, Nov 26, 2013 at 11:50:00PM +0000, Teranishi, Keita wrote:

Here is what we can see:

    knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
    total 8
    drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05 4.0.36.securitypatch.20111221
    drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
    lrwxrwxrwx 1 root root    6 2013-01-11 15:19 default -> 4.1.40

Thanks,
Keita

On 11/26/13 3:19 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

??? Alps reports that the two nodes each have one slot. What PE release are
you using? A quick way to find out is ls -l /opt/cray/xe-sysroot on the
external login node (this directory does not exist on the internal login
nodes).

-Nathan

On Tue, Nov 26, 2013 at 11:07:36PM +0000, Teranishi, Keita wrote:

Nathan,

Here it is.

Keita

On 11/26/13 3:02 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Ok, that sheds a little more light on the situation. For some reason it
sees 2 nodes, apparently with one slot each. One more set of outputs would
be helpful. Please run with -mca ras_base_verbose 100. That way I can see
what was read from alps.

-Nathan
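
A sketch of how this verbosity setting can be combined with the rmaps and
plm verbosity levels suggested further down the thread and captured to a
file in a single run; the log file name is arbitrary:

    mpirun -np 4 \
        -mca ras_base_verbose 100 \
        -mca rmaps_base_verbose 100 \
        -mca plm_base_verbose 100 \
        ./cpi 2>&1 | tee mpirun-debug.log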

On Tue, Nov 26, 2013 at 10:14:11PM +0000, Teranishi, Keita wrote:

Nathan,

I am hoping these files would help you.

Thanks,
Keita

On 11/26/13 1:41 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Well, no hints as to the error there. Looks identical to the output on my
XE-6. How about setting -mca rmaps_base_verbose 100? See what is going on
with the mapper.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:

Nathan,

Please see the attached, obtained from two cases (-np 2 and -np 4).

Thanks,

--
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738

On 11/26/13 1:26 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Seems like something is going wrong with processor binding. Can you run
with -mca plm_base_verbose 100? It might shed some light on why it thinks
there are not enough slots.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:

Nathan,

I have now removed the strip_prefix setting, which was applied to the other
versions of OpenMPI. I still have the same problem with the msub run.

    knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
    qsub: waiting for job 7754058.sdb to start
    qsub: job 7754058.sdb ready

    knteran@mzlogin01:~> cd test-openmpi/
    knteran@mzlogin01:~/test-openmpi> !mp
    mpicc cpi.c -o cpi
    knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
    ----------------------------------------------------------------------
    There are not enough slots available in the system to satisfy the 4
    slots that were requested by the application:
      ./cpi

    Either request fewer slots for your application, or make more slots
    available for use.
    ----------------------------------------------------------------------

I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.

    knteran@mzlogin01:~/test-openmpi> which mpirun
    /home/knteran/openmpi/bin/mpirun

Thanks,
Keita

On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Weird. That is the same configuration we have deployed on Cielito and
Cielo. Does it work under an msub allocation?

BTW, with that configuration you should not set
plm_base_strip_prefix_from_node_names to 0. That will confuse orte, since
the node hostname will not match what was supplied by alps.

-Nathan
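
Since MCA parameters can also come from OMPI_MCA_* environment variables or
from mca-params.conf files, a quick sketch for confirming the parameter is
no longer being set anywhere; the install prefix is taken from Keita's
"which mpirun" output and is otherwise an assumption:

    env | grep OMPI_MCA_plm_base_strip_prefix_from_node_names
    grep -s strip_prefix ~/.openmpi/mca-params.conf \
        /home/knteran/openmpi/etc/openmpi-mca-params.conf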

On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:

Nathan,

(Please forget about the segfault. It was my mistake.)

I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program. I used
contrib/platform/lanl/cray_xe6/optimized_lustre and
--enable-mpirun-prefix-by-default for configuration. As I said, it works
fine with aprun, but fails with mpirun/mpiexec.

    knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
    ----------------------------------------------------------------------
    There are not enough slots available in the system to satisfy the 4
    slots that were requested by the application:
      ./a.out

    Either request fewer slots for your application, or make more slots
    available for use.
    ----------------------------------------------------------------------

Thanks,
Keita

On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Ok, that should have worked. I just double-checked it to be sure.

    ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
    App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
    ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$

How did you configure Open MPI, and what version are you using?

-Nathan
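
A small sketch of gathering that information; the ompi_info path matches
Keita's install prefix, and running the grep from the build tree is an
assumption:

    ~/openmpi/bin/ompi_info | head -n 10    # reports the Open MPI version in use
    grep '\$ ./configure' config.log        # run in the build tree; shows the configure line used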

On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:

Hi Nathan,

I tried the qsub option you suggested:

    mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi
    ----------------------------------------------------------------------
    There are not enough slots available in the system to satisfy the 4
    slots that were requested by the application:
      ./cpi

    Either request fewer slots for your application, or make more slots
    available for use.
    ----------------------------------------------------------------------

Here is what I got from aprun:

    aprun -n 32 ./cpi
    Process 8 of 32 is on nid00011
    Process 5 of 32 is on nid00011
    Process 12 of 32 is on nid00011
    Process 9 of 32 is on nid00011
    Process 11 of 32 is on nid00011
    Process 13 of 32 is on nid00011
    Process 0 of 32 is on nid00011
    Process 6 of 32 is on nid00011
    Process 3 of 32 is on nid00011
    :
    :

Also, I found a strange error at the end of the program (MPI_Finalize?).
Can you tell me what is wrong with that?

    [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
    [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
    [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
    [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
    [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
    [nid00010:23511] [ 5] ./cpi() [0x400e23]
    [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
    [nid00010:23511] [ 7] ./cpi() [0x400b09]

Thanks,
Keita

On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

Just talked with our local Cray rep. Sounds like that torque syntax is
broken. You can continue to use qsub (though qsub use is strongly
discouraged) if you use the msub options.

Ex:

    qsub -lnodes=2:ppn=16

Works.

-Nathan
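
A small sketch of sanity-checking what the batch system actually granted
before launching; $PBS_NODEFILE is standard Torque and likely points at the
same aux file catted below, but this check is not part of Nathan's
instructions:

    # inside the interactive allocation
    env | grep -E 'PBS_(NUM_NODES|NUM_PPN|JOBID)'
    sort $PBS_NODEFILE | uniq -c    # one line per allocated slot, grouped by node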

On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:

Hmm, this seems like either a bug in qsub (torque is full of serious bugs)
or a bug in alps. I got an allocation using that command and alps only sees
1 node:

    [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
    [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
    [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
    [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
    [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
    [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
    [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
    [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
    [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
    [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
    [ct-login1.localdomain:06010] ras:alps:allocate: success
    [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
    [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29

    ======================   ALLOCATED NODES   ======================
    Data for node: 29    Num slots: 16    Max slots: 0

    =================================================================

Torque also shows only one node with 16 PPN:

    $ env | grep PBS
    ...
    PBS_NUM_PPN=16

    $ cat /var/spool/torque/aux//915289.sdb
    login1

Which is wrong! I will have to ask Cray what is going on here. I recommend
you switch to msub to get an allocation; Moab has fewer bugs. I can't even
get aprun to work:

    $ aprun -n 2 -N 1 hostname
    apsched: claim exceeds reservation's node-count

    $ aprun -n 32 hostname
    apsched: claim exceeds reservation's node-count

To get an interactive session with 2 nodes and 16 ppn on each, run:

    msub -I -lnodes=2:ppn=16

Open MPI should then work correctly.

-Nathan Hjelm
HPC-5, LANL
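
For a non-interactive job, a minimal batch-script sketch using the same
resource syntax; the #PBS directive form, walltime, script name, and paths
are assumptions, not part of Nathan's instructions:

    #!/bin/bash
    #PBS -l nodes=2:ppn=16
    #PBS -l walltime=00:10:00

    cd $PBS_O_WORKDIR                      # directory the job was submitted from
    export PATH=$HOME/openmpi/bin:$PATH    # use the locally built Open MPI
    export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH

    mpirun -np 32 ./cpi                    # 2 nodes x 16 slots

Submitted with "msub job.sh".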

On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:

Hi,

I installed OpenMPI on our small XE6 using the configure options under the
/contrib directory. It appears to be working fine, but it ignores MCA
parameters (set in env vars), so I switched to mpirun (in OpenMPI), which
can handle MCA parameters somehow. However, mpirun fails to allocate
processes by cores. For example, when I allocated 32 cores (on 2 nodes) by
"qsub -lmppwidth=32 -lmppnppn=16", mpirun recognizes it as 2 slots. Is it
possible for mpirun to handle the multicore nodes of the XE6 properly, or
are there any options to handle MCA parameters for aprun?

Regards,

--
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users