Nathan,

I got a compile-time error when building 1.7.3 (see below). I used the platform script from contrib/platform/lanl/cray_xe6 with gcc-4.7.2. Is there any problem in my environment?
Thanks,
Keita

  CC       oob_tcp.lo
oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
oob_tcp.c:358:5: warning: data definition has no type or storage class [enabled by default]
oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
oob_tcp.c:358:5: error: conflicting types for 'mca_oob_tcp_ipv4_dynamic_ports'
oob_tcp.c:140:14: note: previous definition of 'mca_oob_tcp_ipv4_dynamic_ports' was here
oob_tcp.c:358:38: warning: initialization makes integer from pointer without a cast [enabled by default]
oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
oob_tcp.c:385:30: error: expected declaration specifiers or '...' before string constant
oob_tcp.c:385:48: error: expected declaration specifiers or '...' before 'disable_family_values'
oob_tcp.c:385:71: error: expected declaration specifiers or '...' before '&' token
oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
make[2]: *** [oob_tcp.lo] Error 1
make[2]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'


On 11/26/13 3:54 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:

>Alright, everything is identical to Cielito but it looks like you are
>getting bad data from alps.
>
>I think we changed some of the alps parsing for 1.7.3. Can you give that
>version a try and let me know if it resolves your issue. If not I can add
>better debugging to the ras/alps module.
>
>-Nathan
>
>On Tue, Nov 26, 2013 at 11:50:00PM +0000, Teranishi, Keita wrote:
>> Here is what we can see:
>>
>> knteran@mzlogin01e:~> ls -l /opt/cray/xe-sysroot
>> total 8
>> drwxr-xr-x 6 bin  bin  4096 2012-02-04 11:05 4.0.36.securitypatch.20111221
>> drwxr-xr-x 6 bin  bin  4096 2013-01-11 15:17 4.1.40
>> lrwxrwxrwx 1 root root    6 2013-01-11 15:19 default -> 4.1.40
>>
>> Thanks,
>> Keita
>>
>> On 11/26/13 3:19 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>>
>> >??? Alps reports that the two nodes each have one slot. What PE release
>> >are you using? A quick way to find out is ls -l /opt/cray/xe-sysroot on
>> >the external login node (this directory does not exist on the internal
>> >login nodes.)
>> >
>> >-Nathan
>> >
>> >On Tue, Nov 26, 2013 at 11:07:36PM +0000, Teranishi, Keita wrote:
>> >> Nathan,
>> >>
>> >> Here it is.
>> >>
>> >> Keita
>> >>
>> >> On 11/26/13 3:02 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >>
>> >> >Ok, that sheds a little more light on the situation. For some reason
>> >> >it sees 2 nodes, apparently with one slot each. One more set of
>> >> >outputs would be helpful. Please run with -mca ras_base_verbose 100 .
>> >> >That way I can see what was read from alps.
>> >> >
>> >> >-Nathan
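(For reference, the kind of run Nathan is asking for here would look roughly like the following; the executable name and the log file name are placeholders, not taken from the original mails:

    mpirun -np 4 --mca ras_base_verbose 100 ./cpi 2>&1 | tee ras_verbose.log

The same pattern works for the other verbosity knobs used elsewhere in this thread, e.g. --mca rmaps_base_verbose 100 and --mca plm_base_verbose 100, so the allocation, mapping, and launch stages can all be captured in one log.)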
>> >> >
>> >> >On Tue, Nov 26, 2013 at 10:14:11PM +0000, Teranishi, Keita wrote:
>> >> >> Nathan,
>> >> >>
>> >> >> I am hoping these files would help you.
>> >> >>
>> >> >> Thanks,
>> >> >> Keita
>> >> >>
>> >> >> On 11/26/13 1:41 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >>
>> >> >> >Well, no hints as to the error there. Looks identical to the output
>> >> >> >on my XE-6. How about setting -mca rmaps_base_verbose 100 . See what
>> >> >> >is going on with the mapper.
>> >> >> >
>> >> >> >-Nathan Hjelm
>> >> >> >Application Readiness, HPC-5, LANL
>> >> >> >
>> >> >> >On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:
>> >> >> >> Nathan,
>> >> >> >>
>> >> >> >> Please see the attached output obtained from two cases (-np 2 and
>> >> >> >> -np 4).
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>
>> >> >> >> --------------------------------------------------------
>> >> >> >> Keita Teranishi
>> >> >> >> Principal Member of Technical Staff
>> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> Sandia National Laboratories
>> >> >> >> Livermore, CA 94551
>> >> >> >> +1 (925) 294-3738
>> >> >> >>
>> >> >> >> On 11/26/13 1:26 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >>
>> >> >> >> >Seems like something is going wrong with processor binding. Can
>> >> >> >> >you run with -mca plm_base_verbose 100 . Might shed some light on
>> >> >> >> >why it thinks there are not enough slots.
>> >> >> >> >
>> >> >> >> >-Nathan Hjelm
>> >> >> >> >Application Readiness, HPC-5, LANL
>> >> >> >> >
>> >> >> >> >On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> Nathan,
>> >> >> >> >>
>> >> >> >> >> Now I removed the strip_prefix stuff, which was applied to the
>> >> >> >> >> other versions of OpenMPI. I still have the same problem with
>> >> >> >> >> mpirun under the msub allocation.
>> >> >> >> >>
>> >> >> >> >> knteran@mzlogin01:~> msub -lnodes=2:ppn=16 -I
>> >> >> >> >> qsub: waiting for job 7754058.sdb to start
>> >> >> >> >> qsub: job 7754058.sdb ready
>> >> >> >> >>
>> >> >> >> >> knteran@mzlogin01:~> cd test-openmpi/
>> >> >> >> >> knteran@mzlogin01:~/test-openmpi> !mp
>> >> >> >> >> mpicc cpi.c -o cpi
>> >> >> >> >> knteran@mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
>> >> >> >> >> --------------------------------------------------------------
>> >> >> >> >> There are not enough slots available in the system to satisfy
>> >> >> >> >> the 4 slots that were requested by the application:
>> >> >> >> >>   ./cpi
>> >> >> >> >>
>> >> >> >> >> Either request fewer slots for your application, or make more
>> >> >> >> >> slots available for use.
>> >> >> >> >> --------------------------------------------------------------
>> >> >> >> >>
>> >> >> >> >> I set PATH and LD_LIBRARY_PATH to match with my own OpenMPI
>> >> >> >> >> installation.
>> >> >> >> >> knteran@mzlogin01:~/test-openmpi> which mpirun
>> >> >> >> >> /home/knteran/openmpi/bin/mpirun
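(A quick way to confirm that the shell really picks up the private installation rather than a system copy is something along these lines; the prefix is assumed to be ~/openmpi, matching the `which mpirun` output above:

    export PATH=$HOME/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
    which mpirun      # expect /home/knteran/openmpi/bin/mpirun
    ompi_info | head  # reports the Open MPI version actually in use

ompi_info --all should also record how that build was configured, which is handy when comparing against the platform file.)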
>> >> >> >> >>
>> >> >> >> >> Thanks,
>> >> >> >> >>
>> >> >> >> >> --------------------------------------------------------
>> >> >> >> >> Keita Teranishi
>> >> >> >> >> Principal Member of Technical Staff
>> >> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> >> Sandia National Laboratories
>> >> >> >> >> Livermore, CA 94551
>> >> >> >> >> +1 (925) 294-3738
>> >> >> >> >>
>> >> >> >> >> On 11/26/13 12:52 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> >>
>> >> >> >> >> >Weird. That is the same configuration we have deployed on
>> >> >> >> >> >Cielito and Cielo. Does it work under an msub allocation?
>> >> >> >> >> >
>> >> >> >> >> >BTW, with that configuration you should not set
>> >> >> >> >> >plm_base_strip_prefix_from_node_names to 0. That will confuse
>> >> >> >> >> >orte since the node hostname will not match what was supplied
>> >> >> >> >> >by alps.
>> >> >> >> >> >
>> >> >> >> >> >-Nathan
>> >> >> >> >> >
>> >> >> >> >> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> >> Nathan,
>> >> >> >> >> >>
>> >> >> >> >> >> (Please forget about the segfault. It was my mistake.)
>> >> >> >> >> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program.
>> >> >> >> >> >> I used contrib/platform/lanl/cray_xe6/optimized_lustre and
>> >> >> >> >> >> --enable-mpirun-prefix-by-default for configuration. As I
>> >> >> >> >> >> said, it works fine with aprun, but fails with mpirun/mpiexec.
>> >> >> >> >> >>
>> >> >> >> >> >> knteran@mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
>> >> >> >> >> >> --------------------------------------------------------------
>> >> >> >> >> >> There are not enough slots available in the system to satisfy
>> >> >> >> >> >> the 4 slots that were requested by the application:
>> >> >> >> >> >>   ./a.out
>> >> >> >> >> >>
>> >> >> >> >> >> Either request fewer slots for your application, or make more
>> >> >> >> >> >> slots available for use.
>> >> >> >> >> >> --------------------------------------------------------------
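(For completeness, the configuration described above would be produced by something like the following; the install prefix, source directory, and parallel make level are illustrative, only the platform file and --enable-mpirun-prefix-by-default come from the mail:

    cd ~/openmpi-1.7.2
    ./configure --with-platform=contrib/platform/lanl/cray_xe6/optimized_lustre \
                --enable-mpirun-prefix-by-default --prefix=$HOME/openmpi
    make -j 8 && make install

The same recipe applied to the 1.7.3 tarball is what later trips over the oob_tcp.c compile error quoted at the top of this message.)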
>> >> >> >> >> >>
>> >> >> >> >> >> Thanks,
>> >> >> >> >> >>
>> >> >> >> >> >> --------------------------------------------------------
>> >> >> >> >> >> Keita Teranishi
>> >> >> >> >> >> Principal Member of Technical Staff
>> >> >> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> >> >> Sandia National Laboratories
>> >> >> >> >> >> Livermore, CA 94551
>> >> >> >> >> >> +1 (925) 294-3738
>> >> >> >> >> >>
>> >> >> >> >> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> >Ok, that should have worked. I just double-checked it to be sure.
>> >> >> >> >> >> >
>> >> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
>> >> >> >> >> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
>> >> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
>> >> >> >> >> >> >
>> >> >> >> >> >> >How did you configure Open MPI and what version are you using?
>> >> >> >> >> >> >
>> >> >> >> >> >> >-Nathan
>> >> >> >> >> >> >
>> >> >> >> >> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> >> >> Hi Nathan,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I tried the qsub option you suggested:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
>> >> >> >> >> >> >> --------------------------------------------------------------
>> >> >> >> >> >> >> There are not enough slots available in the system to
>> >> >> >> >> >> >> satisfy the 4 slots that were requested by the application:
>> >> >> >> >> >> >>   ./cpi
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Either request fewer slots for your application, or make
>> >> >> >> >> >> >> more slots available for use.
>> >> >> >> >> >> >> --------------------------------------------------------------
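(A side note on syntax: mpirun's MCA flag takes the parameter name and its value as two separate arguments, so the trailing '=' in the command above looks suspicious. The usual form would be

    mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi

although, as Nathan notes in his 12:52 PM reply quoted above, this particular parameter is best left unset on this system anyway.)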
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Here is what I got from aprun:
>> >> >> >> >> >> >> aprun -n 32 ./cpi
>> >> >> >> >> >> >> Process 8 of 32 is on nid00011
>> >> >> >> >> >> >> Process 5 of 32 is on nid00011
>> >> >> >> >> >> >> Process 12 of 32 is on nid00011
>> >> >> >> >> >> >> Process 9 of 32 is on nid00011
>> >> >> >> >> >> >> Process 11 of 32 is on nid00011
>> >> >> >> >> >> >> Process 13 of 32 is on nid00011
>> >> >> >> >> >> >> Process 0 of 32 is on nid00011
>> >> >> >> >> >> >> Process 6 of 32 is on nid00011
>> >> >> >> >> >> >> Process 3 of 32 is on nid00011
>> >> >> >> >> >> >> :
>> >> >> >> >> >> >> :
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Also, I found a strange error at the end of the program
>> >> >> >> >> >> >> (MPI_Finalize?). Can you tell me what is wrong with that?
>> >> >> >> >> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
>> >> >> >> >> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
>> >> >> >> >> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
>> >> >> >> >> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
>> >> >> >> >> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
>> >> >> >> >> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
>> >> >> >> >> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
>> >> >> >> >> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Thanks,
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> --------------------------------------------------------
>> >> >> >> >> >> >> Keita Teranishi
>> >> >> >> >> >> >> Principal Member of Technical Staff
>> >> >> >> >> >> >> Scalable Modeling and Analysis Systems
>> >> >> >> >> >> >> Sandia National Laboratories
>> >> >> >> >> >> >> Livermore, CA 94551
>> >> >> >> >> >> >> +1 (925) 294-3738
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hje...@lanl.gov> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> >Just talked with our local Cray rep. Sounds like that
>> >> >> >> >> >> >> >torque syntax is broken. You can continue to use qsub
>> >> >> >> >> >> >> >(though qsub use is strongly discouraged) if you use the
>> >> >> >> >> >> >> >msub options.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Ex:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >qsub -lnodes=2:ppn=16
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >Works.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >-Nathan
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
>> >> >> >> >> >> >> >> Hmm, this seems like either a bug in qsub (torque is full
>> >> >> >> >> >> >> >> of serious bugs) or a bug in alps. I got an allocation
>> >> >> >> >> >> >> >> using that command and alps only sees 1 node:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
>> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> ======================   ALLOCATED NODES   ======================
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>  Data for node: 29      Num slots: 16   Max slots: 0
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> =================================================================
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Torque also shows only one node with 16 PPN:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> $ env | grep PBS
>> >> >> >> >> >> >> >> ...
>> >> >> >> >> >> >> >> PBS_NUM_PPN=16
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
>> >> >> >> >> >> >> >> login1
>> >> >> >> >> >> >> >>
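(Another quick check under Torque is the nodefile itself, which the standard $PBS_NODEFILE variable points at; this is generic Torque usage, not something shown in the original mails:

    echo $PBS_NODEFILE
    cat $PBS_NODEFILE               # typically one line per allocated core
    sort -u $PBS_NODEFILE | wc -l   # number of distinct nodes

For the allocation above it would just have confirmed the single "login1" entry that alps and the PBS variables already show.)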
>> >> >> >> >> >> >> >> Which is wrong! I will have to ask Cray what is going
>> >> >> >> >> >> >> >> on here. I recommend you switch to msub to get an
>> >> >> >> >> >> >> >> allocation. Moab has fewer bugs. I can't even get aprun
>> >> >> >> >> >> >> >> to work:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> $ aprun -n 2 -N 1 hostname
>> >> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> $ aprun -n 32 hostname
>> >> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> To get an interactive session with 2 nodes and 16 ppn on
>> >> >> >> >> >> >> >> each, run:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> msub -I -lnodes=2:ppn=16
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Open MPI should then work correctly.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> -Nathan Hjelm
>> >> >> >> >> >> >> >> HPC-5, LANL
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
>> >> >> >> >> >> >> >> > Hi,
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > I installed OpenMPI on our small XE6 using the configure
>> >> >> >> >> >> >> >> > options under the /contrib directory. It appears it is
>> >> >> >> >> >> >> >> > working fine, but it ignores MCA parameters (set in env
>> >> >> >> >> >> >> >> > var). So I switched to mpirun (in OpenMPI) and it can
>> >> >> >> >> >> >> >> > handle MCA parameters somehow. However, mpirun fails to
>> >> >> >> >> >> >> >> > allocate processes by cores. For example, I allocated 32
>> >> >> >> >> >> >> >> > cores (on 2 nodes) by "qsub -lmppwidth=32 -lmppnppn=16",
>> >> >> >> >> >> >> >> > but mpirun recognizes it as 2 slots. Is it possible for
>> >> >> >> >> >> >> >> > mpirun to handle multicore nodes of the XE6 properly, or
>> >> >> >> >> >> >> >> > is there any option to handle MCA parameters for aprun?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > Regards,
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > --------------------------------------------------------
>> >> >> >> >> >> >> >> > Keita Teranishi
>> >> >> >> >> >> >> >> > Principal Member of Technical Staff
>> >> >> >> >> >> >> >> > Scalable Modeling and Analysis Systems
>> >> >> >> >> >> >> >> > Sandia National Laboratories
>> >> >> >> >> >> >> >> > Livermore, CA 94551
>> >> >> >> >> >> >> >> > +1 (925) 294-3738
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > _______________________________________________
>> >> >> >> >> >> >> >> > users mailing list
>> >> >> >> >> >> >> >> > us...@open-mpi.org
>> >> >> >> >> >> >> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
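(On the last question in that original mail: with a direct aprun launch there is no mpirun command line to attach --mca options to, but Open MPI also reads MCA parameters from OMPI_MCA_<param_name> environment variables and from $HOME/.openmpi/mca-params.conf. A hypothetical example, reusing a parameter named elsewhere in this thread:

    export OMPI_MCA_ras_base_verbose=100
    aprun -n 32 ./cpi

Since the original mail reports environment variables apparently being ignored under aprun, the mca-params.conf route may be worth trying as well; it takes one "name = value" line per parameter, e.g. "ras_base_verbose = 100".)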
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users