Sorry to pester with questions, but I'm trying to narrow down the issue.

* What kind of chips are on these machines?
* If they have h/w threads, are they enabled?
* You might have lstopo on one of those machines - could you pass along its output? Otherwise, you can run a simple "mpirun -n 1 -mca ess_base_verbose 20 hostname" and it will print it out. You only need one node in your allocation, as we don't need a fountain of output.

I'll look into the segfault - it's hard to understand offhand, but it could be an uninitialized variable. If you have a chance, could you rerun that test with "-mca plm_base_verbose 10" on the cmd line?

Thanks again
Ralph

On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddi...@purdue.edu> wrote:

> Thanks for the reply. I tried out the --display-allocation option with
> several different combinations and have attached the output. I see
> this behavior on RHEL6.4, RHEL6.5, and RHEL5.10 clusters.
>
>
> Here's debugging info on the segfault. Does that help? FWIW, this does
> not seem to crash on the RHEL5 or RHEL6.5 clusters. It just crashes
> on RHEL6.4.
>
> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623 `which mpirun`
> No symbol table is loaded. Use the "file" command.
> GNU gdb (GDB) 7.5-1.3.187
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-unknown-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from
> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done.
> [New LWP 22623]
> [New LWP 22624]
>
> warning: Can't read pathname for load map: Input/output error.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>     args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
> 422         node->hostid = node->daemon->name.vpid;
> (gdb) bt
> #0  0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1,
>     args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422
> #1  0x00002acc60eec145 in opal_libevent2021_event_base_loop () from
>     /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6
> #2  0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at orterun.c:1077
> #3  0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at main.c:13
>
> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes
> conte-a009
> conte-a009
> conte-a055
> conte-a055
> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ uname -r
> 2.6.32-358.14.1.el6.x86_64
>
> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddi...@purdue.edu> wrote:
>>
>>> Hello all,
>>>
>>> I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
>>> codes. In OMPI 1.6.3, I can do:
>>>
>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>>
>>> I get one rank bound to procs 0-7 and the other bound to 8-15. Great!
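
(The source of the ./hello test program isn't shown in the thread. A minimal hybrid MPI/OpenMP sketch along the following lines - file name, contents, and build line are all assumptions - is enough to confirm that kind of binding, since each OpenMP thread can report which CPU it is running on:)

    /* hello_affinity.c - hypothetical stand-in for the ./hello used above.
     * Build (assuming MPI compiler wrappers): mpicc -fopenmp hello_affinity.c -o hello
     * Each rank's threads should report CPUs inside that rank's bound set,
     * e.g. 0-7 for rank 0 and 8-15 for rank 1 with 8 cpus per rank. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* sched_getcpu() returns the CPU this thread is currently on (Linux/glibc) */
            printf("rank %d thread %d running on cpu %d\n",
                   rank, omp_get_thread_num(), sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }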
>>>
>>> But I'm having some difficulties doing this with openmpi 1.8.1:
>>>
>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>>> --------------------------------------------------------------------------
>>> The following command line options and corresponding MCA parameter have
>>> been deprecated and replaced as follows:
>>>
>>>   Command line options:
>>>     Deprecated:  --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, -cpus-per-rank
>>>     Replacement: --map-by <obj>:PE=N
>>>
>>>   Equivalent MCA parameter:
>>>     Deprecated:  rmaps_base_cpus_per_proc
>>>     Replacement: rmaps_base_mapping_policy=<obj>:PE=N
>>>
>>> The deprecated forms *will* disappear in a future version of Open MPI.
>>> Please update to the new syntax.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 2 slots
>>> that were requested by the application:
>>>   ./hello
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --------------------------------------------------------------------------
>>>
>>> OK, let me try the new syntax...
>>>
>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 2 slots
>>> that were requested by the application:
>>>   ./hello
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>> --------------------------------------------------------------------------
>>>
>>> What am I doing wrong? The documentation on these new options is
>>> somewhat sparse and confusing, so I'm probably missing something. If
>>> anyone could provide some pointers here, it'd be much appreciated! If
>>> it's not something simple and you need config logs and such, please let
>>> me know.
>>
>> Looks like we think there are fewer than 16 slots allocated on that node.
>> What is in this "nodes" file? Without it, OMPI should read the Torque
>> allocation directly. You might check what we think the allocation is by
>> adding --display-allocation to the cmd line.
>>
>>>
>>> As a side note -
>>>
>>> If I try this using the PBS nodefile with the above, I get a confusing
>>> message:
>>>
>>> --------------------------------------------------------------------------
>>> A request for multiple cpus-per-proc was given, but a directive
>>> was also give to map to an object level that has less cpus than
>>> requested ones:
>>>
>>>   #cpus-per-proc:  8
>>>   number of cpus:  1
>>>   map-by:          BYCORE:NOOVERSUBSCRIBE
>>>
>>> Please specify a mapping level that has more cpus, or else let us
>>> define a default mapping that will allow multiple cpus-per-proc.
>>> --------------------------------------------------------------------------
>>>
>>> From what I've gathered, this is because I have a node listed 16 times
>>> in my PBS nodefile, so it's then assuming I have 1 core per node?
>>
>>
>> No - if it's listed 16 times, it should compute 16 slots. Try adding
>> --display-allocation to your cmd line and it should tell you how many slots
>> are present.
>>
>> However, it doesn't assume there is a core for each slot. Instead, it
>> detects the cores directly on the node. It sounds like it isn't seeing them
>> for some reason. What OS are you running on that node?
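
(For reference: combining Ralph's --display-allocation suggestion with the replacement syntax from the deprecation notice gives a test along these lines. The choice of mapping object (slot here) and the --report-bindings flag are not settled by the thread, so treat this as a sketch rather than the confirmed fix:

    $ mpirun -np 2 --map-by slot:pe=8 --display-allocation --report-bindings ./hello
)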
>>
>> FWIW: the 1.6 series has a different detection system for cores. Could be
>> something is causing problems for the new one.
>>
>>> Some better documentation here would be helpful. I haven't been able to
>>> figure out how to use the "oversubscribe" option listed in the docs.
>>> Not that I really want to oversubscribe, of course; I need to modify
>>> the nodefile. But this just stumped me for a while, as 1.6.3 didn't
>>> have this behavior.
>>>
>>>
>>> As an extra bonus, I get a segfault in this situation:
>>>
>>> $ mpirun -np 2 -machinefile ./nodes ./hello
>>> [conte-a497:13255] *** Process received signal ***
>>> [conte-a497:13255] Signal: Segmentation fault (11)
>>> [conte-a497:13255] Signal code: Address not mapped (1)
>>> [conte-a497:13255] Failing at address: 0x2c
>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500]
>>> [conte-a497:13255] [ 1] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2ba960a59015]
>>> [conte-a497:13255] [ 2] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715]
>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f]
>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4]
>>> [conte-a497:13255] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x3a1bc1ecdd]
>>> [conte-a497:13255] [ 6] mpirun[0x404719]
>>> [conte-a497:13255] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>
>> Huh - that's odd. Could you perhaps configure OMPI with --enable-debug and
>> gdb the core file to tell us the line numbers involved?
>>
>> Thanks
>> Ralph
>>
>>>
>>> My "nodes" file simply contains the first two lines of my original
>>> $PBS_NODEFILE provided by Torque. See above for why I modified it. It
>>> works fine if I use the full file.
>>>
>>>
>>>
>>> Thanks in advance for any pointers you all may have!
>>>
>>> Dan
>>>
>>>
>>> --
>>> Dan Dietz
>>> Scientific Applications Analyst
>>> ITaP Research Computing, Purdue University
>
> --
> Dan Dietz
> Scientific Applications Analyst
> ITaP Research Computing, Purdue University
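
(On the segfault itself: "Failing at address: 0x2c" is a small offset rather than 0x0, which is consistent with node->daemon being NULL when line 422 of base/plm_base_launch_support.c reads node->daemon->name.vpid - i.e., the kind of uninitialized field Ralph suspects. Below is a small, self-contained C illustration of why a NULL-pointer member access faults at the member's offset; the struct layout is a stand-in chosen so the arithmetic lands on 0x2c, not the verified layout of the real ORTE types:)

    /* null_offset.c - illustrates why dereferencing a member through a NULL
     * struct pointer reports a small faulting address (the member's offset)
     * instead of 0x0.  The types below are stand-ins, not the real orte_proc_t. */
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } name_t;

    typedef struct {
        char other_fields[40];  /* stand-in for whatever precedes "name" */
        name_t name;
    } proc_t;

    int main(void)
    {
        proc_t *daemon = NULL;  /* an unset daemon pointer, as suspected in the thread */
        size_t off = offsetof(proc_t, name) + offsetof(name_t, vpid);

        /* With this stand-in layout, off == 44 == 0x2c, matching the trace above;
         * reading daemon->name.vpid would therefore fault at address 0x2c. */
        printf("a NULL->name.vpid access would fault at address 0x%zx\n", off);

        (void)daemon;  /* not dereferenced here, so the example runs cleanly */
        return 0;
    }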