It's a good idea to provide a default setting for the pe modifier.
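For example, with that default in place the old and new forms discussed below would look roughly like this (just a sketch — ./hello and ./nodes are the test program and machinefile from this thread):

# old syntax, deprecated in the 1.8 series:
mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello

# new syntax with an explicit mapping object:
mpirun -np 2 --map-by socket:pe=8 -machinefile ./nodes ./hello

# new syntax leaving the mapping object at its default (NUMA):
mpirun -np 2 --map-by :pe=8 -machinefile ./nodes ./hello
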
Okay, I can take a look to review but a bit busy now, so please give me a few days. Regards, Tetsuya > Okay, I revised the command line option to be a little more user-friendly. You can now specify the equivalent of the old --cpus-per-proc as just "--map-by :pe=N", leaving the mapping policy set as > the default. We will default to NUMA so the cpus will all be in the same NUMA region, if possible, thus providing better performance. > > Scheduled this for 1.8.2, asking Tetsuya to review. > > On Jun 6, 2014, at 6:25 PM, Ralph Castain <r...@open-mpi.org> wrote: > > > Hmmm....Tetsuya is quite correct. Afraid I got distracted by the segfault (still investigating that one). Our default policy for 2 processes is to map-by core, and that would indeed fail when > cpus-per-proc > 1. However, that seems like a non-intuitive requirement, so let me see if I can make this be a little more user-friendly. > > > > > > On Jun 6, 2014, at 2:25 PM, tmish...@jcity.maeda.co.jp wrote: > > > >> > >> > >> Hi Dan, > >> > >> Please try: > >> mpirun -np 2 --map-by socket:pe=8 ./hello > >> or > >> mpirun -np 2 --map-by slot:pe=8 ./hello > >> > >> You can not bind 8 cpus to the object "core" which has > >> only one cpu. This limitation started from 1.8 series. > >> > >> The objcet "socket" has 8 cores in your case. So you > >> can do it. And, the object "slot" is almost same as the > >> "core" but it can exceed the limitation, because it's a > >> fictitious object which has no size. > >> > >> Regards, > >> Tetsuya Mishima > >> > >> > >>> No problem - > >>> > >>> These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips. > >>> 2 per node, 8 cores each. No threading enabled. > >>> > >>> $ lstopo > >>> Machine (64GB) > >>> NUMANode L#0 (P#0 32GB) > >>> Socket L#0 + L3 L#0 (20MB) > >>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 > >> (P#0) > >>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 > >> (P#1) > >>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 > >> (P#2) > >>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 > >> (P#3) > >>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 > >> (P#4) > >>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 > >> (P#5) > >>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 > >> (P#6) > >>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 > >> (P#7) > >>> HostBridge L#0 > >>> PCIBridge > >>> PCI 1000:0087 > >>> Block L#0 "sda" > >>> PCIBridge > >>> PCI 8086:2250 > >>> PCIBridge > >>> PCI 8086:1521 > >>> Net L#1 "eth0" > >>> PCI 8086:1521 > >>> Net L#2 "eth1" > >>> PCIBridge > >>> PCI 102b:0533 > >>> PCI 8086:1d02 > >>> NUMANode L#1 (P#1 32GB) > >>> Socket L#1 + L3 L#1 (20MB) > >>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 > >> (P#8) > >>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 > >> (P#9) > >>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 > >>> + PU L#10 (P#10) > >>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 > >>> + PU L#11 (P#11) > >>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 > >>> + PU L#12 (P#12) > >>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 > >>> + PU L#13 (P#13) > >>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 > >>> + PU L#14 (P#14) > >>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 > >>> + PU L#15 (P#15) > >>> 
HostBridge L#5 > >>> PCIBridge > >>> PCI 15b3:1011 > >>> Net L#3 "ib0" > >>> OpenFabrics L#4 "mlx5_0" > >>> PCIBridge > >>> PCI 8086:2250 > >>> > >>> From the segfault below. I tried reproducing the crash on less than an > >>> 4 node allocation but wasn't able to. > >>> > >>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ mpirun -np 2 > >>> -machinefile ./nodes -mca plm_base_verbose 10 ./hello > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> registering plm components > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> found loaded component isolated > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> component isolated has no register or open function > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> found loaded component slurm > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> component slurm register function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> found loaded component rsh > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> component rsh register function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> found loaded component tm > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: > >>> component tm register function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: opening > >>> plm components > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found > >>> loaded component isolated > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: > >>> component isolated open function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found > >>> loaded component slurm > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: > >>> component slurm open function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found > >>> loaded component rsh > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: > >>> component rsh open function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found > >>> loaded component tm > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: > >>> component tm open function successful > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select: Auto-selecting plm > >>> components > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying > >>> component [isolated] > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of > >>> component [isolated] set priority to 0 > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying > >>> component [slurm] > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Skipping > >>> component [slurm]. 
Query failed to return a module > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying > >>> component [rsh] > >>> [conte-a009.rcac.purdue.edu:55685] [[INVALID],INVALID] plm:rsh_lookup > >>> on agent ssh : rsh path NULL > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of > >>> component [rsh] set priority to 10 > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying > >>> component [tm] > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of > >>> component [tm] set priority to 75 > >>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Selected > >>> component [tm] > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component isolated > >> closed > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading > >>> component isolated > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component slurm > >> closed > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component > >> slurm > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component rsh closed > >>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading component > >> rsh > >>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: initial bias > >>> 55685 nodename hash 3965217721 > >>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: final jobfam > >> 24164 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:receive start > >> comm > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_job > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm > >> creating map > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm add > >>> new daemon [[24164,0],1] > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm > >>> assigning new daemon [[24164,0],1] to node conte-a055 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching vm > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: final top-level > >> argv: > >>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid > >>> <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri > >>> > >> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" > >> > >>> -mca plm_base_verbose 10 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting > >>> LD_LIBRARY_PATH: > >>> > /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/usr/ > >> > >>> > >> > pbs/lib:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/a > >> > >>> > >> pps/rhel6/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144 > >> > >>> /tbb/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/lib64 > >> > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting > >>> PATH: > >>> > /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/apps > >> > >>> > 
/rhel6/intel/composer_xe_2013_sp1.2.144/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64 > >> > >>> :/apps/rhel6/intel/opencl-1.2-3.2.1.16712/bin:/usr/lib64/qt-3.3 > >>> > /bin:/opt/moab/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/ibutils/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts:/usr/site/rcac/support_scripts:/usr/site/ > >> > >>> rcac/bin:/usr/site/rcac/sbin:/usr/sbin > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching on > >>> node conte-a055 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: executing: > >>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid 1 > >>> -mca orte_ess_num_procs 2 -mca orte_hnp_uri > >>> > >> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" > >> > >>> -mca plm_base_verbose 10 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm:launch: > >>> finished spawning orteds > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: > >>> registering plm components > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: > >>> found loaded component rsh > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: > >>> component rsh register function successful > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: opening > >>> plm components > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: found > >>> loaded component rsh > >>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: > >>> component rsh open function successful > >>> [conte-a055.rcac.purdue.edu:32094] mca:base:select: Auto-selecting plm > >>> components > >>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Querying > >>> component [rsh] > >>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_lookup on > >>> agent ssh : rsh path NULL > >>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Query of > >>> component [rsh] set priority to 10 > >>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Selected > >>> component [rsh] > >>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_setup on > >>> agent ssh : rsh path NULL > >>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive start > >> comm > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] > >>> plm:base:orted_report_launch from daemon [[24164,0],1] > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] > >>> plm:base:orted_report_launch from daemon [[24164,0],1] on node > >>> conte-a055 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY > >>> FROM NODE conte-a055 > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] NEW TOPOLOGY - ADDING > >>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] > >>> plm:base:orted_report_launch completed for daemon [[24164,0],1] at > >>> contact > >> 1583611904.1;tcp://172.18.96.95,172.31.1.254,172.31.2.254,172.18.112.95:58312 > >> > >>> [conte-a009:55685] *** Process received signal *** > >>> [conte-a009:55685] Signal: Segmentation fault (11) > >>> [conte-a009:55685] Signal code: Address not mapped (1) > >>> [conte-a009:55685] Failing at address: 0x4c > >>> [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500] > >>> [conte-a009:55685] [ 1] > >>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7 > >> (orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1] > >>> 
[conte-a009:55685] [ 2] > >>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6 > >> (opal_libevent2021_event_base_loop+0xa05)[0x2b5b075ff145] > >>> [conte-a009:55685] [ 3] mpirun(orterun+0x1ffd)[0x4073b5] > >>> [conte-a009:55685] [ 4] mpirun(main+0x20)[0x4048f4] > >>> [conte-a009:55685] [ 5] /lib64/libc.so.6(__libc_start_main > >> +0xfd)[0x327f41ecdd] > >>> [conte-a009:55685] [ 6] mpirun[0x404819] > >>> [conte-a009:55685] *** End of error message *** > >>> Segmentation fault (core dumped) > >>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ > >>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive stop > >>> comm > >>> > >>> On Fri, Jun 6, 2014 at 3:00 PM, Ralph Castain <r...@open-mpi.org> wrote: > >>>> Sorry to pester with questions, but I'm trying to narrow down the > >> issue. > >>>> > >>>> * What kind of chips are on these machines? > >>>> > >>>> * If they have h/w threads, are they enabled? > >>>> > >>>> * you might have lstopo on one of those machines - could you pass along > >> its output? Otherwise, you can run a simple "mpirun -n 1 -mca > >> ess_base_verbose 20 hostname" and it will print out. Only need > >>> one node in your allocation as we don't need a fountain of output. > >>>> > >>>> I'll look into the segfault - hard to understand offhand, but could be > >> an uninitialized variable. If you have a chance, could you rerun that test > >> with "-mca plm_base_verbose 10" on the cmd line? > >>>> > >>>> Thanks again > >>>> Ralph > >>>> > >>>> On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddi...@purdue.edu> wrote: > >>>> > >>>>> Thanks for the reply. I tried out the --display-allocation option with > >>>>> several different combinations and have attached the output. I see > >>>>> this behavior on both RHEL6.4, RHEL6.5, and RHEL5.10 clusters. > >>>>> > >>>>> > >>>>> Here's debugging info on the segfault. Does that help? FWIW this does > >>>>> not seem to crash on the RHEL5 cluster or RHEL6.5 cluster. Just > >>>>> crashes on RHEL6.4. > >>>>> > >>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623 > >>>>> `which mpirun` > >>>>> No symbol table is loaded. Use the "file" command. > >>>>> GNU gdb (GDB) 7.5-1.3.187>>>>> Copyright (C) 2012 Free Software Foundation, Inc. > >>>>> License GPLv3+: GNU GPL version 3 or later > >> <http://gnu.org/licenses/gpl.html> > >>>>> This is free software: you are free to change and redistribute it. > >>>>> There is NO WARRANTY, to the extent permitted by law. Type "show > >> copying" > >>>>> and "show warranty" for details. > >>>>> This GDB was configured as "x86_64-unknown-linux-gnu". > >>>>> For bug reporting instructions, please see: > >>>>> <http://www.gnu.org/software/gdb/bugs/>... > >>>>> Reading symbols from > >>> > >>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done. > >> > >>>>> [New LWP 22623] > >>>>> [New LWP 22624] > >>>>> > >>>>> warning: Can't read pathname for load map: Input/output error. > >>>>> [Thread debugging using libthread_db enabled] > >>>>> Using host libthread_db library "/lib64/libthread_db.so.1". > >>>>> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'. > >>>>> Program terminated with signal 11, Segmentation fault. 
> >>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, > >>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 > >>>>> 422 node->hostid = node->daemon->name.vpid; > >>>>> (gdb) bt > >>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, > >>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 > >>>>> #1 0x00002acc60eec145 in opal_libevent2021_event_base_loop () from > >>> > >>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6 > >> > >>>>> #2 0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at > >>>>> orterun.c:1077 > >>>>> #3 0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at > >> main.c:13 > >>>>> > >>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes > >>>>> conte-a009 > >>>>> conte-a009 > >>>>> conte-a055 > >>>>> conte-a055 > >>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ uname -r > >>>>> 2.6.32-358.14.1.el6.x86_64 > >>>>> > >>>>> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <r...@open-mpi.org> > >> wrote: > >>>>>> > >>>>>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddi...@purdue.edu> wrote: > >>>>>> > >>>>>>> Hello all, > >>>>>>> > >>>>>>> I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP > >>>>>>> codes. In OMPI 1.6.3, I can do: > >>>>>>> > >>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello > >>>>>>> > >>>>>>> I get one rank bound to procs 0-7 and the other bound to 8-15. > >> Great! > >>>>>>> > >>>>>>> But I'm having some difficulties doing this with openmpi 1.8.1: > >>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> The following command line options and corresponding MCA parameter > >> have > >>>>>>> been deprecated and replaced as follows: > >>>>>>> > >>>>>>> Command line options: > >>>>>>> Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, > >>>>>>> -cpus-per-rank > >>>>>>> Replacement: --map-by <obj>:PE=N > >>>>>>> > >>>>>>> Equivalent MCA parameter: > >>>>>>> Deprecated: rmaps_base_cpus_per_proc > >>>>>>> Replacement: rmaps_base_mapping_policy=<obj>:PE=N > >>>>>>> > >>>>>>> The deprecated forms *will* disappear in a future version of Open > >> MPI. > >>>>>>> Please update to the new syntax. > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> There are not enough slots available in the system to satisfy the 2 > >> slots > >>>>>>> that were requested by the application: > >>>>>>> ./hello > >>>>>>> > >>>>>>> Either request fewer slots for your application, or make more slots > >> available > >>>>>>> for use. > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> > >>>>>>> OK, let me try the new syntax... > >>>>>>> > >>>>>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> There are not enough slots available in the system to satisfy the 2 > >> slots > >>>>>>> that were requested by the application: > >>>>>>> ./hello > >>>>>>> > >>>>>>> Either request fewer slots for your application, or make more slots > >> available > >>>>>>> for use. > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> > >>>>>>> What am I doing wrong? 
The documentation on these new options is > >>>>>>> somewhat poor and confusing so I'm probably doing something wrong. > >> If > >>>>>>> anyone could provide some pointers here it'd be much appreciated! If > >>>>>>> it's not something simple and you need config logs and such please > >> let > >>>>>>> me know. > >>>>>> > >>>>>> Looks like we think there are less than 16 slots allocated on that > >> node. What is in this "nodes" file? Without it, OMPI should read the Torque > >> allocation directly. You might check what we think > >>> the allocation is by adding --display-allocation to the cmd line > >>>>>> > >>>>>>> > >>>>>>> As a side note - > >>>>>>> > >>>>>>> If I try this using the PBS nodefile with the above, I get a > >> confusing message: > >>>>>>> > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> A request for multiple cpus-per-proc was given, but a directive > >>>>>>> was also give to map to an object level that has less cpus than > >>>>>>> requested ones: > >>>>>>> > >>>>>>> #cpus-per-proc: 8 > >>>>>>> number of cpus: 1 > >>>>>>> map-by: BYCORE:NOOVERSUBSCRIBE > >>>>>>> > >>>>>>> Please specify a mapping level that has more cpus, or else let us > >>>>>>> define a default mapping that will allow multiple cpus-per-proc. > >>>>>>> > >> -------------------------------------------------------------------------- > >>>>>>> > >>>>>>> From what I've gathered this is because I have a node listed 16 > >> times > >>>>>>> in my PBS nodefile so it's assuming then I have 1 core per node? > >>>>>> > >>>>>> > >>>>>> No - if listed 16 times, it should compute 16 slots. Try adding > >> --display-allocation to your cmd line and it should tell you how many slots > >> are present. > >>>>>> > >>>>>> However, it doesn't assume there is a core for each slot. Instead, it > >> detects the cores directly on the node. It sounds like it isn't seeing them > >> for some reason. What OS are you running on that > >>> node? > >>>>>> > >>>>>> FWIW: the 1.6 series has a different detection system for cores. > >> Could be something is causing problems for the new one. > >>>>>> > >>>>>>> Some > >>>>>>> better documentation here would be helpful. I haven't been able to > >>>>>>> figure out how to use the "oversubscribe" option listed in the docs. > >>>>>>> Not that I really want to oversubscribe, of course, I need to modify > >>>>>>> the nodefile, but this just stumped me for a while as 1.6.3 didn't > >>>>>>> have this behavior. 
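(Side note on the hand-edited nodefile: one way to keep the full slot count visible to mpirun is the hostfile "slots" syntax — a sketch assuming the 16-core nodes shown in the lstopo output above:

conte-a009 slots=16
conte-a055 slots=16

The hostnames are just the ones from this thread; this is only an illustration, not a fix for the mapping error below.)
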
> >>>>>>> > >>>>>>> > >>>>>>> As a extra bonus, I get a segfault in this situation: > >>>>>>> > >>>>>>> $ mpirun -np 2 -machinefile ./nodes ./hello > >>>>>>> [conte-a497:13255] *** Process received signal *** > >>>>>>> [conte-a497:13255] Signal: Segmentation fault (11) > >>>>>>> [conte-a497:13255] Signal code: Address not mapped (1) > >>>>>>> [conte-a497:13255] Failing at address: 0x2c > >>>>>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500] > >>>>>>> [conte-a497:13255] [ 1] > >>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7 > >> (orte_plm_base_complete_setup+0x615)[0x2ba960a59015] > >>>>>>> [conte-a497:13255] [ 2] > >>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6 > >> (opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715] > >>>>>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f] > >>>>>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4] > >>>>>>> [conte-a497:13255] [ 5] /lib64/libc.so.6(__libc_start_main > >> +0xfd)[0x3a1bc1ecdd] > >>>>>>> [conte-a497:13255] [ 6] mpirun[0x404719] > >>>>>>> [conte-a497:13255] *** End of error message *** > >>>>>>> Segmentation fault (core dumped) > >>>>>>> > >>>>>> > >>>>>> Huh - that's odd. Could you perhaps configure OMPI with > >> --enable-debug and gdb the core file to tell us the line numbers involved? > >>>>>> > >>>>>> Thanks > >>>>>> Ralph > >>>>>> > >>>>>>> > >>>>>>> My "nodes" file simply contains the first two lines of my original > >>>>>>> $PBS_NODEFILE provided by Torque. See above why I modified. Works > >> fine > >>>>>>> if use the full file. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Thanks in advance for any pointers you all may have! > >>>>>>> > >>>>>>> Dan > >>>>>>> > >>>>>>> > >>>>>>> -- > >>>>>>> Dan Dietz > >>>>>>> Scientific Applications Analyst > >>>>>>> ITaP Research Computing, Purdue University > >>>>>>> _______________________________________________ > >>>>>>> users mailing list > >>>>>>> us...@open-mpi.org > >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>>> > >>>>>> _______________________________________________ > >>>>>> users mailing list > >>>>>> us...@open-mpi.org > >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> Dan Dietz > >>>>> Scientific Applications Analyst > >>>>> ITaP Research Computing, Purdue University > >>>>> <slots>_______________________________________________ > >>>>> users mailing list > >>>>> us...@open-mpi.org > >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>>> > >>>> _______________________________________________ > >>>> users mailing list > >>>> us...@open-mpi.org > >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >>> > >>> > >>> > >>> -- > >>> Dan Dietz > >>> Scientific Applications Analyst > >>> ITaP Research Computing, Purdue University > >>> _______________________________________________ > >>> users mailing list > >>> us...@open-mpi.org > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users