Just to wrap this up for the user list: this has now been fixed, and the fix is in the 1.8.2 nightly tarball. The problem proved to be an edge case: a partial allocation combined with the presence of coprocessors hit a slightly different code path.
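For anyone who hits this on 1.8.1 before the fix reaches a release: the backtraces below all point at plm_base_launch_support.c:422, where node->hostid = node->daemon->name.vpid is executed for a node entry whose daemon pointer is NULL - consistent with the "Address not mapped" faults at small offsets (0x2c, 0x4c). The toy program below is only a sketch of that failure mode and of the kind of NULL guard that avoids it; the type names are hypothetical stand-ins, not the actual ORTE structures, and this is not the patch that went into 1.8.2.

#include <stdio.h>

/* Stand-in types: hypothetical simplifications of ORTE's orte_proc_t and
 * orte_node_t, used only to illustrate the failure mode. The real code
 * lives in orte/mca/plm/base/plm_base_launch_support.c. */
typedef struct { unsigned int vpid; } proc_name_t;
typedef struct { proc_name_t name; } daemon_t;
typedef struct { daemon_t *daemon; unsigned int hostid; } node_t;

int main(void)
{
    /* Model a partial allocation: the second node entry never got a
     * daemon assigned, which is what the coprocessor code path appears
     * to have produced in this report. */
    daemon_t d0 = { { 1 } };
    node_t nodes[2] = { { &d0, 0 }, { NULL, 0 } };

    for (int i = 0; i < 2; i++) {
        node_t *node = &nodes[i];
        if (NULL == node->daemon) {
            /* Guard: skip nodes with no daemon instead of crashing on
             * node->hostid = node->daemon->name.vpid; */
            continue;
        }
        node->hostid = node->daemon->name.vpid;
        printf("node %d -> hostid %u\n", i, node->hostid);
    }
    return 0;
}

Compiling and running the sketch (e.g., "cc guard_sketch.c && ./a.out" - the file name is just an example) prints one line for the node that has a daemon and silently skips the one that does not, instead of segfaulting.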
On Jun 12, 2014, at 9:04 AM, Dan Dietz <ddi...@purdue.edu> wrote: > That shouldn't be a problem. Let me figure out the process and I'll > get back to you. > > Dan > > On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain <r...@open-mpi.org> wrote: >> Arggh - is there any way I can get access to this beast so I can debug this? >> I can't figure out what in the world is going on, but it seems to be >> something triggered by your specific setup. >> >> >> On Jun 12, 2014, at 8:48 AM, Dan Dietz <ddi...@purdue.edu> wrote: >> >>> Unfortunately, the nightly tarball appears to be crashing in a similar >>> fashion. :-( I used the latest snapshot 1.8.2a1r31981. >>> >>> Dan >>> >>> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain <r...@open-mpi.org> wrote: >>>> I've poked and prodded, and the 1.8.2 tarball seems to be handling this >>>> situation just fine. I don't have access to a Torque machine, but I did >>>> set everything to follow the same code path, added faux coprocessors, etc. >>>> - and it ran just fine. >>>> >>>> Can you try the 1.8.2 tarball and see if it solves the problem? >>>> >>>> >>>> On Jun 11, 2014, at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote: >>>> >>>>> Okay, let me poke around some more. It is clearly tied to the >>>>> coprocessors, but I'm not yet sure just why. >>>>> >>>>> One thing you might do is try the nightly 1.8.2 tarball - there have been >>>>> a number of fixes, and this may well have been caught there. Worth taking >>>>> a look. >>>>> >>>>> >>>>> On Jun 11, 2014, at 6:44 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>> >>>>>> Sorry - it crashes with both torque and rsh launchers. The output from >>>>>> a gdb backtrace on the core files looks identical. >>>>>> >>>>>> Dan >>>>>> >>>>>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain <r...@open-mpi.org> wrote: >>>>>>> Afraid I'm a little confused now - are you saying it works fine under >>>>>>> Torque, but segfaults under rsh? Could you please clarify your current >>>>>>> situation? >>>>>>> >>>>>>> >>>>>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>> >>>>>>>> It looks like it is still segfaulting with the rsh launcher: >>>>>>>> >>>>>>>> ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh >>>>>>>> -np 4 -machinefile ./nodes ./hello >>>>>>>> [conte-a084:51113] *** Process received signal *** >>>>>>>> [conte-a084:51113] Signal: Segmentation fault (11) >>>>>>>> [conte-a084:51113] Signal code: Address not mapped (1) >>>>>>>> [conte-a084:51113] Failing at address: 0x2c >>>>>>>> [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710] >>>>>>>> [conte-a084:51113] [ 1] >>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015] >>>>>>>> [conte-a084:51113] [ 2] >>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715] >>>>>>>> [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f] >>>>>>>> [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4] >>>>>>>> [conte-a084:51113] [ 5] >>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d] >>>>>>>> [conte-a084:51113] [ 6] mpirun[0x404719] >>>>>>>> [conte-a084:51113] *** End of error message *** >>>>>>>> Segmentation fault (core dumped) >>>>>>>> >>>>>>>> On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>> wrote: >>>>>>>>> I'm having no luck poking at this segfault issue. 
For some strange >>>>>>>>> reason, we seem to think there are coprocessors on those remote nodes >>>>>>>>> - e.g., a Phi card. Yet your lstopo output doesn't seem to show it. >>>>>>>>> >>>>>>>>> Out of curiosity, can you try running this with "-mca plm rsh"? This >>>>>>>>> will substitute the rsh/ssh launcher in place of Torque - assuming >>>>>>>>> your system will allow it, this will let me see if the problem is >>>>>>>>> somewhere in the Torque launcher or elsewhere in OMPI. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Ralph >>>>>>>>> >>>>>>>>> On Jun 6, 2014, at 12:48 PM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>> >>>>>>>>>> No problem - >>>>>>>>>> >>>>>>>>>> These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz >>>>>>>>>> chips. >>>>>>>>>> 2 per node, 8 cores each. No threading enabled. >>>>>>>>>> >>>>>>>>>> $ lstopo >>>>>>>>>> Machine (64GB) >>>>>>>>>> NUMANode L#0 (P#0 32GB) >>>>>>>>>> Socket L#0 + L3 L#0 (20MB) >>>>>>>>>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 >>>>>>>>>> (P#0) >>>>>>>>>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 >>>>>>>>>> (P#1) >>>>>>>>>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 >>>>>>>>>> (P#2) >>>>>>>>>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 >>>>>>>>>> (P#3) >>>>>>>>>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 >>>>>>>>>> (P#4) >>>>>>>>>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 >>>>>>>>>> (P#5) >>>>>>>>>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 >>>>>>>>>> (P#6) >>>>>>>>>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 >>>>>>>>>> (P#7) >>>>>>>>>> HostBridge L#0 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 1000:0087 >>>>>>>>>> Block L#0 "sda" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:2250 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:1521 >>>>>>>>>> Net L#1 "eth0" >>>>>>>>>> PCI 8086:1521 >>>>>>>>>> Net L#2 "eth1" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 102b:0533 >>>>>>>>>> PCI 8086:1d02 >>>>>>>>>> NUMANode L#1 (P#1 32GB) >>>>>>>>>> Socket L#1 + L3 L#1 (20MB) >>>>>>>>>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 >>>>>>>>>> (P#8) >>>>>>>>>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 >>>>>>>>>> (P#9) >>>>>>>>>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 >>>>>>>>>> + PU L#10 (P#10) >>>>>>>>>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 >>>>>>>>>> + PU L#11 (P#11) >>>>>>>>>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 >>>>>>>>>> + PU L#12 (P#12) >>>>>>>>>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 >>>>>>>>>> + PU L#13 (P#13) >>>>>>>>>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 >>>>>>>>>> + PU L#14 (P#14) >>>>>>>>>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 >>>>>>>>>> + PU L#15 (P#15) >>>>>>>>>> HostBridge L#5 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 15b3:1011 >>>>>>>>>> Net L#3 "ib0" >>>>>>>>>> OpenFabrics L#4 "mlx5_0" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:2250 >>>>>>>>>> >>>>>>>>>> From the segfault below. I tried reproducing the crash on less than >>>>>>>>>> an >>>>>>>>>> 4 node allocation but wasn't able to. 
>>>>>>>>>> >>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ mpirun -np 2 >>>>>>>>>> -machinefile ./nodes -mca plm_base_verbose 10 ./hello >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> registering plm components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component isolated has no register or open function >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component slurm register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component rsh register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component tm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component tm register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> opening >>>>>>>>>> plm components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component isolated open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component slurm open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component rsh open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component tm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component tm open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select: Auto-selecting >>>>>>>>>> plm >>>>>>>>>> components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [isolated] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [isolated] set priority to 0 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [slurm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Skipping >>>>>>>>>> component [slurm]. 
Query failed to return a module >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[INVALID],INVALID] plm:rsh_lookup >>>>>>>>>> on agent ssh : rsh path NULL >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [rsh] set priority to 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [tm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [tm] set priority to 75 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Selected >>>>>>>>>> component [tm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component >>>>>>>>>> isolated closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component slurm >>>>>>>>>> closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component rsh >>>>>>>>>> closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: initial >>>>>>>>>> bias >>>>>>>>>> 55685 nodename hash 3965217721 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: final >>>>>>>>>> jobfam 24164 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:receive >>>>>>>>>> start comm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_job >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> creating map >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> add >>>>>>>>>> new daemon [[24164,0],1] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> assigning new daemon [[24164,0],1] to node conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching vm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: final >>>>>>>>>> top-level argv: >>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid >>>>>>>>>> <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" >>>>>>>>>> -mca plm_base_verbose 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting >>>>>>>>>> LD_LIBRARY_PATH: >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/usr/pbs/lib:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/lib64 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting >>>>>>>>>> PATH: >>>>>>>>>> 
/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/bin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/ibutils/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts:/usr/site/rcac/support_scripts:/usr/site/rcac/bin:/usr/site/rcac/sbin:/usr/sbin >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching on >>>>>>>>>> node conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: executing: >>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid 1 >>>>>>>>>> -mca orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" >>>>>>>>>> -mca plm_base_verbose 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm:launch: >>>>>>>>>> finished spawning orteds >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> registering plm components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> found loaded component rsh >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> component rsh register function successful >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: >>>>>>>>>> opening >>>>>>>>>> plm components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: found >>>>>>>>>> loaded component rsh >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: >>>>>>>>>> component rsh open function successful >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select: Auto-selecting >>>>>>>>>> plm >>>>>>>>>> components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Querying >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_lookup on >>>>>>>>>> agent ssh : rsh path NULL >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Query of >>>>>>>>>> component [rsh] set priority to 10 >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Selected >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_setup on >>>>>>>>>> agent ssh : rsh path NULL >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive >>>>>>>>>> start comm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1] on node >>>>>>>>>> conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY >>>>>>>>>> FROM NODE conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] NEW TOPOLOGY - >>>>>>>>>> ADDING >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch completed for daemon [[24164,0],1] at >>>>>>>>>> contact >>>>>>>>>> 1583611904.1;tcp://172.18.96.95,172.31.1.254,172.31.2.254,172.18.112.95:58312 >>>>>>>>>> [conte-a009:55685] *** Process received signal *** 
>>>>>>>>>> [conte-a009:55685] Signal: Segmentation fault (11) >>>>>>>>>> [conte-a009:55685] Signal code: Address not mapped (1) >>>>>>>>>> [conte-a009:55685] Failing at address: 0x4c >>>>>>>>>> [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500] >>>>>>>>>> [conte-a009:55685] [ 1] >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1] >>>>>>>>>> [conte-a009:55685] [ 2] >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b5b075ff145] >>>>>>>>>> [conte-a009:55685] [ 3] mpirun(orterun+0x1ffd)[0x4073b5] >>>>>>>>>> [conte-a009:55685] [ 4] mpirun(main+0x20)[0x4048f4] >>>>>>>>>> [conte-a009:55685] [ 5] >>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x327f41ecdd] >>>>>>>>>> [conte-a009:55685] [ 6] mpirun[0x404819] >>>>>>>>>> [conte-a009:55685] *** End of error message *** >>>>>>>>>> Segmentation fault (core dumped) >>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive >>>>>>>>>> stop >>>>>>>>>> comm >>>>>>>>>> >>>>>>>>>> On Fri, Jun 6, 2014 at 3:00 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>> wrote: >>>>>>>>>>> Sorry to pester with questions, but I'm trying to narrow down the >>>>>>>>>>> issue. >>>>>>>>>>> >>>>>>>>>>> * What kind of chips are on these machines? >>>>>>>>>>> >>>>>>>>>>> * If they have h/w threads, are they enabled? >>>>>>>>>>> >>>>>>>>>>> * you might have lstopo on one of those machines - could you pass >>>>>>>>>>> along its output? Otherwise, you can run a simple "mpirun -n 1 -mca >>>>>>>>>>> ess_base_verbose 20 hostname" and it will print out. Only need one >>>>>>>>>>> node in your allocation as we don't need a fountain of output. >>>>>>>>>>> >>>>>>>>>>> I'll look into the segfault - hard to understand offhand, but could >>>>>>>>>>> be an uninitialized variable. If you have a chance, could you rerun >>>>>>>>>>> that test with "-mca plm_base_verbose 10" on the cmd line? >>>>>>>>>>> >>>>>>>>>>> Thanks again >>>>>>>>>>> Ralph >>>>>>>>>>> >>>>>>>>>>> On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the reply. I tried out the --display-allocation option >>>>>>>>>>>> with >>>>>>>>>>>> several different combinations and have attached the output. I see >>>>>>>>>>>> this behavior on both RHEL6.4, RHEL6.5, and RHEL5.10 clusters. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Here's debugging info on the segfault. Does that help? FWIW this >>>>>>>>>>>> does >>>>>>>>>>>> not seem to crash on the RHEL5 cluster or RHEL6.5 cluster. Just >>>>>>>>>>>> crashes on RHEL6.4. >>>>>>>>>>>> >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623 >>>>>>>>>>>> `which mpirun` >>>>>>>>>>>> No symbol table is loaded. Use the "file" command. >>>>>>>>>>>> GNU gdb (GDB) 7.5-1.3.187 >>>>>>>>>>>> Copyright (C) 2012 Free Software Foundation, Inc. >>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later >>>>>>>>>>>> <http://gnu.org/licenses/gpl.html> >>>>>>>>>>>> This is free software: you are free to change and redistribute it. >>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>>>>>>>>>>> copying" >>>>>>>>>>>> and "show warranty" for details. >>>>>>>>>>>> This GDB was configured as "x86_64-unknown-linux-gnu". >>>>>>>>>>>> For bug reporting instructions, please see: >>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>... 
>>>>>>>>>>>> Reading symbols from >>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done. >>>>>>>>>>>> [New LWP 22623] >>>>>>>>>>>> [New LWP 22624] >>>>>>>>>>>> >>>>>>>>>>>> warning: Can't read pathname for load map: Input/output error. >>>>>>>>>>>> [Thread debugging using libthread_db enabled] >>>>>>>>>>>> Using host libthread_db library "/lib64/libthread_db.so.1". >>>>>>>>>>>> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'. >>>>>>>>>>>> Program terminated with signal 11, Segmentation fault. >>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, >>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 >>>>>>>>>>>> 422 node->hostid = node->daemon->name.vpid; >>>>>>>>>>>> (gdb) bt >>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, >>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 >>>>>>>>>>>> #1 0x00002acc60eec145 in opal_libevent2021_event_base_loop () from >>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6 >>>>>>>>>>>> #2 0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at >>>>>>>>>>>> orterun.c:1077 >>>>>>>>>>>> #3 0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at >>>>>>>>>>>> main.c:13 >>>>>>>>>>>> >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes >>>>>>>>>>>> conte-a009 >>>>>>>>>>>> conte-a009 >>>>>>>>>>>> conte-a055 >>>>>>>>>>>> conte-a055 >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ uname -r >>>>>>>>>>>> 2.6.32-358.14.1.el6.x86_64 >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'd like to bind 8 cores to a single MPI rank for hybrid >>>>>>>>>>>>>> MPI/OpenMP >>>>>>>>>>>>>> codes. In OMPI 1.6.3, I can do: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> I get one rank bound to procs 0-7 and the other bound to 8-15. >>>>>>>>>>>>>> Great! >>>>>>>>>>>>>> >>>>>>>>>>>>>> But I'm having some difficulties doing this with openmpi 1.8.1: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> The following command line options and corresponding MCA >>>>>>>>>>>>>> parameter have >>>>>>>>>>>>>> been deprecated and replaced as follows: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Command line options: >>>>>>>>>>>>>> Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, >>>>>>>>>>>>>> -cpus-per-rank >>>>>>>>>>>>>> Replacement: --map-by <obj>:PE=N >>>>>>>>>>>>>> >>>>>>>>>>>>>> Equivalent MCA parameter: >>>>>>>>>>>>>> Deprecated: rmaps_base_cpus_per_proc >>>>>>>>>>>>>> Replacement: rmaps_base_mapping_policy=<obj>:PE=N >>>>>>>>>>>>>> >>>>>>>>>>>>>> The deprecated forms *will* disappear in a future version of >>>>>>>>>>>>>> Open MPI. >>>>>>>>>>>>>> Please update to the new syntax. 
>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> There are not enough slots available in the system to satisfy >>>>>>>>>>>>>> the 2 slots >>>>>>>>>>>>>> that were requested by the application: >>>>>>>>>>>>>> ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> Either request fewer slots for your application, or make more >>>>>>>>>>>>>> slots available >>>>>>>>>>>>>> for use. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> OK, let me try the new syntax... >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> There are not enough slots available in the system to satisfy >>>>>>>>>>>>>> the 2 slots >>>>>>>>>>>>>> that were requested by the application: >>>>>>>>>>>>>> ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> Either request fewer slots for your application, or make more >>>>>>>>>>>>>> slots available >>>>>>>>>>>>>> for use. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> What am I doing wrong? The documentation on these new options is >>>>>>>>>>>>>> somewhat poor and confusing so I'm probably doing something >>>>>>>>>>>>>> wrong. If >>>>>>>>>>>>>> anyone could provide some pointers here it'd be much >>>>>>>>>>>>>> appreciated! If >>>>>>>>>>>>>> it's not something simple and you need config logs and such >>>>>>>>>>>>>> please let >>>>>>>>>>>>>> me know. >>>>>>>>>>>>> >>>>>>>>>>>>> Looks like we think there are less than 16 slots allocated on >>>>>>>>>>>>> that node. What is in this "nodes" file? Without it, OMPI should >>>>>>>>>>>>> read the Torque allocation directly. You might check what we >>>>>>>>>>>>> think the allocation is by adding --display-allocation to the cmd >>>>>>>>>>>>> line >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a side note - >>>>>>>>>>>>>> >>>>>>>>>>>>>> If I try this using the PBS nodefile with the above, I get a >>>>>>>>>>>>>> confusing message: >>>>>>>>>>>>>> >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> A request for multiple cpus-per-proc was given, but a directive >>>>>>>>>>>>>> was also give to map to an object level that has less cpus than >>>>>>>>>>>>>> requested ones: >>>>>>>>>>>>>> >>>>>>>>>>>>>> #cpus-per-proc: 8 >>>>>>>>>>>>>> number of cpus: 1 >>>>>>>>>>>>>> map-by: BYCORE:NOOVERSUBSCRIBE >>>>>>>>>>>>>> >>>>>>>>>>>>>> Please specify a mapping level that has more cpus, or else let us >>>>>>>>>>>>>> define a default mapping that will allow multiple cpus-per-proc. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> From what I've gathered this is because I have a node listed 16 >>>>>>>>>>>>>> times >>>>>>>>>>>>>> in my PBS nodefile so it's assuming then I have 1 core per node? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> No - if listed 16 times, it should compute 16 slots. Try adding >>>>>>>>>>>>> --display-allocation to your cmd line and it should tell you how >>>>>>>>>>>>> many slots are present. >>>>>>>>>>>>> >>>>>>>>>>>>> However, it doesn't assume there is a core for each slot. >>>>>>>>>>>>> Instead, it detects the cores directly on the node. It sounds >>>>>>>>>>>>> like it isn't seeing them for some reason. 
What OS are you >>>>>>>>>>>>> running on that node? >>>>>>>>>>>>> >>>>>>>>>>>>> FWIW: the 1.6 series has a different detection system for cores. >>>>>>>>>>>>> Could be something is causing problems for the new one. >>>>>>>>>>>>> >>>>>>>>>>>>>> Some >>>>>>>>>>>>>> better documentation here would be helpful. I haven't been able >>>>>>>>>>>>>> to >>>>>>>>>>>>>> figure out how to use the "oversubscribe" option listed in the >>>>>>>>>>>>>> docs. >>>>>>>>>>>>>> Not that I really want to oversubscribe, of course, I need to >>>>>>>>>>>>>> modify >>>>>>>>>>>>>> the nodefile, but this just stumped me for a while as 1.6.3 >>>>>>>>>>>>>> didn't >>>>>>>>>>>>>> have this behavior. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a extra bonus, I get a segfault in this situation: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -machinefile ./nodes ./hello >>>>>>>>>>>>>> [conte-a497:13255] *** Process received signal *** >>>>>>>>>>>>>> [conte-a497:13255] Signal: Segmentation fault (11) >>>>>>>>>>>>>> [conte-a497:13255] Signal code: Address not mapped (1) >>>>>>>>>>>>>> [conte-a497:13255] Failing at address: 0x2c >>>>>>>>>>>>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500] >>>>>>>>>>>>>> [conte-a497:13255] [ 1] >>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2ba960a59015] >>>>>>>>>>>>>> [conte-a497:13255] [ 2] >>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715] >>>>>>>>>>>>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f] >>>>>>>>>>>>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4] >>>>>>>>>>>>>> [conte-a497:13255] [ 5] >>>>>>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x3a1bc1ecdd] >>>>>>>>>>>>>> [conte-a497:13255] [ 6] mpirun[0x404719] >>>>>>>>>>>>>> [conte-a497:13255] *** End of error message *** >>>>>>>>>>>>>> Segmentation fault (core dumped) >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Huh - that's odd. Could you perhaps configure OMPI with >>>>>>>>>>>>> --enable-debug and gdb the core file to tell us the line numbers >>>>>>>>>>>>> involved? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Ralph >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> My "nodes" file simply contains the first two lines of my >>>>>>>>>>>>>> original >>>>>>>>>>>>>> $PBS_NODEFILE provided by Torque. See above why I modified. >>>>>>>>>>>>>> Works fine >>>>>>>>>>>>>> if use the full file. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks in advance for any pointers you all may have! 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Dan >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Dan Dietz >>>>>>>>>>>>>> Scientific Applications Analyst >>>>>>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> users mailing list >>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Dan Dietz >>>>>>>>>>>> Scientific Applications Analyst >>>>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>>>> <slots>_______________________________________________ >>>>>>>>>>>> users mailing list >>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> users mailing list >>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Dan Dietz >>>>>>>>>> Scientific Applications Analyst >>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>> _______________________________________________ >>>>>>>>>> users mailing list >>>>>>>>>> us...@open-mpi.org >>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Dan Dietz >>>>>>>> Scientific Applications Analyst >>>>>>>> ITaP Research Computing, Purdue University >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> Link to this post: >>>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24629.php >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> Link to this post: >>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24630.php >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Dan Dietz >>>>>> Scientific Applications Analyst >>>>>> ITaP Research Computing, Purdue University >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24631.php >>>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2014/06/24642.php >>> >>> >>> >>> -- >>> Dan Dietz >>> Scientific Applications Analyst >>> ITaP Research Computing, Purdue University >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2014/06/24645.php >> >> 