Just to wrap this up for the user list: this has now been fixed, and the fix is in the 1.8.2 nightly tarball. The problem proved to be an edge case: a partial allocation combined with the presence of coprocessors hit a slightly different code path.
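For anyone who hits this on 1.8.1 before the fix reaches a release: the backtraces below all point at plm_base_launch_support.c:422, where node->hostid = node->daemon->name.vpid is executed for a node entry whose daemon pointer is NULL - consistent with the "Address not mapped" faults at small offsets (0x2c, 0x4c). The toy program below is only a sketch of that failure mode and of the kind of NULL guard that avoids it; the type names are hypothetical stand-ins, not the actual ORTE structures, and this is not the patch that went into 1.8.2.

#include <stdio.h>

/* Stand-in types: hypothetical simplifications of ORTE's orte_proc_t and
 * orte_node_t, used only to illustrate the failure mode. The real code
 * lives in orte/mca/plm/base/plm_base_launch_support.c. */
typedef struct { unsigned int vpid; } proc_name_t;
typedef struct { proc_name_t name; } daemon_t;
typedef struct { daemon_t *daemon; unsigned int hostid; } node_t;

int main(void)
{
    /* Model a partial allocation: the second node entry never got a
     * daemon assigned, which is what the coprocessor code path appears
     * to have produced in this report. */
    daemon_t d0 = { { 1 } };
    node_t nodes[2] = { { &d0, 0 }, { NULL, 0 } };

    for (int i = 0; i < 2; i++) {
        node_t *node = &nodes[i];
        if (NULL == node->daemon) {
            /* Guard: skip nodes with no daemon instead of crashing on
             * node->hostid = node->daemon->name.vpid; */
            continue;
        }
        node->hostid = node->daemon->name.vpid;
        printf("node %d -> hostid %u\n", i, node->hostid);
    }
    return 0;
}

Compiling and running the sketch (e.g., "cc guard_sketch.c && ./a.out" - the file name is just an example) prints one line for the node that has a daemon and silently skips the one that does not, instead of segfaulting.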
On Jun 12, 2014, at 9:04 AM, Dan Dietz <ddi...@purdue.edu> wrote: > That shouldn't be a problem. Let me figure out the process and I'll > get back to you. > > Dan > > On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain <r...@open-mpi.org> wrote: >> Arggh - is there any way I can get access to this beast so I can debug this? >> I can't figure out what in the world is going on, but it seems to be >> something triggered by your specific setup. >> >> >> On Jun 12, 2014, at 8:48 AM, Dan Dietz <ddi...@purdue.edu> wrote: >> >>> Unfortunately, the nightly tarball appears to be crashing in a similar >>> fashion. :-( I used the latest snapshot 1.8.2a1r31981. >>> >>> Dan >>> >>> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain <r...@open-mpi.org> wrote: >>>> I've poked and prodded, and the 1.8.2 tarball seems to be handling this >>>> situation just fine. I don't have access to a Torque machine, but I did >>>> set everything to follow the same code path, added faux coprocessors, etc. >>>> - and it ran just fine. >>>> >>>> Can you try the 1.8.2 tarball and see if it solves the problem? >>>> >>>> >>>> On Jun 11, 2014, at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote: >>>> >>>>> Okay, let me poke around some more. It is clearly tied to the >>>>> coprocessors, but I'm not yet sure just why. >>>>> >>>>> One thing you might do is try the nightly 1.8.2 tarball - there have been >>>>> a number of fixes, and this may well have been caught there. Worth taking >>>>> a look. >>>>> >>>>> >>>>> On Jun 11, 2014, at 6:44 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>> >>>>>> Sorry - it crashes with both torque and rsh launchers. The output from >>>>>> a gdb backtrace on the core files looks identical. >>>>>> >>>>>> Dan >>>>>> >>>>>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain <r...@open-mpi.org> wrote: >>>>>>> Afraid I'm a little confused now - are you saying it works fine under >>>>>>> Torque, but segfaults under rsh? Could you please clarify your current >>>>>>> situation? >>>>>>> >>>>>>> >>>>>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>> >>>>>>>> It looks like it is still segfaulting with the rsh launcher: >>>>>>>> >>>>>>>> ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh >>>>>>>> -np 4 -machinefile ./nodes ./hello >>>>>>>> [conte-a084:51113] *** Process received signal *** >>>>>>>> [conte-a084:51113] Signal: Segmentation fault (11) >>>>>>>> [conte-a084:51113] Signal code: Address not mapped (1) >>>>>>>> [conte-a084:51113] Failing at address: 0x2c >>>>>>>> [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710] >>>>>>>> [conte-a084:51113] [ 1] >>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015] >>>>>>>> [conte-a084:51113] [ 2] >>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715] >>>>>>>> [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f] >>>>>>>> [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4] >>>>>>>> [conte-a084:51113] [ 5] >>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d] >>>>>>>> [conte-a084:51113] [ 6] mpirun[0x404719] >>>>>>>> [conte-a084:51113] *** End of error message *** >>>>>>>> Segmentation fault (core dumped) >>>>>>>> >>>>>>>> On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>> wrote: >>>>>>>>> I'm having no luck poking at this segfault issue. 
For some strange >>>>>>>>> reason, we seem to think there are coprocessors on those remote nodes >>>>>>>>> - e.g., a Phi card. Yet your lstopo output doesn't seem to show it. >>>>>>>>> >>>>>>>>> Out of curiosity, can you try running this with "-mca plm rsh"? This >>>>>>>>> will substitute the rsh/ssh launcher in place of Torque - assuming >>>>>>>>> your system will allow it, this will let me see if the problem is >>>>>>>>> somewhere in the Torque launcher or elsewhere in OMPI. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Ralph >>>>>>>>> >>>>>>>>> On Jun 6, 2014, at 12:48 PM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>> >>>>>>>>>> No problem - >>>>>>>>>> >>>>>>>>>> These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz >>>>>>>>>> chips. >>>>>>>>>> 2 per node, 8 cores each. No threading enabled. >>>>>>>>>> >>>>>>>>>> $ lstopo >>>>>>>>>> Machine (64GB) >>>>>>>>>> NUMANode L#0 (P#0 32GB) >>>>>>>>>> Socket L#0 + L3 L#0 (20MB) >>>>>>>>>> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 >>>>>>>>>> (P#0) >>>>>>>>>> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 >>>>>>>>>> (P#1) >>>>>>>>>> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 >>>>>>>>>> (P#2) >>>>>>>>>> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 >>>>>>>>>> (P#3) >>>>>>>>>> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 >>>>>>>>>> (P#4) >>>>>>>>>> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 >>>>>>>>>> (P#5) >>>>>>>>>> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 >>>>>>>>>> (P#6) >>>>>>>>>> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 >>>>>>>>>> (P#7) >>>>>>>>>> HostBridge L#0 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 1000:0087 >>>>>>>>>> Block L#0 "sda" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:2250 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:1521 >>>>>>>>>> Net L#1 "eth0" >>>>>>>>>> PCI 8086:1521 >>>>>>>>>> Net L#2 "eth1" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 102b:0533 >>>>>>>>>> PCI 8086:1d02 >>>>>>>>>> NUMANode L#1 (P#1 32GB) >>>>>>>>>> Socket L#1 + L3 L#1 (20MB) >>>>>>>>>> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 >>>>>>>>>> (P#8) >>>>>>>>>> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 >>>>>>>>>> (P#9) >>>>>>>>>> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 >>>>>>>>>> + PU L#10 (P#10) >>>>>>>>>> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 >>>>>>>>>> + PU L#11 (P#11) >>>>>>>>>> L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 >>>>>>>>>> + PU L#12 (P#12) >>>>>>>>>> L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 >>>>>>>>>> + PU L#13 (P#13) >>>>>>>>>> L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 >>>>>>>>>> + PU L#14 (P#14) >>>>>>>>>> L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 >>>>>>>>>> + PU L#15 (P#15) >>>>>>>>>> HostBridge L#5 >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 15b3:1011 >>>>>>>>>> Net L#3 "ib0" >>>>>>>>>> OpenFabrics L#4 "mlx5_0" >>>>>>>>>> PCIBridge >>>>>>>>>> PCI 8086:2250 >>>>>>>>>> >>>>>>>>>> From the segfault below. I tried reproducing the crash on less than >>>>>>>>>> an >>>>>>>>>> 4 node allocation but wasn't able to. 
>>>>>>>>>> >>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ mpirun -np 2 >>>>>>>>>> -machinefile ./nodes -mca plm_base_verbose 10 ./hello >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> registering plm components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component isolated has no register or open function >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component slurm register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component rsh register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> found loaded component tm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_register: >>>>>>>>>> component tm register function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> opening >>>>>>>>>> plm components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component isolated open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component slurm open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component rsh open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: found >>>>>>>>>> loaded component tm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: components_open: >>>>>>>>>> component tm open function successful >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select: Auto-selecting >>>>>>>>>> plm >>>>>>>>>> components >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [isolated] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [isolated] set priority to 0 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [slurm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Skipping >>>>>>>>>> component [slurm]. 
Query failed to return a module >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[INVALID],INVALID] plm:rsh_lookup >>>>>>>>>> on agent ssh : rsh path NULL >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [rsh] set priority to 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Querying >>>>>>>>>> component [tm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Query of >>>>>>>>>> component [tm] set priority to 75 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca:base:select:( plm) Selected >>>>>>>>>> component [tm] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component >>>>>>>>>> isolated closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component isolated >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component slurm >>>>>>>>>> closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component slurm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: component rsh >>>>>>>>>> closed >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] mca: base: close: unloading >>>>>>>>>> component rsh >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: initial >>>>>>>>>> bias >>>>>>>>>> 55685 nodename hash 3965217721 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] plm:base:set_hnp_name: final >>>>>>>>>> jobfam 24164 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:receive >>>>>>>>>> start comm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_job >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> creating map >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> add >>>>>>>>>> new daemon [[24164,0],1] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:base:setup_vm >>>>>>>>>> assigning new daemon [[24164,0],1] to node conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching vm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: final >>>>>>>>>> top-level argv: >>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid >>>>>>>>>> <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" >>>>>>>>>> -mca plm_base_verbose 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting >>>>>>>>>> LD_LIBRARY_PATH: >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib:/usr/pbs/lib:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/lib64 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: resetting >>>>>>>>>> PATH: >>>>>>>>>> 
/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin:/apps/rhel6/intel/composer_xe_2013_sp1.2.144/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64:/apps/rhel6/intel/opencl-1.2-3.2.1.16712/bin:/usr/lib64/qt-3.3/bin:/opt/moab/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/ibutils/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts:/usr/site/rcac/support_scripts:/usr/site/rcac/bin:/usr/site/rcac/sbin:/usr/sbin >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: launching on >>>>>>>>>> node conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm: executing: >>>>>>>>>> orted -mca ess tm -mca orte_ess_jobid 1583611904 -mca orte_ess_vpid 1 >>>>>>>>>> -mca orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>>>>> "1583611904.0;tcp://172.18.96.49,172.31.1.254,172.31.2.254,172.18.112.49:37380" >>>>>>>>>> -mca plm_base_verbose 10 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] plm:tm:launch: >>>>>>>>>> finished spawning orteds >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> registering plm components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> found loaded component rsh >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_register: >>>>>>>>>> component rsh register function successful >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: >>>>>>>>>> opening >>>>>>>>>> plm components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: found >>>>>>>>>> loaded component rsh >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca: base: components_open: >>>>>>>>>> component rsh open function successful >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select: Auto-selecting >>>>>>>>>> plm >>>>>>>>>> components >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Querying >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_lookup on >>>>>>>>>> agent ssh : rsh path NULL >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Query of >>>>>>>>>> component [rsh] set priority to 10 >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] mca:base:select:( plm) Selected >>>>>>>>>> component [rsh] >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:rsh_setup on >>>>>>>>>> agent ssh : rsh path NULL >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive >>>>>>>>>> start comm >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1] >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch from daemon [[24164,0],1] on node >>>>>>>>>> conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY >>>>>>>>>> FROM NODE conte-a055 >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] NEW TOPOLOGY - >>>>>>>>>> ADDING >>>>>>>>>> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >>>>>>>>>> plm:base:orted_report_launch completed for daemon [[24164,0],1] at >>>>>>>>>> contact >>>>>>>>>> 1583611904.1;tcp://172.18.96.95,172.31.1.254,172.31.2.254,172.18.112.95:58312 >>>>>>>>>> [conte-a009:55685] *** Process received signal *** 
>>>>>>>>>> [conte-a009:55685] Signal: Segmentation fault (11) >>>>>>>>>> [conte-a009:55685] Signal code: Address not mapped (1) >>>>>>>>>> [conte-a009:55685] Failing at address: 0x4c >>>>>>>>>> [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500] >>>>>>>>>> [conte-a009:55685] [ 1] >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1] >>>>>>>>>> [conte-a009:55685] [ 2] >>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b5b075ff145] >>>>>>>>>> [conte-a009:55685] [ 3] mpirun(orterun+0x1ffd)[0x4073b5] >>>>>>>>>> [conte-a009:55685] [ 4] mpirun(main+0x20)[0x4048f4] >>>>>>>>>> [conte-a009:55685] [ 5] >>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x327f41ecdd] >>>>>>>>>> [conte-a009:55685] [ 6] mpirun[0x404819] >>>>>>>>>> [conte-a009:55685] *** End of error message *** >>>>>>>>>> Segmentation fault (core dumped) >>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ >>>>>>>>>> [conte-a055.rcac.purdue.edu:32094] [[24164,0],1] plm:base:receive >>>>>>>>>> stop >>>>>>>>>> comm >>>>>>>>>> >>>>>>>>>> On Fri, Jun 6, 2014 at 3:00 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>> wrote: >>>>>>>>>>> Sorry to pester with questions, but I'm trying to narrow down the >>>>>>>>>>> issue. >>>>>>>>>>> >>>>>>>>>>> * What kind of chips are on these machines? >>>>>>>>>>> >>>>>>>>>>> * If they have h/w threads, are they enabled? >>>>>>>>>>> >>>>>>>>>>> * you might have lstopo on one of those machines - could you pass >>>>>>>>>>> along its output? Otherwise, you can run a simple "mpirun -n 1 -mca >>>>>>>>>>> ess_base_verbose 20 hostname" and it will print out. Only need one >>>>>>>>>>> node in your allocation as we don't need a fountain of output. >>>>>>>>>>> >>>>>>>>>>> I'll look into the segfault - hard to understand offhand, but could >>>>>>>>>>> be an uninitialized variable. If you have a chance, could you rerun >>>>>>>>>>> that test with "-mca plm_base_verbose 10" on the cmd line? >>>>>>>>>>> >>>>>>>>>>> Thanks again >>>>>>>>>>> Ralph >>>>>>>>>>> >>>>>>>>>>> On Jun 6, 2014, at 10:31 AM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks for the reply. I tried out the --display-allocation option >>>>>>>>>>>> with >>>>>>>>>>>> several different combinations and have attached the output. I see >>>>>>>>>>>> this behavior on both RHEL6.4, RHEL6.5, and RHEL5.10 clusters. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Here's debugging info on the segfault. Does that help? FWIW this >>>>>>>>>>>> does >>>>>>>>>>>> not seem to crash on the RHEL5 cluster or RHEL6.5 cluster. Just >>>>>>>>>>>> crashes on RHEL6.4. >>>>>>>>>>>> >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ gdb -c core.22623 >>>>>>>>>>>> `which mpirun` >>>>>>>>>>>> No symbol table is loaded. Use the "file" command. >>>>>>>>>>>> GNU gdb (GDB) 7.5-1.3.187 >>>>>>>>>>>> Copyright (C) 2012 Free Software Foundation, Inc. >>>>>>>>>>>> License GPLv3+: GNU GPL version 3 or later >>>>>>>>>>>> <http://gnu.org/licenses/gpl.html> >>>>>>>>>>>> This is free software: you are free to change and redistribute it. >>>>>>>>>>>> There is NO WARRANTY, to the extent permitted by law. Type "show >>>>>>>>>>>> copying" >>>>>>>>>>>> and "show warranty" for details. >>>>>>>>>>>> This GDB was configured as "x86_64-unknown-linux-gnu". >>>>>>>>>>>> For bug reporting instructions, please see: >>>>>>>>>>>> <http://www.gnu.org/software/gdb/bugs/>... 
>>>>>>>>>>>> Reading symbols from >>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/bin/mpirun...done. >>>>>>>>>>>> [New LWP 22623] >>>>>>>>>>>> [New LWP 22624] >>>>>>>>>>>> >>>>>>>>>>>> warning: Can't read pathname for load map: Input/output error. >>>>>>>>>>>> [Thread debugging using libthread_db enabled] >>>>>>>>>>>> Using host libthread_db library "/lib64/libthread_db.so.1". >>>>>>>>>>>> Core was generated by `mpirun -np 2 -machinefile ./nodes ./hello'. >>>>>>>>>>>> Program terminated with signal 11, Segmentation fault. >>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, >>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 >>>>>>>>>>>> 422 node->hostid = node->daemon->name.vpid; >>>>>>>>>>>> (gdb) bt >>>>>>>>>>>> #0 0x00002acc602920e1 in orte_plm_base_complete_setup (fd=-1, >>>>>>>>>>>> args=-1, cbdata=0x20c0840) at base/plm_base_launch_support.c:422 >>>>>>>>>>>> #1 0x00002acc60eec145 in opal_libevent2021_event_base_loop () from >>>>>>>>>>>> /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6 >>>>>>>>>>>> #2 0x00000000004073b5 in orterun (argc=6, argv=0x7fff5bb2a3a8) at >>>>>>>>>>>> orterun.c:1077 >>>>>>>>>>>> #3 0x00000000004048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at >>>>>>>>>>>> main.c:13 >>>>>>>>>>>> >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes >>>>>>>>>>>> conte-a009 >>>>>>>>>>>> conte-a009 >>>>>>>>>>>> conte-a055 >>>>>>>>>>>> conte-a055 >>>>>>>>>>>> ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ uname -r >>>>>>>>>>>> 2.6.32-358.14.1.el6.x86_64 >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jun 5, 2014 at 7:54 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Jun 5, 2014, at 2:13 PM, Dan Dietz <ddi...@purdue.edu> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello all, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'd like to bind 8 cores to a single MPI rank for hybrid >>>>>>>>>>>>>> MPI/OpenMP >>>>>>>>>>>>>> codes. In OMPI 1.6.3, I can do: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> I get one rank bound to procs 0-7 and the other bound to 8-15. >>>>>>>>>>>>>> Great! >>>>>>>>>>>>>> >>>>>>>>>>>>>> But I'm having some difficulties doing this with openmpi 1.8.1: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> The following command line options and corresponding MCA >>>>>>>>>>>>>> parameter have >>>>>>>>>>>>>> been deprecated and replaced as follows: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Command line options: >>>>>>>>>>>>>> Deprecated: --cpus-per-proc, -cpus-per-proc, --cpus-per-rank, >>>>>>>>>>>>>> -cpus-per-rank >>>>>>>>>>>>>> Replacement: --map-by <obj>:PE=N >>>>>>>>>>>>>> >>>>>>>>>>>>>> Equivalent MCA parameter: >>>>>>>>>>>>>> Deprecated: rmaps_base_cpus_per_proc >>>>>>>>>>>>>> Replacement: rmaps_base_mapping_policy=<obj>:PE=N >>>>>>>>>>>>>> >>>>>>>>>>>>>> The deprecated forms *will* disappear in a future version of >>>>>>>>>>>>>> Open MPI. >>>>>>>>>>>>>> Please update to the new syntax. 
>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> There are not enough slots available in the system to satisfy >>>>>>>>>>>>>> the 2 slots >>>>>>>>>>>>>> that were requested by the application: >>>>>>>>>>>>>> ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> Either request fewer slots for your application, or make more >>>>>>>>>>>>>> slots available >>>>>>>>>>>>>> for use. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> OK, let me try the new syntax... >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 --map-by core:pe=8 -machinefile ./nodes ./hello >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> There are not enough slots available in the system to satisfy >>>>>>>>>>>>>> the 2 slots >>>>>>>>>>>>>> that were requested by the application: >>>>>>>>>>>>>> ./hello >>>>>>>>>>>>>> >>>>>>>>>>>>>> Either request fewer slots for your application, or make more >>>>>>>>>>>>>> slots available >>>>>>>>>>>>>> for use. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> What am I doing wrong? The documentation on these new options is >>>>>>>>>>>>>> somewhat poor and confusing so I'm probably doing something >>>>>>>>>>>>>> wrong. If >>>>>>>>>>>>>> anyone could provide some pointers here it'd be much >>>>>>>>>>>>>> appreciated! If >>>>>>>>>>>>>> it's not something simple and you need config logs and such >>>>>>>>>>>>>> please let >>>>>>>>>>>>>> me know. >>>>>>>>>>>>> >>>>>>>>>>>>> Looks like we think there are less than 16 slots allocated on >>>>>>>>>>>>> that node. What is in this "nodes" file? Without it, OMPI should >>>>>>>>>>>>> read the Torque allocation directly. You might check what we >>>>>>>>>>>>> think the allocation is by adding --display-allocation to the cmd >>>>>>>>>>>>> line >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a side note - >>>>>>>>>>>>>> >>>>>>>>>>>>>> If I try this using the PBS nodefile with the above, I get a >>>>>>>>>>>>>> confusing message: >>>>>>>>>>>>>> >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> A request for multiple cpus-per-proc was given, but a directive >>>>>>>>>>>>>> was also give to map to an object level that has less cpus than >>>>>>>>>>>>>> requested ones: >>>>>>>>>>>>>> >>>>>>>>>>>>>> #cpus-per-proc: 8 >>>>>>>>>>>>>> number of cpus: 1 >>>>>>>>>>>>>> map-by: BYCORE:NOOVERSUBSCRIBE >>>>>>>>>>>>>> >>>>>>>>>>>>>> Please specify a mapping level that has more cpus, or else let us >>>>>>>>>>>>>> define a default mapping that will allow multiple cpus-per-proc. >>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> From what I've gathered this is because I have a node listed 16 >>>>>>>>>>>>>> times >>>>>>>>>>>>>> in my PBS nodefile so it's assuming then I have 1 core per node? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> No - if listed 16 times, it should compute 16 slots. Try adding >>>>>>>>>>>>> --display-allocation to your cmd line and it should tell you how >>>>>>>>>>>>> many slots are present. >>>>>>>>>>>>> >>>>>>>>>>>>> However, it doesn't assume there is a core for each slot. >>>>>>>>>>>>> Instead, it detects the cores directly on the node. It sounds >>>>>>>>>>>>> like it isn't seeing them for some reason. 
What OS are you >>>>>>>>>>>>> running on that node? >>>>>>>>>>>>> >>>>>>>>>>>>> FWIW: the 1.6 series has a different detection system for cores. >>>>>>>>>>>>> Could be something is causing problems for the new one. >>>>>>>>>>>>> >>>>>>>>>>>>>> Some >>>>>>>>>>>>>> better documentation here would be helpful. I haven't been able >>>>>>>>>>>>>> to >>>>>>>>>>>>>> figure out how to use the "oversubscribe" option listed in the >>>>>>>>>>>>>> docs. >>>>>>>>>>>>>> Not that I really want to oversubscribe, of course, I need to >>>>>>>>>>>>>> modify >>>>>>>>>>>>>> the nodefile, but this just stumped me for a while as 1.6.3 >>>>>>>>>>>>>> didn't >>>>>>>>>>>>>> have this behavior. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a extra bonus, I get a segfault in this situation: >>>>>>>>>>>>>> >>>>>>>>>>>>>> $ mpirun -np 2 -machinefile ./nodes ./hello >>>>>>>>>>>>>> [conte-a497:13255] *** Process received signal *** >>>>>>>>>>>>>> [conte-a497:13255] Signal: Segmentation fault (11) >>>>>>>>>>>>>> [conte-a497:13255] Signal code: Address not mapped (1) >>>>>>>>>>>>>> [conte-a497:13255] Failing at address: 0x2c >>>>>>>>>>>>>> [conte-a497:13255] [ 0] /lib64/libpthread.so.0[0x3c9460f500] >>>>>>>>>>>>>> [conte-a497:13255] [ 1] >>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2ba960a59015] >>>>>>>>>>>>>> [conte-a497:13255] [ 2] >>>>>>>>>>>>>> /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2ba961666715] >>>>>>>>>>>>>> [conte-a497:13255] [ 3] mpirun(orterun+0x1b45)[0x40684f] >>>>>>>>>>>>>> [conte-a497:13255] [ 4] mpirun(main+0x20)[0x4047f4] >>>>>>>>>>>>>> [conte-a497:13255] [ 5] >>>>>>>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x3a1bc1ecdd] >>>>>>>>>>>>>> [conte-a497:13255] [ 6] mpirun[0x404719] >>>>>>>>>>>>>> [conte-a497:13255] *** End of error message *** >>>>>>>>>>>>>> Segmentation fault (core dumped) >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Huh - that's odd. Could you perhaps configure OMPI with >>>>>>>>>>>>> --enable-debug and gdb the core file to tell us the line numbers >>>>>>>>>>>>> involved? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Ralph >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> My "nodes" file simply contains the first two lines of my >>>>>>>>>>>>>> original >>>>>>>>>>>>>> $PBS_NODEFILE provided by Torque. See above why I modified. >>>>>>>>>>>>>> Works fine >>>>>>>>>>>>>> if use the full file. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks in advance for any pointers you all may have! 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Dan >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Dan Dietz >>>>>>>>>>>>>> Scientific Applications Analyst >>>>>>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> users mailing list >>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Dan Dietz >>>>>>>>>>>> Scientific Applications Analyst >>>>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>>>> <slots>_______________________________________________ >>>>>>>>>>>> users mailing list >>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> users mailing list >>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Dan Dietz >>>>>>>>>> Scientific Applications Analyst >>>>>>>>>> ITaP Research Computing, Purdue University >>>>>>>>>> _______________________________________________ >>>>>>>>>> users mailing list >>>>>>>>>> us...@open-mpi.org >>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> users mailing list >>>>>>>>> us...@open-mpi.org >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Dan Dietz >>>>>>>> Scientific Applications Analyst >>>>>>>> ITaP Research Computing, Purdue University >>>>>>>> _______________________________________________ >>>>>>>> users mailing list >>>>>>>> us...@open-mpi.org >>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>> Link to this post: >>>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24629.php >>>>>>> >>>>>>> _______________________________________________ >>>>>>> users mailing list >>>>>>> us...@open-mpi.org >>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>> Link to this post: >>>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24630.php >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Dan Dietz >>>>>> Scientific Applications Analyst >>>>>> ITaP Research Computing, Purdue University >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> Link to this post: >>>>>> http://www.open-mpi.org/community/lists/users/2014/06/24631.php >>>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2014/06/24642.php >>> >>> >>> >>> -- >>> Dan Dietz >>> Scientific Applications Analyst >>> ITaP Research Computing, Purdue University >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2014/06/24645.php >> >> 