Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
I've poked and prodded, and the 1.8.2 tarball seems to be handling this situation just fine. I don't have access to a Torque machine, but I did set everything to follow the same code path, added faux coprocessors, etc. - and it ran just fine.

Can you try the 1.8.2 tarball and see if it solves the problem?

On Jun 11, 2014, at 2:15 PM, Ralph Castain wrote:

> Okay, let me poke around some more. It is clearly tied to the coprocessors,
> but I'm not yet sure just why.
>
> One thing you might do is try the nightly 1.8.2 tarball - there have been a
> number of fixes, and this may well have been caught there. Worth taking a
> look.
>
> On Jun 11, 2014, at 6:44 AM, Dan Dietz wrote:
>
>> Sorry - it crashes with both torque and rsh launchers. The output from
>> a gdb backtrace on the core files looks identical.
>>
>> Dan
>>
>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote:
>>
>>> Afraid I'm a little confused now - are you saying it works fine under
>>> Torque, but segfaults under rsh? Could you please clarify your current
>>> situation?
>>>
>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote:
>>>
>>>> It looks like it is still segfaulting with the rsh launcher:
>>>>
>>>> ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh -np 4 -machinefile ./nodes ./hello
>>>> [conte-a084:51113] *** Process received signal ***
>>>> [conte-a084:51113] Signal: Segmentation fault (11)
>>>> [conte-a084:51113] Signal code: Address not mapped (1)
>>>> [conte-a084:51113] Failing at address: 0x2c
>>>> [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710]
>>>> [conte-a084:51113] [ 1] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015]
>>>> [conte-a084:51113] [ 2] /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715]
>>>> [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f]
>>>> [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4]
>>>> [conte-a084:51113] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d]
>>>> [conte-a084:51113] [ 6] mpirun[0x404719]
>>>> [conte-a084:51113] *** End of error message ***
>>>> Segmentation fault (core dumped)
>>>>
>>>> On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain wrote:
>>>>
>>>>> I'm having no luck poking at this segfault issue. For some strange
>>>>> reason, we seem to think there are coprocessors on those remote nodes -
>>>>> e.g., a Phi card. Yet your lstopo output doesn't seem to show it.
>>>>>
>>>>> Out of curiosity, can you try running this with "-mca plm rsh"? This will
>>>>> substitute the rsh/ssh launcher in place of Torque - assuming your system
>>>>> will allow it, this will let me see if the problem is somewhere in the
>>>>> Torque launcher or elsewhere in OMPI.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Jun 6, 2014, at 12:48 PM, Dan Dietz wrote:
>>>>>
>>>>>> No problem -
>>>>>>
>>>>>> These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips.
>>>>>> 2 per node, 8 cores each. No threading enabled.
>>>>>>
>>>>>> $ lstopo
>>>>>> Machine (64GB)
>>>>>>   NUMANode L#0 (P#0 32GB)
>>>>>>     Socket L#0 + L3 L#0 (20MB)
>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>>>>>>   HostBridge L#0
>>>>>>     PCIBridge
>>>>>>       PCI 1000:0087
>>>>>>         Block L#0 "sda"
>>>>>>     PCIBridge
>>>>>>       PCI 8086:2250
>>>>>>     PCIBridge
>>>>>>       PCI 8086:1521
>>>>>>         Net L#1 "eth0"
>>>>>>       PCI 8086:1521
>>>>>>         Net L#2 "eth1"
>>>>>>     PCIBridge
>>>>>>       PCI 102b:0533
>>>>>>     PCI 8086:1d02
>>>>>>   NUMANode L#1 (P#1 32GB)
>>>>>>     Socket L#1 + L3 L#1 (20MB)
>>>>>>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>>>>>>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>>>>>>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>>>>>>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>>>>>>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>>>>>>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>>>>>>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU
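For anyone who wants to repeat the test, building a nightly snapshot is the standard tarball dance; a minimal sketch, assuming the snapshot Dan names below (the exact file name changes nightly, and the download URL and install prefix here are illustrative):

$ wget http://www.open-mpi.org/nightly/v1.8/openmpi-1.8.2a1r31981.tar.bz2
$ tar xjf openmpi-1.8.2a1r31981.tar.bz2 && cd openmpi-1.8.2a1r31981
$ ./configure --prefix=$HOME/ompi-1.8.2-test --enable-debug 2>&1 | tee configure.out
$ make -j 8 install 2>&1 | tee make.out
$ $HOME/ompi-1.8.2-test/bin/mpirun -np 4 -machinefile ./nodes ./hello   # retry the failing case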
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote:
> I've poked and prodded, and the 1.8.2 tarball seems to be handling this
> situation

Ralph,

That's still the development tarball, right? 1.8.2 remains unreleased?

Is the ETA for 1.8.2 the end of this month?

Thanks,

-- bennet
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
It isn't a development tarball - it's the current state of the release branch, and is therefore managed much more strictly than the developer trunk.

We are preparing the release candidate now. I have about a dozen CMRs waiting for final review before moving across to 1.8.2, and then we'll begin the release process. Most of those changes involve closing memory leaks, and a few deal with more esoteric use cases, so I think most users would probably be okay at the moment.

The nightly is passing rather cleanly, actually, so I'm feeling good that the ETA for release isn't far off. Maybe not the end of June, as last-minute things always surface in the RC process, but likely early July.

On Jun 12, 2014, at 8:11 AM, Bennet Fauber wrote:

> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote:
>> I've poked and prodded, and the 1.8.2 tarball seems to be handling this
>> situation
>
> Ralph,
>
> That's still the development tarball, right? 1.8.2 remains unreleased?
>
> Is the ETA for 1.8.2 the end of this month?
>
> Thanks,
> -- bennet
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
Unfortunately, the nightly tarball appears to be crashing in a similar fashion. :-( I used the latest snapshot 1.8.2a1r31981.

Dan

On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote:

> I've poked and prodded, and the 1.8.2 tarball seems to be handling this
> situation just fine. I don't have access to a Torque machine, but I did set
> everything to follow the same code path, added faux coprocessors, etc. - and
> it ran just fine.
>
> Can you try the 1.8.2 tarball and see if it solves the problem?
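The gdb backtraces Dan mentions come from the standard core-file workflow; a minimal sketch, assuming core dumps are enabled in the shell and using an illustrative core file name (naming varies by system):

$ ulimit -c unlimited                      # allow the crashing mpirun to dump core
$ mpirun -mca plm rsh -np 4 -machinefile ./nodes ./hello
Segmentation fault (core dumped)
$ gdb $(which mpirun) core.51113           # load the binary plus its core file
(gdb) bt                                   # backtrace of the crashing thread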
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
Arggh - is there any way I can get access to this beast so I can debug this? I can't figure out what in the world is going on, but it seems to be something triggered by your specific setup.

On Jun 12, 2014, at 8:48 AM, Dan Dietz wrote:

> Unfortunately, the nightly tarball appears to be crashing in a similar
> fashion. :-( I used the latest snapshot 1.8.2a1r31981.
>
> Dan
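In the meantime, turning up launcher-side verbosity can sometimes narrow down where orte_plm_base_complete_setup goes wrong; a sketch using standard MCA verbosity parameters (the level 5 is arbitrary - higher values print more):

$ mpirun -mca plm rsh -mca plm_base_verbose 5 -mca rmaps_base_verbose 5 \
      -np 4 -machinefile ./nodes ./hello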
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
That shouldn't be a problem. Let me figure out the process and I'll get back to you.

Dan

On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote:

> Arggh - is there any way I can get access to this beast so I can debug this?
> I can't figure out what in the world is going on, but it seems to be
> something triggered by your specific setup.
Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1
Kewl - thanks! I'm a Purdue alum, if that helps :-)

On Jun 12, 2014, at 9:04 AM, Dan Dietz wrote:

> That shouldn't be a problem. Let me figure out the process and I'll
> get back to you.
>
> Dan
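For completeness, since it never actually appears in the exchange above: the multiple-cores-per-rank binding the subject line refers to is expressed in the 1.8 series with the PE modifier to --map-by. A minimal sketch (the process count and ./hello binary are illustrative):

$ mpirun -np 4 --map-by socket:PE=2 --bind-to core --report-bindings ./hello
    # PE=2 gives each rank 2 cores; --report-bindings prints the resulting mask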
Re: [OMPI users] OPENIB unknown transport errors
Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's going on. On a node that's working fine (w2), under port 1 there is a line:

  LinkLayer: InfiniBand

On a node that is having trouble (w3), that line is not present. The question is why this inconsistency occurs. I don't seem to have ofed_info installed on my system - not sure what magical package Red Hat decided to hide it in. The InfiniBand stack I am running is stock with our version of Scientific Linux (6.2). I am beginning to wonder if this isn't some bug with the Red Hat/SL-provided InfiniBand stack.

I'll do some more poking, but at least now I've got something semi-solid to poke at. Thanks for all of your help; I've attached the results of "ibv_devinfo -v" for both systems, so if you see anything else that jumps out at you, please let me know.

Tim

On Sat, Jun 7, 2014 at 2:21 AM, Mike Dubman wrote:

> could you please attach output of "ibv_devinfo -v" and "ofed_info -s"
> Thx
>
> On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller wrote:
>
>> Hi Josh,
>>
>> I asked one of our more advanced users to add the "-mca
>> btl_openib_if_include mlx4_0:1" argument to his job script.
>> Unfortunately, the same error occurred as before.
>>
>> We'll keep digging on our end; if you have any other suggestions, please
>> let us know.
>>
>> Tim
>>
>> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for attempting to sort this out. In answer to your questions:
>>>
>>> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
>>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>>> mpirun uses the ssh launcher to actually communicate and launch the
>>> processes on remote nodes.
>>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>>> motherboard on most of our nodes, including all that have this issue).
>>> They are all configured to use InfiniBand (no IPoIB or other protocols).
>>> 3. No, we don't explicitly ask for a device:port pair. We will try your
>>> suggestion and report back.
>>>
>>> Thanks again!
>>>
>>> Tim
>>>
>>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd wrote:
>>>
>>>> Strange indeed. This info (remote adapter info) is passed around in the
>>>> modex and the struct is locally populated during add procs.
>>>>
>>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>>> 2. How many active ports do you have on each HCA? Are they all
>>>> configured to use IB?
>>>> 3. Do you explicitly ask for a device:port pair with the "if include"
>>>> mca param? If not, can you please add "-mca btl_openib_if_include
>>>> mlx4_0:1" (assuming you have a ConnectX-3 HCA and port 1 is configured
>>>> to run over IB.)
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to revive this thread, since I am still periodically getting
>>>>> errors of this type. I have built 1.8.1 with --enable-debug and run
>>>>> with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to
>>>>> provide any additional information that I can find useful. I've gone
>>>>> ahead and attached a dump of the output under 1.8.1. The key lines are:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>> same Infiniband network.
>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>
>>>>>   Local host:            w1
>>>>>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>
>>>>>   Remote host:           w16
>>>>>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>>>>>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Note that the vendor and part IDs are the same. If I immediately run
>>>>> on the same two nodes using MVAPICH2, everything is fine.
>>>>>
>>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are
>>>>> the same and made by the same vendor, yet it thinks the transport types
>>>>> are different (and one is unknown). I'm hoping someone with some
>>>>> experience with how the OpenIB BTL works can shed some light on this
>>>>> problem...
>>>>>
>>>>> Tim
>>>>>
>>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd wrote:
>>>>>
>>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1?
>>>>>> I'm wondering if this is an issue with the OOB. If you have a debug
>>>>>> build, you can run -mca btl_openib_verbose 10
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd wrote:
>>>>>>
>>>>>>> Hi, Tim
>>>>>>>
>>>>>>> Run "ibstat" on each host:
>>>>>>>
>>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>>
>>>>>>> 2. Look at the Link Layer settings
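A quick way to compare the two nodes Tim describes is to diff the link-layer lines directly; a minimal sketch, assuming passwordless ssh to w2 and w3 (some ibv_devinfo versions spell the field link_layer rather than LinkLayer, so the grep matches both):

$ for h in w2 w3; do
>   echo "== $h =="
>   ssh $h 'ibv_devinfo -v | grep -iE "link_?layer|state"'
> done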