Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
I've poked and prodded, and the 1.8.2 tarball seems to be handling this 
situation just fine. I don't have access to a Torque machine, but I did set 
everything to follow the same code path, added faux coprocessors, etc. - and it 
ran just fine.

Can you try the 1.8.2 tarball and see if it solves the problem?
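
In case it helps, here is roughly what I mean -- the snapshot name, download
URL, and install prefix below are only examples, so substitute whatever the
current v1.8 nightly actually is:

$ wget http://www.open-mpi.org/nightly/v1.8/openmpi-1.8.2a1r31981.tar.gz
$ tar xzf openmpi-1.8.2a1r31981.tar.gz && cd openmpi-1.8.2a1r31981
$ ./configure --prefix=$HOME/ompi-nightly   # separate prefix so it can't mix with the 1.8.1 install
$ make -j8 install
$ $HOME/ompi-nightly/bin/mpirun -np 4 -machinefile ./nodes ./hello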


On Jun 11, 2014, at 2:15 PM, Ralph Castain  wrote:

> Okay, let me poke around some more. It is clearly tied to the coprocessors, 
> but I'm not yet sure just why.
> 
> One thing you might do is try the nightly 1.8.2 tarball - there have been a 
> number of fixes, and this may well have been caught there. Worth taking a 
> look.
> 
> 
> On Jun 11, 2014, at 6:44 AM, Dan Dietz  wrote:
> 
>> Sorry - it crashes with both torque and rsh launchers. The output from
>> a gdb backtrace on the core files looks identical.
>> 
>> Dan
>> 
>> On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain  wrote:
>>> Afraid I'm a little confused now - are you saying it works fine under 
>>> Torque, but segfaults under rsh? Could you please clarify your current 
>>> situation?
>>> 
>>> 
>>> On Jun 11, 2014, at 6:27 AM, Dan Dietz  wrote:
>>> 
 It looks like it is still segfaulting with the rsh launcher:
 
 ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
 -np 4 -machinefile ./nodes ./hello
 [conte-a084:51113] *** Process received signal ***
 [conte-a084:51113] Signal: Segmentation fault (11)
 [conte-a084:51113] Signal code: Address not mapped (1)
 [conte-a084:51113] Failing at address: 0x2c
 [conte-a084:51113] [ 0] /lib64/libpthread.so.0[0x36ddc0f710]
 [conte-a084:51113] [ 1]
 /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x615)[0x2b857e203015]
 [conte-a084:51113] [ 2]
 /apps/rhel6/openmpi/1.8.1/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b857ee10715]
 [conte-a084:51113] [ 3] mpirun(orterun+0x1b45)[0x40684f]
 [conte-a084:51113] [ 4] mpirun(main+0x20)[0x4047f4]
 [conte-a084:51113] [ 5] 
 /lib64/libc.so.6(__libc_start_main+0xfd)[0x36dd41ed1d]
 [conte-a084:51113] [ 6] mpirun[0x404719]
 [conte-a084:51113] *** End of error message ***
 Segmentation fault (core dumped)
 
 On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain  wrote:
> I'm having no luck poking at this segfault issue. For some strange 
> reason, we seem to think there are coprocessors on those remote nodes - 
> e.g., a Phi card. Yet your lstopo output doesn't seem to show it.
> 
> Out of curiosity, can you try running this with "-mca plm rsh"? This will 
> substitute the rsh/ssh launcher in place of Torque - assuming your system 
> will allow it, this will let me see if the problem is somewhere in the 
> Torque launcher or elsewhere in OMPI.
> 
> Thanks
> Ralph
> 
> On Jun 6, 2014, at 12:48 PM, Dan Dietz  wrote:
> 
>> No problem -
>> 
>> These are Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips (per the
>> "model name" field in /proc/cpuinfo), 2 per node, 8 cores each. No
>> hyperthreading enabled.
>> 
>> $ lstopo
>> Machine (64GB)
>> NUMANode L#0 (P#0 32GB)
>> Socket L#0 + L3 L#0 (20MB)
>>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 
>> (P#0)
>>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 
>> (P#1)
>>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 
>> (P#2)
>>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 
>> (P#3)
>>   L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 
>> (P#4)
>>   L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 
>> (P#5)
>>   L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 
>> (P#6)
>>   L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 
>> (P#7)
>> HostBridge L#0
>>   PCIBridge
>> PCI 1000:0087
>>   Block L#0 "sda"
>>   PCIBridge
>> PCI 8086:2250
>>   PCIBridge
>> PCI 8086:1521
>>   Net L#1 "eth0"
>> PCI 8086:1521
>>   Net L#2 "eth1"
>>   PCIBridge
>> PCI 102b:0533
>>   PCI 8086:1d02
>> NUMANode L#1 (P#1 32GB)
>> Socket L#1 + L3 L#1 (20MB)
>>   L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 
>> (P#8)
>>   L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 
>> (P#9)
>>   L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>> + PU L#10 (P#10)
>>   L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>> + PU L#11 (P#11)
>>   L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>> + PU L#12 (P#12)
>>   L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>> + PU L#13 (P#13)
>>   L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Bennet Fauber
On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain  wrote:
> I've poked and prodded, and the 1.8.2 tarball seems to be handling this 
> situation

Ralph,

That's still the development tarball, right?  1.8.2 remains unreleased?

Is the ETA for 1.8.2 the end of this month?

Thanks,  -- bennet


Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
It isn't a development tarball - it's the current state of the release branch 
and is therefore managed much more strictly than the developer trunk. We are 
now preparing it for a release candidate. I have about a dozen CMRs waiting for 
final review before moving across to 1.8.2, and then we'll begin the release 
process.

Most of those changes involve closing memory leaks, and a few deal with more 
esoteric use cases, so I think most users would be okay at the moment. The 
nightly is passing rather cleanly, so I'm feeling good that the release isn't 
far off. Maybe not the end of June, since last-minute things always surface in 
the RC process, but likely early July.


On Jun 12, 2014, at 8:11 AM, Bennet Fauber  wrote:

> On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain  wrote:
>> I've poked and prodded, and the 1.8.2 tarball seems to be handling this 
>> situation
> 
> Ralph,
> 
> That's still the development tarball, right?  1.8.2 remains unreleased?
> 
> Is the ETA for 1.8.2 the end of this month?
> 
> Thanks,  -- bennet
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24643.php



Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Dan Dietz
Unfortunately, the nightly tarball appears to be crashing in a similar
fashion. :-( I used the latest snapshot 1.8.2a1r31981.
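
(For reference, this is more or less how I'm pulling the backtraces from the
core files -- the core file name below is just an example of what the crash
leaves behind:)

$ ulimit -c unlimited                  # make sure core dumps are enabled
$ mpirun -np 4 -machinefile ./nodes ./hello
$ gdb $(which mpirun) core.51113       # load mpirun together with its core file
(gdb) bt                               # print the stack backtrace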

Dan

On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain  wrote:
> I've poked and prodded, and the 1.8.2 tarball seems to be handling this 
> situation just fine. I don't have access to a Torque machine, but I did set 
> everything to follow the same code path, added faux coprocessors, etc. - and 
> it ran just fine.
>
> Can you try the 1.8.2 tarball and see if it solves the problem?

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
Arggh - is there any way I can get access to this beast so I can debug this? I 
can't figure out what in the world is going on, but it seems to be something 
triggered by your specific setup.


On Jun 12, 2014, at 8:48 AM, Dan Dietz  wrote:

> Unfortunately, the nightly tarball appears to be crashing in a similar
> fashion. :-( I used the latest snapshot 1.8.2a1r31981.
> 
> Dan

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Dan Dietz
That shouldn't be a problem. Let me figure out the process and I'll
get back to you.

Dan

On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain  wrote:
> Arggh - is there any way I can get access to this beast so I can debug this? 
> I can't figure out what in the world is going on, but it seems to be 
> something triggered by your specific setup.

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
Kewl - thanks! I'm a Purdue alum, if that helps :-)

On Jun 12, 2014, at 9:04 AM, Dan Dietz  wrote:

> That shouldn't be a problem. Let me figure out the process and I'll
> get back to you.
> 
> Dan

Re: [OMPI users] OPENIB unknown transport errors

2014-06-12 Thread Tim Miller
Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's
going on. On a node that's working fine (w2), under port 1 there is a line:

LinkLayer: InfiniBand

On a node that is having trouble (w3), that line is not present. The
question is why this inconsistency occurs.
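
For anyone who wants to spot-check their own nodes, a loop along these lines
is a quick way to compare (the hostnames are just ours, and the regex allows
for both the "LinkLayer" and "link_layer" spellings):

$ for h in w1 w2 w3 w16; do echo "== $h =="; ssh $h "ibv_devinfo -v" | grep -iE 'link_?layer'; done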

I don't seem to have ofed_info installed on my system -- not sure what
magical package Red Hat decided to hide that in. The InfiniBand stack I am
running is stock with our version of Scientific Linux (6.2). I am beginning
to wonder if this isn't some bug with the Red Hat/SL-provided InfiniBand
stack. I'll do some more poking, but at least now I've got something
semi-solid to poke at. Thanks for all of your help; I've attached the
results of "ibv_devinfo -v" for both systems, so if you see anything else
that jumps out at you, please let me know.
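
For the ofed_info hunt, I'll probably start with something like the following
on one of the nodes (assuming yum's file-level search behaves as usual; the
package names in the grep are just the likely suspects):

$ yum provides '*/ofed_info'   # ask yum which package, if any, ships ofed_info
$ rpm -qa | grep -iE 'libibverbs|librdmacm|infiniband-diags'   # what IB stack pieces are installed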

Tim




On Sat, Jun 7, 2014 at 2:21 AM, Mike Dubman 
wrote:

> could you please attach output of "ibv_devinfo -v" and "ofed_info -s"
> Thx
>
>
> On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller  wrote:
>
>> Hi Josh,
>>
>> I asked one of our more advanced users to add the "-mca btl_openib_if_include
>> mlx4_0:1" argument to his job script. Unfortunately, the same error
>> occurred as before.
>>
>> We'll keep digging on our end; if you have any other suggestions, please
>> let us know.
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller  wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for attempting to sort this out. In answer to your questions:
>>>
>>> 1. Node allocation is done by TORQUE, however we don't use the TM API to
>>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>>> mpirun uses the ssh launcher to actually communicate and launch the
>>> processes on remote nodes.
>>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>>> motherboard on most of our nodes, including all that have this issue). They
>>> are all configured to use InfiniBand (no IPoIB or other protocols).
>>> 3. No, we don't explicitly ask for a device port pair. We will try your
>>> suggestion and report back.
>>>
>>> Thanks again!
>>>
>>> Tim
>>>
>>>
>>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd 
>>> wrote:
>>>
 Strange indeed. This info (remote adapter info) is passed around in the
 modex and the struct is locally populated during add procs.

 1. How do you launch jobs? Mpirun, srun, or something else?
 2. How many active ports do you have on each HCA? Are they all
 configured to use IB?
 3. Do you explicitly ask for a device:port pair with the "if include"
 mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
 (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
 IB.)

 Josh


 On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller 
 wrote:

> Hi,
>
> I'd like to revive this thread, since I am still periodically getting
> errors of this type. I have built 1.8.1 with --enable-debug and run with
> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide 
> any
> additional information that I can find useful. I've gone ahead and 
> attached
> a dump of the output under 1.8.1. The key lines are:
>
>
> --------------------------------------------------------------------------
> Open MPI detected two different OpenFabrics transport types in the
> same Infiniband network.
> Such mixed network transport configuration is not supported by Open MPI.
>
>   Local host:            w1
>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>
>   Remote host:           w16
>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
> --------------------------------------------------------------------------
>
> Note that the vendor and part IDs are the same. If I immediately run
> on the same two nodes using MVAPICH2, everything is fine.
>
> I'm really very befuddled by this. OpenMPI sees that the two cards are
> the same and made by the same vendor, yet it thinks the transport types 
> are
> different (and one is unknown). I'm hoping someone with some experience
> with how the OpenIB BTL works can shed some light on this problem...
>
> Tim
>
>
> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd 
> wrote:
>
>>
>> Just wondering if you've tried with the latest stable OMPI, 1.8.1?
>> I'm wondering if this is an issue with the OOB. If you have a debug 
>> build,
>> you can run -mca btl_openib_verbose 10
>>
>> Josh
>>
>>
>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd 
>> wrote:
>>
>>> Hi, Tim
>>>
>>> Run "ibstat" on each host:
>>>
>>> 1. Make sure the adapters are alive and active.
>>>
>>> 2. Look at the Link Layer settings