Cheers,
Gilles
On Wednesday, June 3, 2015, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
Steve --
I think that this falls directly in your purview since you
volunteered to maintain the openib BTL (this HCA ID thing is part
of the openib BTL bootstrapping). :-)
> On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>> On Jun 2, 2015, at 7:10 AM, Steve Wise
<sw...@opengridcomputing.com> wrote:
>>
>> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>>> I’m wondering if it is also possible that the error message is
simply printing that ID incorrectly. Looking at the code, it
appears that we do perform the network byte translation correctly
when we set up the data for transmission between the processes.
However, I don’t see that translation being done before we print
the error message. Hence, I think the error message is printing
out the device ID incorrectly - and the problem truly is just that
the queues are different.
>>>
>>
>> Does the code convert the device id/part number into HBO before
looking it up in the .ini file?
>
> All I could see was that it is converted to NBO for
transmission, and to HBO at the remote end for use. So both sides
should have accurate IDs. I don’t know what happens beyond that,
I’m afraid - this isn’t my particular code area.
>
>>
>> Assuming atlas3 is just displaying the vendor and part numbers
w/o converting to HBO, they do look correct. Part ID 21505 is
0x5401, and part ID 22282240 is 0x01540000, i.e. 0x5401 byte-swapped as a 32-bit value:
>>
>> [root@atlas3 openmpi]# echo $((0x5401))
>> 21505
>> [root@atlas3 openmpi]# echo $((0x01540000))
>> 22282240
>>
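The arithmetic above can be checked with a short Python sketch. It only illustrates a 32-bit byte swap and is not code from the openib BTL; the helper name swap32 is made up for this example.

```python
import struct


def swap32(value: int) -> int:
    """Reverse the byte order of an unsigned 32-bit integer."""
    # Pack as big-endian, then reinterpret the same four bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


part_id = 0x5401
swapped = swap32(part_id)

print(part_id, hex(part_id))    # the part ID as listed in the .ini file
print(swapped, hex(swapped))    # the same four bytes read in the opposite order
print(swap32(swapped) == part_id)
```

Swapping twice returns the original value, which fits the idea that one side is printing or using the ID without a byte-order conversion rather than holding a corrupted ID.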
>> Looking at the .ini on both nodes however, I see valid and
identical entries for device 0x1425/0x5401:
>>
>> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>>
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>>
>> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>>
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>>
>> So I still think somehow the one node is looking up the wrong
entry in the .ini file.
>>
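To illustrate why a byte-swapped part ID would matter for the .ini lookup, here is a minimal Python sketch. It assumes the lookup matches a section on both vendor_id and vendor_part_id; the real openib BTL does this in C with its own parser, so configparser, parse_ids, find_section and the matching rule are stand-ins for illustration only. The [Chelsio T5] entry is the one shown by grep above.

```python
import configparser

# The [Chelsio T5] entry as shown by grep on both nodes.
INI_TEXT = """
[Chelsio T5]
vendor_id = 0x1425
vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
use_eager_rdma = 1
mtu = 2048
receive_queues = P,65536,64
"""


def parse_ids(value):
    """Turn a comma-separated list like '0x5400,0x5401' into a set of ints."""
    return {int(item, 0) for item in value.split(",") if item.strip()}


def find_section(ini_text, vendor_id, part_id):
    """Return the name of the first section listing both IDs, else None."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    for name in parser.sections():
        section = parser[name]
        vendors = parse_ids(section.get("vendor_id", fallback=""))
        parts = parse_ids(section.get("vendor_part_id", fallback=""))
        if vendor_id in vendors and part_id in parts:
            return name
    return None


# Lookup with the part ID in host byte order.
print(find_section(INI_TEXT, 0x1425, 0x5401))
# Lookup with the byte-swapped value that appeared in the error message.
print(find_section(INI_TEXT, 0x1425, 0x01540000))
```

In this sketch the first lookup finds the section and the second does not. What Open MPI actually does when no section matches is not covered here.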
>> Also: Attached are the ompi_info outputs and a diff of the two.
>>
>> Steve.
>>
>>
>>
>>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
>>>>
>>>> This is not a heterogeneous run-time issue -- it's the issue
that Nathan cited: that OMPI detected different receive queue
setups on different machines.
>>>>
>>>> As the error message states, the openib BTL simply cannot
handle different MPI processes specifying different receive
queue specifications.
>>>>
>>>> You mentioned that the device ID is being incorrectly
identified: is that OMPI's fault, or something wrong with the
device itself?
>>>>
>>>>
>>>>
>>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise
<sw...@opengridcomputing.com> wrote:
>>>>>
>>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>>> Well, I checked and it looks to me like --hetero-apps is a
stale option in the master at least - I don’t see where it gets used.
>>>>>>
>>>>>> Looking at the code, I would suspect that something didn’t
get configured correctly - either the --enable-heterogeneous flag
didn’t get set on one side, or we incorrectly failed to identify
the BE machine, or both. You might run ompi_info on the two sides
and verify they both were built correctly.
>>>>> We'll check ompi_info...
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Steve.
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
http://www.open-mpi.org/community/lists/users/2015/06/27025.php
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>
>>
>> <atlas3_ompi_info.txt> <diff.txt> <ppc64_ompi_info.txt>
>
--
Jeff Squyres
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/06/27034.php