Steve --

I think that this falls directly in your purview since you volunteered to 
maintain the openib BTL (this HCA ID thing is part of the openib BTL 
bootstrapping).  :-)
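
For anyone following along, the byte-order arithmetic Steve quotes below is easy to double-check. Here is a minimal Python sketch; it is not Open MPI code, and swap32, is_known_part, and the abbreviated ID set are hypothetical names made up for illustration. It assumes the part ID is handled as a 32-bit unsigned value, and shows that 0x5401 byte-swapped is 22282240 and that the swapped value is not in the [Chelsio T5] part list:

```python
import struct


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    # Pack as big-endian, then reinterpret the same 4 bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


# A few of the part IDs from the [Chelsio T5] section quoted below
# (abbreviated for illustration; the real list is longer).
CHELSIO_T5_PART_IDS = {0xB000, 0xB001, 0x5400, 0x5401, 0x5402, 0x5403}


def is_known_part(part_id: int) -> bool:
    """Hypothetical lookup: is this part ID listed in the section?"""
    return part_id in CHELSIO_T5_PART_IDS


part_id = 0x5401
swapped = swap32(part_id)

print(part_id, swapped)
print(is_known_part(part_id), is_known_part(swapped))
```

If a lookup were ever done on the swapped value, it would miss that section; whether that actually happens in the openib BTL is the open question here.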


> On Jun 2, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> 
>> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>> 
>> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>>> I’m wondering if it is also possible that the error message is simply 
>>> printing that ID incorrectly. Looking at the code, it appears that we do 
>>> perform the network byte translation correctly when we set up the data for 
>>> transmission between the processes. However, I don’t see that translation 
>>> being done before we print the error message. Hence, I think the error 
>>> message is printing out the device ID incorrectly - and the problem truly 
>>> is just that the queues are different.
>>> 
>> 
>> Does the code convert the device id/part number into HBO before looking it 
>> up in the .ini file?
> 
> All I could see was that it is converted to NBO for transmission, and to HBO 
> at the remote end for use.  So both sides should have accurate IDs. I don’t 
> know what happens beyond that, I’m afraid - this isn’t my particular code 
> area.
> 
>> 
>> Assuming atlas3 is just displaying the vendor and part numbers w/o 
>> converting to HBO, they do look correct.  part ID 21505 is 0x5401, and part 
>> ID 22282240 is 0x5401 swapped:
>> 
>> [root@atlas3 openmpi]# echo $((0x5401))
>> 21505
>> [root@atlas3 openmpi]# echo $((0x01540000))
>> 22282240
>> 
>> Looking at the .ini on both nodes however, I see valid and identical entries 
>> for device 0x1425/0x5401:
>> 
>> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>> 
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>> 
>> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>> 
>> [Chelsio T5]
>> vendor_id = 0x1425
>> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>> use_eager_rdma = 1
>> mtu = 2048
>> receive_queues = P,65536,64
>> 
>> So I still think that somehow one node is looking up the wrong entry in the 
>> .ini file.
>> 
>> Also:  Attached are the ompi_info outputs and a diff of the two.
>> 
>> Steve.
>> 
>> 
>> 
>>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>>> wrote:
>>>> 
>>>> This is not a heterogeneous run-time issue -- it's the issue that Nathan 
>>>> cited: that OMPI detected different receive queue setups on different 
>>>> machines.
>>>> 
>>>> As the error message states, the openib BTL simply cannot handle it when 
>>>> different MPI processes specify different receive queue specifications.
>>>> 
>>>> You mentioned that the device ID is being incorrectly identified: is that 
>>>> OMPI's fault, or something wrong with the device itself?
>>>> 
>>>> 
>>>> 
>>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> 
>>>>> wrote:
>>>>> 
>>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>>> Well, I checked and it looks to me like --hetero-apps is a stale option 
>>>>>> in the master at least - I don’t see where it gets used.
>>>>>> 
>>>>>> Looking at the code, I would suspect that something didn’t get 
>>>>>> configured correctly - either the --enable-heterogeneous flag didn’t get 
>>>>>> set on one side, or we incorrectly failed to identify the BE machine, or 
>>>>>> both. You might run ompi_info on the two sides and verify they both were 
>>>>>> built correctly.
>>>>> We'll check ompi_info...
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Steve.
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2015/06/27025.php
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to: 
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>> 
>> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
