In the master branch, the code is in opal/mca/btl/openib/btl_openib_component.c

In the 1.8/1.10 series the code is in the same file, but located under the 
ompi/mca/btl/openib directory
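
To illustrate the byte-order point raised below: 22282240 is 0x01540000, which is 
just 0x5401 (21505) with its bytes swapped. Here is a minimal standalone sketch 
(plain C, not the actual btl_openib code) of what an unconverted wire value would 
look like if printed on the x86 side:

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    int main(void)
    {
        uint32_t part_id = 0x5401;         /* Chelsio T5 part id, 21505 decimal */
        uint32_t wire    = htonl(part_id); /* network byte order, as sent to the peer */

        /* Printing "wire" without ntohl() mimics an error message that
         * forgets the conversion. */
        printf("host value: %u (0x%08x)\n", (unsigned) part_id, (unsigned) part_id);
        printf("raw wire  : %u (0x%08x)\n", (unsigned) wire, (unsigned) wire);
        printf("converted : %u (0x%08x)\n", (unsigned) ntohl(wire), (unsigned) ntohl(wire));
        return 0;
    }

On the little-endian (x86) node the "raw wire" line prints 22282240 (0x01540000) - 
the same bogus number from the error message - while on the big-endian ppc64 node 
htonl()/ntohl() are no-ops and it stays 21505. So a missing ntohl() before the 
error print would produce exactly what you are seeing, even if the actual lookup 
and comparison are done on correctly converted values.

On the verbose question: running with "--mca btl_base_verbose 100" should make the 
openib BTL considerably chattier during setup; I don't recall offhand whether it 
logs the .ini match itself, but it is worth a try. And as a possible workaround 
while this gets sorted out, forcing the same spec on both sides with 
"--mca btl_openib_receive_queues P,65536,64" may let the job run.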

> On Jun 2, 2015, at 8:14 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> 
> On 6/2/2015 10:04 AM, Ralph Castain wrote:
>> 
>>> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>> 
>>> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>>>> I’m wondering if it is also possible that the error message is simply 
>>>> printing that ID incorrectly. Looking at the code, it appears that we do 
>>>> perform the network byte translation correctly when we setup the data for 
>>>> transmission between the processes. However, I don’t see that translation 
>>>> being done before we print the error message. Hence, I think the error 
>>>> message is printing out the device ID incorrectly - and the problem truly 
>>>> is just that the queues are different.
>>>> 
>>> 
>>> Does the code convert the device id/part number into HBO before looking it 
>>> up in the .ini file?
>> 
>> All I could see was that it is converted to NBO for transmission, and to HBO 
>> at the remote end for use.  So both sides should have accurate IDs. I don’t 
>> know what happens beyond that, I’m afraid - this isn’t my particular code 
>> area.
>> 
> That makes 2 of us :)
> 
> Where is this code located in the ompi tree? 
> 
> Are there any verbose parameters that will help show more detail on how it is 
> searching the .ini file?
> 
> 
>>> 
>>> Assuming atlas3 is just displaying the vendor and part numbers w/o 
>>> converting to HBO, they do look correct.  part ID 21505 is 0x5401, and part 
>>> ID 22282240 is 0x5401 swapped:
>>> 
>>> [root@atlas3 openmpi]# echo $((0x5401))
>>> 21505
>>> [root@atlas3 openmpi]# echo $((0x01540000))
>>> 22282240
>>> 
>>> Looking at the .ini on both nodes however, I see valid and identical 
>>> entries for device 0x1425/0x5401:
>>> 
>>> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
>>> 
>>> [Chelsio T5]
>>> vendor_id = 0x1425
>>> vendor_part_id = 
>>> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>>> use_eager_rdma = 1
>>> mtu = 2048
>>> receive_queues = P,65536,64
>>> 
>>> [root@atlas3 openmpi]# grep -3 0x5401 *ini
>>> 
>>> [Chelsio T5]
>>> vendor_id = 0x1425
>>> vendor_part_id = 
>>> 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
>>> use_eager_rdma = 1
>>> mtu = 2048
>>> receive_queues = P,65536,64
>>> 
>>> So I still think somehow the one node is looking up the wrong entry in the 
>>> .ini file.
>>> 
>>> Also:  Attached are the ompi-info outputs and a diff of the two.
>>> 
>>> Steve.
>>> 
>>> 
>>> 
>>>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>> 
>>>>> This is not a heterogeneous run-time issue -- it's the issue that Nathan 
>>>>> cited: that OMPI detected different receive queue setups on different 
>>>>> machines.
>>>>> 
>>>>> As the error message states, the openib BTL simply cannot handle it when 
>>>>> different MPI processes specify different receive queue specifications.
>>>>> 
>>>>> You mentioned that the device ID is being incorrectly identified: is that 
>>>>> OMPI's fault, or something wrong with the device itself?
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>> 
>>>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>>>> Well, I checked and it looks to me like --hetero-apps is a stale option 
>>>>>>> in the master at least - I don’t see where it gets used.
>>>>>>> 
>>>>>>> Looking at the code, I would suspect that something didn’t get 
>>>>>>> configured correctly - either the --enable-heterogeneous flag didn’t get 
>>>>>>> set on one side, or we incorrectly failed to identify the BE machine, 
>>>>>>> or both. You might run ompi_info on the two sides and verify they both 
>>>>>>> were built correctly.
>>>>>> We'll check ompi_info...
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> Steve.
>>>>> 
>>>>> -- 
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to: 
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>> 
>>> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
>> 
>> 
