> On Jun 2, 2015, at 7:10 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> 
> On 6/1/2015 9:51 PM, Ralph Castain wrote:
>> I’m wondering if it is also possible that the error message is simply 
>> printing that ID incorrectly. Looking at the code, it appears that we do 
>> perform the network byte translation correctly when we setup the data for 
>> transmission between the processes. However, I don’t see that translation 
>> being done before we print the error message. Hence, I think the error 
>> message is printing out the device ID incorrectly - and the problem truly is 
>> just that the queues are different.
>> 
> 
> Does the code convert the device id/part number into HBO before looking it up 
> in the .ini file?

All I could see was that it is converted to NBO for transmission, and to HBO at 
the remote end for use.  So both sides should have accurate IDs. I don’t know 
what happens beyond that, I’m afraid - this isn’t my particular code area.
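
That said, here is a minimal standalone sketch of the failure mode I have in mind (assuming a little-endian receiver like atlas3; this is not the actual BTL code): a 32-bit part ID that arrives in network byte order and gets printed before ntohl() shows up byte-swapped, which matches the 21505 vs 22282240 pair you work out below:

/* Hypothetical sketch (not the actual BTL code): printing a part ID
 * before vs. after ntohl() on a little-endian host. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void)
{
    uint32_t wire_id = htonl(0x5401);   /* the ID as it travels on the wire */

    /* Raw wire value printed on a little-endian host: 22282240 (0x01540000) */
    printf("without ntohl(): %u\n", (unsigned) wire_id);

    /* Converted back to host byte order first: 21505 (0x5401) */
    printf("with ntohl():    %u\n", (unsigned) ntohl(wire_id));
    return 0;
}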

> 
> Assuming atlas3 is just displaying the vendor and part numbers w/o converting 
> to HBO, they do look correct.  part ID 21505 is 0x5401, and part ID 22282240 
> is 0x5401 swapped:
> 
> [root@atlas3 openmpi]# echo $((0x5401))
> 21505
> [root@atlas3 openmpi]# echo $((0x01540000))
> 22282240
> 
> Looking at the .ini on both nodes however, I see valid and identical entries 
> for device 0x1425/0x5401:
> 
> [root@ppc64-rhel71 openmpi]# grep -3 0x5401 *ini
> 
> [Chelsio T5]
> vendor_id = 0x1425
> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> use_eager_rdma = 1
> mtu = 2048
> receive_queues = P,65536,64
> 
> [root@atlas3 openmpi]# grep -3 0x5401 *ini
> 
> [Chelsio T5]
> vendor_id = 0x1425
> vendor_part_id = 0xb000,0xb001,0x5400,0x5401,0x5402,0x5403,0x5404,0x5405,0x5406,0x5407,0x5408,0x5409,0x540a,0x540b,0x540c,0x540d,0x540e,0x540f,0x5410,0x5411,0x5412,0x5413
> use_eager_rdma = 1
> mtu = 2048
> receive_queues = P,65536,64
> 
> So I still think somehow the one node is looking up the wrong entry in the 
> .ini file.
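> 
> To illustrate what I mean (a hypothetical sketch, not the actual openib .ini 
> lookup code): if the key used for the lookup were the byte-swapped part ID, 
> it wouldn't match anything in the 0x5400-0x5413 list and the BTL would fall 
> back to some default receive_queues instead of the P,65536,64 entry above:
> 
> /* Hypothetical sketch, not the actual openib .ini lookup code. */
> #include <stdio.h>
> #include <stdint.h>
> 
> static const uint32_t t5_part_ids[] = {
>     0xb000, 0xb001, 0x5400, 0x5401, 0x5402, 0x5403, 0x5404, 0x5405,
>     0x5406, 0x5407, 0x5408, 0x5409, 0x540a, 0x540b, 0x540c, 0x540d,
>     0x540e, 0x540f, 0x5410, 0x5411, 0x5412, 0x5413
> };
> 
> /* Returns the [Chelsio T5] receive_queues if the part ID matches, else NULL. */
> static const char *lookup_receive_queues(uint32_t part_id)
> {
>     for (size_t i = 0; i < sizeof(t5_part_ids) / sizeof(t5_part_ids[0]); i++) {
>         if (t5_part_ids[i] == part_id)
>             return "P,65536,64";
>     }
>     return NULL;   /* no match: the BTL would fall back to its defaults */
> }
> 
> int main(void)
> {
>     const char *a = lookup_receive_queues(0x5401);      /* matches the Chelsio entry */
>     const char *b = lookup_receive_queues(0x01540000);  /* byte-swapped ID: no match */
>     printf("0x5401     -> %s\n", a ? a : "(built-in default)");
>     printf("0x01540000 -> %s\n", b ? b : "(built-in default)");
>     return 0;
> }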
> 
> Also: attached are the ompi_info outputs and a diff of the two.
> 
> Steve.
> 
> 
> 
>>> On Jun 1, 2015, at 7:30 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
>>> wrote:
>>> 
>>> This is not a heterogeneous run-time issue -- it's the issue that Nathan 
>>> cited: that OMPI detected different receive queue setups on different 
>>> machines.
>>> 
>>> As the error message states, the openib BTL simply cannot handle the case 
>>> where different MPI processes specify different receive queue specifications.
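>>> 
>>> To make that concrete, here is a rough sketch of the idea (not the actual 
>>> BTL code, and the spec strings are just illustrative): each process carries 
>>> a receive_queues spec, and if two peers' specs don't compare equal there is 
>>> nothing the BTL can do to reconcile them:
>>> 
>>> /* Rough sketch of the constraint, not the real openib BTL code. */
>>> #include <stdio.h>
>>> #include <string.h>
>>> 
>>> /* Succeeds only when both peers present the same receive queue specification. */
>>> static int check_receive_queues(const char *local, const char *remote)
>>> {
>>>     if (strcmp(local, remote) != 0) {
>>>         fprintf(stderr, "receive_queues mismatch: local=%s remote=%s\n",
>>>                 local, remote);
>>>         return -1;   /* nothing the BTL can do to reconcile the two specs */
>>>     }
>>>     return 0;
>>> }
>>> 
>>> int main(void)
>>> {
>>>     /* Illustrative mismatch: one side matched the Chelsio entry, the other didn't. */
>>>     return check_receive_queues("P,65536,64", "P,128,256,192,128") == 0 ? 0 : 1;
>>> }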
>>> 
>>> You mentioned that the device ID is being incorrectly identified: is that 
>>> OMPI's fault, or something wrong with the device itself?
>>> 
>>> 
>>> 
>>>> On Jun 1, 2015, at 6:06 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>> 
>>>> On 6/1/2015 9:53 AM, Ralph Castain wrote:
>>>>> Well, I checked and it looks to me like --hetero-apps is a stale option in 
>>>>> the master at least - I don’t see where it gets used.
>>>>> 
>>>>> Looking at the code, I would suspect that something didn’t get configured 
>>>>> correctly - either the --enable-heterogeneous flag didn’t get set on one 
>>>>> side, or we incorrectly failed to identify the BE machine, or both. You 
>>>>> might run ompi_info on the two sides and verify they both were built 
>>>>> correctly.
>>>> We'll check ompi_info...
>>>> 
>>>> Thanks!
>>>> 
>>>> Steve.
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
> 
> <atlas3_ompi_info.txt><diff.txt><ppc64_ompi_info.txt>
