Yes, it looks like you have a heterogeneous system (i.e., a binary compiled on 
one server doesn't necessarily run properly on another server).

In this case, you should see the heterogeneous section of the FAQ.
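
As a rough sketch of what that FAQ section boils down to (the FAQ itself is the 
authoritative reference; the install prefix and flag below are only illustrative), the 
usual approach is to build and install Open MPI separately on each class of machine, 
using the same installation prefix everywhere, so that every node links against its 
own local drivers and libraries. The heterogeneous-data flag is only needed if the 
machines also differ in data representation (e.g. endianness):

  # run on each node type, building against that node's own libraries
  ./configure --prefix=/opt/openmpi --enable-heterogeneous
  make all install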

Fair warning, though -- heterogeneous systems are more difficult to 
manage/maintain/use than homogeneous systems...
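
(A side note on the openib errors quoted below: if the upgraded node really has no 
working InfiniBand stack, one possible stopgap -- assuming all of the nodes can still 
reach each other over ordinary TCP/Ethernet -- is to tell Open MPI to skip the openib 
BTL entirely, e.g.:

  # exclude the OpenFabrics BTL; fall back to TCP and shared memory
  mpirun --mca btl ^openib -np 40 /home/MET/hrm/bin/hrm

or, equivalently, to list the transports explicitly with --mca btl tcp,sm,self. Expect 
noticeably lower performance than over InfiniBand.)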


On Mar 26, 2013, at 3:54 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:

> It may be because the other system is running an upgraded version of Linux 
> which does not have the InfiniBand drivers installed. Any solution?
> 
> 
> On Tue, Mar 26, 2013 at 12:42 PM, Syed Ahsan Ali <ahsansha...@gmail.com> 
> wrote:
> I tried this, but mpirun exits with this error:
>  
> mpirun -np 40 /home/MET/hrm/bin/hrm
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> librdmacm: assuming: 4
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> CMA: unable to get RDMA device list
> librdmacm: couldn't read ABI version.
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> --------------------------------------------------------------------------
> [[33095,1],8]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> Module: OpenFabrics (openib)
>   Host: pmd04.pakmet.com
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[33095,1],28]) is on host: compute-02-00.private02.pakmet.com
>   Process 2 ([[33095,1],0]) is on host: pmd02
>   BTLs attempted: openib self sm
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>  
>  
> Ahsan
> 
> On Fri, Mar 22, 2013 at 7:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> On Mar 22, 2013, at 3:42 AM, Syed Ahsan Ali <ahsansha...@gmail.com> wrote:
> 
>> Actually, due to some database corruption I am not able to add any new node 
>> to the cluster from the installer node. So I want to run the parallel job on more 
>> nodes without adding them to the existing cluster.
>> You are right, the binaries must be present on the remote node as well.
>> Is this possible through NFS, just as the compute nodes are NFS-mounted 
>> with the installer node?
> 
> Sure - OMPI doesn't care how the binaries got there. Just so long as they are 
> present on the compute node.
> 
>>  
>> Ahsan
>> 
>> 
>> On Fri, Mar 22, 2013 at 3:33 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>> Am 22.03.2013 um 10:14 schrieb Syed Ahsan Ali:
>> 
>> > I have a very basic question. If we want to run an mpirun job on two systems 
>> > which are not part of the cluster, how can we make it possible? Can a host 
>> > which is not a compute node, but rather a stand-alone system, be specified 
>> > to mpirun?
>> 
>> Sure, the machines can be specified as arguments to `mpiexec`. But do you 
>> want to run applications just between these two machines, or should they 
>> participate in a larger parallel job with machines of the cluster? In the latter 
>> case a direct network connection between the outside and the inside of the 
>> cluster is necessary, via some kind of forwarding, if these are separate networks.
>> 
>> Also, the paths to the started binaries may be different if the two machines 
>> are not sharing the same /home with the cluster, and this needs to be honored.
>> 
>> In case you are using a queuing system and want to route jobs to machines 
>> outside the set-up cluster: it is necessary to negotiate with the admin to allow 
>> jobs to be scheduled there.
>> 
>> -- Reuti
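
(For a concrete sketch of Reuti's first point above -- machine1, machine2 and the 
application path are placeholders here -- hosts outside a hostfile can be named 
directly on the command line:

  # launch one process on each of the two stand-alone machines
  mpiexec -np 2 --host machine1,machine2 /path/to/my_mpi_app

Both machines need the same Open MPI version and a reachable copy of the binary.)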
>> 
>> 
>> > Thanks
>> > Ahsan
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Syed Ahsan Ali Bokhari 
>> Electronic Engineer (EE)
>> 
>> Research & Development Division
>> Pakistan Meteorological Department H-8/4, Islamabad.
>> Phone # off  +92518358714
>> Cell # +923155145014
> 
> 
> 
>  
> 
> 
> 
> -- 
> Syed Ahsan Ali Bokhari 
> Electronic Engineer (EE)
> 
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

