Henk,

On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> Dear Tim
>
> I followed the use of "--mca btl mx,self" as suggested in the FAQ
> http://www.open-mpi.org/faq/?category=myrinet#myri-btl

Yeah, that FAQ is wrong. I am working right now to fix it up. It should be
updated this afternoon.
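For what it's worth, once the FAQ is fixed it will presumably end up
recommending an invocation along these lines (just a sketch, reusing your
./cpi test program; the full explanation is below):

  mpirun --mca btl mx,sm,self -np 4 ./cpi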
> When I use your extra mca value I get:
>
> >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> --------------------------------------------------------------------------
> WARNING: A user-supplied value attempted to override the read-only MCA
> parameter named "btl_mx_shared_mem".
>
> The user-supplied value was ignored.

Oops, on the 1.2 branch this is a read-only parameter. On the current trunk
the user can change it. Sorry for the confusion.

Oh well, you should probably use Open MPI's shared memory support instead
anyways. So you should either pass '-mca btl mx,sm,self', or just pass
nothing at all. Open MPI is fairly smart at figuring out what components to
use, so you really should not need to specify anything.

> followed by the same error messages as before.
> Note that although I add "self" the error messages complain about it
> missing:
>
> > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > If you specified the use of a BTL component, you may have forgotten a
> > component (such as "self") in the list of usable components.
>
> I checked the output from mx_info for both the current node and another;
> there seems not to be a problem.
> I attach the output from ompi_info --all.
> Also
>
> >ompi_info | grep mx
> Prefix: /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
> MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
>
> As a further check, I rebuilt the exe with mpich and that works fine on
> the same node over myrinet. I wonder whether mx is properly included in
> my openmpi build.
> Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> which is not the case for the ompi built exe, suggesting something is
> missing?

No, this is expected behavior. The Open MPI executables are not linked to
libmyriexpress.so, only the mx components. So if you do an ldd on
/usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so,
this should show the Myrinet library (see the example command in the P.S.
at the end of this message).

> I used --with-mx=/usr/local/Cluster-Apps/mx/mx-1.1.1
> and a listing of that directory is
>
> >ls /usr/local/Cluster-Apps/mx/mx-1.1.1
> bin  etc  include  lib  lib32  lib64  sbin
>
> This should be sufficient, I don't need --with-mx-libdir?

Correct.

Hope this helps,

Tim

> Thanks
>
> Henk
>
> > -----Original Message-----
> > From: users-boun...@open-mpi.org
> > [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> > Sent: 05 July 2007 18:16
> > To: Open MPI Users
> > Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> >
> > Hi Henk,
> >
> > By specifying '--mca btl mx,self' you are telling Open MPI
> > not to use its shared memory support. If you want to use Open
> > MPI's shared memory support, you must add 'sm' to the list.
> > I.e. '--mca btl mx,sm,self'. If you would rather use MX's shared
> > memory support, instead use '--mca btl mx,self --mca
> > btl_mx_shared_mem 1'. However, in most cases I believe Open
> > MPI's shared memory support is a bit better.
> >
> > Alternatively, if you don't specify any btls, Open MPI should
> > figure out what to use automatically.
> >
> > Hope this helps,
> >
> > Tim
> >
> > SLIM H.A. wrote:
> > > Hello
> > >
> > > I have compiled openmpi-1.2.3 with the --with-mx=<directory>
> > > configuration and gcc compiler.
> > > On testing with 4-8 slots I get an error message, the mx ports are
> > > busy:
> > >
> > >> mpirun --mca btl mx,self -np 4 ./cpi
> > >
> > > [node001:10071] mca_btl_mx_init: mx_open_endpoint() failed with
> > > status=20
> > > [node001:10074] mca_btl_mx_init: mx_open_endpoint() failed with
> > > status=20
> > > [node001:10073] mca_btl_mx_init: mx_open_endpoint() failed with
> > > status=20
> > > --------------------------------------------------------------------------
> > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > If you specified the use of a BTL component, you may have forgotten a
> > > component (such as "self") in the list of usable components.
> > > ... snipped
> > > It looks like MPI_INIT failed for some reason; your parallel process
> > > is likely to abort. There are many reasons that a parallel process
> > > can fail during MPI_INIT; some of which are due to configuration or
> > > environment problems. This failure appears to be an internal failure;
> > > here's some additional information (which may only be relevant to an
> > > Open MPI developer):
> > >
> > > PML add procs failed
> > > --> Returned "Unreachable" (-12) instead of "Success" (0)
> > > --------------------------------------------------------------------------
> > > *** An error occurred in MPI_Init
> > > *** before MPI was initialized
> > > *** MPI_ERRORS_ARE_FATAL (goodbye)
> > > mpirun noticed that job rank 0 with PID 10071 on node node001 exited
> > > on signal 1 (Hangup).
> > >
> > > I would not expect mx messages as communication should not go through
> > > the mx card? (This is a twin dual core shared memory node.) The same
> > > happens when testing on 2 nodes, using a hostfile.
> > > I checked the state of the mx card with mx_endpoint_info and mx_info,
> > > they are healthy and free.
> > > What is missing here?
> > >
> > > Thanks
> > >
> > > Henk
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
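P.S. Here is the kind of check I mean for the library linkage question (a
sketch, assuming the install prefix reported by your ompi_info output
above; adjust the path if your layout differs):

  ldd /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3/lib/openmpi/mca_btl_mx.so | grep myriexpress

If libmyriexpress.so shows up in that output, the mx BTL component was
built and linked correctly, even though the application executable itself
does not reference it.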