Any solution for the following problem?
On Fri, Jan 23, 2009 at 7:58 PM, Sangamesh B <forum....@gmail.com> wrote:
> On Fri, Jan 23, 2009 at 5:41 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>> On Jan 22, 2009, at 11:26 PM, Sangamesh B wrote:
>>
>>> We've a cluster with 23 nodes connected to an IB switch and 8 nodes
>>> connected to an ethernet switch. The master node is also connected to the IB
>>> switch. SGE (with tight integration, -pe orte) is used for
>>> parallel/serial job submission.
>>>
>>> Open MPI 1.3 is installed on the master node with IB support
>>> (--with-openib=/usr). The same folder is copied to the remaining 23 IB
>>> nodes.
>>
>> Sounds good.
>>
>>> Now what shall I do for the remaining 8 ethernet nodes:
>>> (1) Copy the same folder (IB build) to these nodes
>>> (2) Install Open MPI on one of the 8 ethernet nodes. Copy the
>>> same to the other 7 nodes.
>>> (3) Install an ethernet-only build of Open MPI on the master node and
>>> copy it to the 8 nodes.
>>
>> Either 1 or 2 is your best bet.
>>
>> Do you have OFED installed on all nodes (either explicitly, or included in
>> your Linux distro)?
> No
>>
>> If so, I believe that at least some users with configurations like this
>> install OMPI with OFED support (--with-openib=/usr, as you mentioned above)
>> on all nodes. OMPI will notice that there is no OpenFabrics-capable
>> hardware on the ethernet-only nodes and will simply not use the openib BTL
>> plugin.
>>
>> Note that OMPI v1.3 got better about being silent about the lack of
>> OpenFabrics devices when the openib BTL is present (OMPI v1.2 issued a
>> warning about this).
>>
>> How you intend to use this setup is up to you; you may want to restrict jobs
>> to 100% IB or 100% ethernet via SGE, or you may want to let them mix,
>> realizing that the overall parallel job may be slowed down to the speed of
>> the slowest network (e.g., ethernet).
>>
>
> Now I have two basic problems:
>
> (1) Open MPI 1.3 is configured as:
> # ./configure --prefix=/opt/mpi/openmpi/1.3/intel --with-sge
> --with-openib=/usr | tee config_out
>
> But
>
> /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>     MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> shows only one component. Is this OK?
>
> (2) Open MPI itself is not working:
> ssh: connect to host chemfist.iitk.ac.in port 22: Connection timed out
> A daemon (pid 31343) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> On two nodes:
>
> # /opt/mpi/openmpi/1.3/intel/bin/mpirun -np 2 -hostfile ih hostname
> bash: /opt/mpi/openmpi/1.3/intel/bin/orted: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 31184) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> ibc0 - daemon did not report back when launched
> ibc1 - daemon did not report back when launched
>
>
> # cat ih
> ibc0
> ibc1
>
> Everything is fine.
> These IB interfaces can be pinged from the master node.
>
> # echo $LD_LIBRARY_PATH
> /opt/mpi/openmpi/1.3/intel/lib:/opt/intel/cce/10.0.023/lib:/opt/intel/fce/10.0.023/lib:/opt/intel/mkl/10.0.5.025/lib/em64t:/opt/gridengine/lib/lx26-amd6
>
> The IB tests are also working fine.
> Please help us to resolve this.
>
>> Make sense?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
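On the install question quoted above: Jeff's option 1 or 2 amounts to putting the same --with-openib build at the identical path on every node; the openib BTL is simply not used where no IB hardware is found. A minimal sketch of the copy step, assuming passwordless ssh as root and hypothetical ethernet host names eth01 through eth08 (substitute the real names from your cluster):

  # for h in eth01 eth02 eth03 eth04 eth05 eth06 eth07 eth08; do
        rsync -a /opt/mpi/openmpi/1.3/intel/ ${h}:/opt/mpi/openmpi/1.3/intel/
    done

The same loop over the IB node names (ibc0, ibc1, ...) would keep every node's copy in sync with the master whenever the install is rebuilt.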
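On the launch failures: the two error dumps point at different things. "Connection timed out" to chemfist.iitk.ac.in suggests a host in the job's machine list that mpirun cannot reach over ssh at all, while "orted: No such file or directory" (exit status 127) means the Open MPI tree is not present, or not visible, at that path on ibc0/ibc1 when a non-interactive shell starts. A few checks to run from the master node, as a sketch rather than a definitive fix, assuming passwordless ssh to the compute nodes:

  # ssh ibc0 ls /opt/mpi/openmpi/1.3/intel/bin/orted
  # ssh ibc0 'echo PATH=$PATH; echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH'
  # /opt/mpi/openmpi/1.3/intel/bin/mpirun --prefix /opt/mpi/openmpi/1.3/intel \
        -np 2 -hostfile ih hostname

The first command confirms the orted binary actually exists at the expected path on the remote node; the second shows what a non-interactive shell there sees. The --prefix option (or rebuilding with --enable-mpirun-prefix-by-default) tells mpirun to set PATH and LD_LIBRARY_PATH for the remote daemons itself, so the shell startup files on ibc0/ibc1 do not have to export them.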