Also check to ensure you are using the same version of OMPI on all nodes - this message usually means that a different version was used on at least one node.
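One quick way to check for a version mismatch is to ask `mpirun` for its version on every node and compare the output. This is only a sketch: `node1` and `node2` are placeholder host names, and it assumes passwordless ssh is set up (as it must be for Open MPI anyway).

```shell
# Compare the Open MPI version reported on each node.
# "node1 node2" is a placeholder host list; replace with your nodes.
hosts="node1 node2"
for h in $hosts; do
  echo "== $h =="
  # 2>&1 so an ssh or PATH failure shows up alongside the version line
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" 'mpirun --version' 2>&1 | head -n 1
done
```

Any node that prints a different version (or an error about `mpirun` not being found) is the one to fix.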
> On Dec 23, 2016, at 1:58 AM, gil...@rist.or.jp wrote:
>
> Serguei,
>
> this looks like a very different issue, orted cannot be remotely started.
>
> that typically occurs if orted cannot find some dependencies
> (the Open MPI libs and/or the compiler runtime)
>
> for example, from a node, ssh <other node> orted should not fail because of
> unresolved dependencies.
>
> a simple trick is to replace
> mpirun ...
> with
> `which mpirun` ...
>
> a better option (as long as you do not plan to relocate the Open MPI install
> dir) is to configure with
> --enable-mpirun-prefix-by-default
>
> Cheers,
>
> Gilles
>
> ----- Original Message -----
>
> Hi All!
> As there have been no positive changes with the "UDSM + IPoIB" problem since
> my previous post, we installed IPoIB on the cluster and the "No OpenFabrics
> connection..." error no longer appears.
> But now Open MPI reports another problem.
>
> In the app ERROR OUTPUT stream:
>
> [node2:14142] [[37935,0],0] ORTE_ERROR_LOG: Data unpack had inadequate space
> in file base/plm_base_launch_support.c at line 1035
>
> In the app OUTPUT stream:
>
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> When I run the task on a single node, everything works properly, but when I
> specify "run on 2 nodes", the problem appears.
>
> I tried pinging the IPoIB addresses: all hosts resolve properly, and ping
> requests and replies travel over IB without any problems, so all nodes
> (including the head node) can see each other via IPoIB. But the MPI app
> still fails.
>
> The same test task works perfectly on all nodes when run with the Ethernet
> transport instead of InfiniBand.
>
> P.S. We use the Torque resource manager to enqueue MPI tasks.
>
> Best regards,
> Sergei.
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
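The dependency check Gilles describes can be sketched as a short script: verify that `orted` is on the PATH and that all of its shared libraries resolve, both locally and over a non-interactive ssh session (which does not run login-shell PATH setup). Here `node2` is a placeholder host name, and passwordless ssh is assumed.

```shell
# Check that orted is findable and that its shared libraries resolve.
# "node2" is a placeholder; assumes passwordless ssh.
orted_path=$(command -v orted)
if [ -z "$orted_path" ]; then
  echo "orted not found in PATH"
else
  echo "orted found at $orted_path"
  # any 'not found' line below is a missing shared library
  ldd "$orted_path" | grep 'not found' || echo "all local shared libraries resolved"
fi
# the same must hold over a non-interactive ssh, which is how mpirun launches orted
ssh -o BatchMode=yes -o ConnectTimeout=5 node2 'command -v orted' 2>&1 | head -n 1
status=done
```

If the remote check fails while the local one succeeds, that points at PATH/LD_LIBRARY_PATH not being set for non-interactive shells, which is exactly what `--enable-mpirun-prefix-by-default` works around.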