Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
On Sep 11, 2008, at 6:29 PM, Prasanna Ranganathan wrote:

> I have tried the following to no avail. On 499 machines running openMPI 1.2.7:
>
>   mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld
>
> ... with different combinations of the following parameters:
>
>   -mca btl_base_verbose 1
>   -mca btl_base_debug 2
>   -mca oob_base_verbose 1
>   -mca oob_tcp_debug 1
>   -mca oob_tcp_listen_mode listen_thread
>   -mca btl_tcp_endpoint_cache 65536
>   -mca oob_tcp_peer_retries 120
>
> I still get the "No route to host" error messages.

This is quite odd -- with the oob_tcp_listen_mode option, we have run jobs with thousands of processes in the v1.2 series. The startup is still a bit slow (it's vastly improved in the upcoming v1.3 series), but we didn't run into problems like this.

Can you absolutely verify that you are running 1.2.7 on all of your nodes and that you have specified "-mca oob_tcp_listen_mode listen_thread" on the mpirun command line? The important part here is that when you invoke OMPI v1.2.7's mpirun on the head node, you are also using v1.2.7 on all the back-end nodes.

> Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did not
> get any additional useful debug output other than the error messages.
>
> I did notice one strange thing, though. The following is always successful
> (at least in all my attempts):
>
>   mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
>
> but
>
>   mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld --debug-daemons
>
> prints these error messages at the end from each of the nodes:
>
>   [idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
>   [idx2:04064] [0,0,1] orted_recv_pls: received exit
>   [idx2:04064] *** Process received signal ***
>   [idx2:04064] Signal: Segmentation fault (11)
>   [idx2:04064] Signal code: (128)
>   [idx2:04064] Failing at address: (nil)
>   [idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
>   [idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
>   [idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
>   [idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
>   [idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
>   [idx2:04064] *** End of error message ***
>
> I am not sure if this points to the actual cause of these issues. Is it to do
> with openMPI 1.2.7 having POSIX threads enabled in the current configuration
> on these nodes?

POSIX thread support being enabled should not cause these issues. What you want to see in the ompi_info output is the following:

  [6:46] svbu-mpi:~/hg/openib-fd-progress % ompi_info | grep thread
          Thread support: posix (mpi: no, progress: no)

The two "no"s are what is important here.

--
Jeff Squyres
Cisco Systems
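One quick way to verify the version on every node is to loop over the hostfile and ask ompi_info on each machine (a minimal sketch for a bash-like shell, assuming passwordless ssh to the nodes and a nodelist file with one hostname per line; the grep pattern just matches the "Open MPI:" line that ompi_info prints):

  while read host _; do
      echo -n "$host: "
      ssh -n "$host" "ompi_info | grep 'Open MPI:'"
  done < nodelist

Every node should report the same "Open MPI: 1.2.7" line; any node that reports a different version, or cannot find ompi_info at all, is a likely source of trouble.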
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Hi,

I have verified the openMPI version to be 1.2.7 on all the nodes, and ompi_info | grep thread reports "Thread support: posix (mpi: no, progress: no)" on these machines.

I get the error with and without -mca oob_tcp_listen_mode listen_thread. Sometimes the startup takes too long with listen_thread enabled and I have to resort to killing and restarting the program.

Would the following matter in any way?

1. The head node (the node where I start the mpi process) being a part of the cluster
2. The head node also being the root node (the node with vpid 0)
3. The head node not being a part of the cluster

I am currently trying the above and other combinations, such as tweaking -mca oob_tcp_thread_max_size.

The test program I run is the following:

  #include <boost/mpi.hpp>
  #include <iostream>
  #include <unistd.h>   // for gethostname

  int main(int argc, char **argv)
  {
    // Initialize the MPI environment
    boost::mpi::environment env(argc, argv);
    if (!env.initialized()) {
      std::cout << "Could not initialize MPI environment!" << std::endl;
      return -1;
    }
    boost::mpi::communicator world;

    // Find out my identity in the default communicator
    int myrank = world.rank();

    // Find out how many processes there are in the default communicator
    int ntasks = world.size();

    char hn[256];
    gethostname(hn, 255);

    std::cout << hn << " is node " << myrank << " of " << ntasks << std::endl;

    // Sum the ranks across all processes
    int allranks = boost::mpi::all_reduce(world, myrank, std::plus<int>());

    world.barrier();
    if (myrank == 0) {
      std::cout << "ranks sum to " << allranks << std::endl;
    }

    // the MPI environment is finalized when env is destructed
    return 0;
  }

I also tried a version without Boost.MPI, with the same results:

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>   /* for gethostname */

  int main (int argc, char* argv[])
  {
    int rank, size;

    MPI_Init (&argc, &argv);               /* starts MPI */
    MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
    MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
    char hn[256];
    gethostname(hn, 255);
    printf( "%s is node %d of %d\n", hn, rank, size );

    int all_ranks;
    int count[1024] = {1};   /* recvcounts: only rank 0 receives the reduced value */
    MPI_Reduce_scatter (&rank, &all_ranks, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier (MPI_COMM_WORLD);
    if (rank == 0)
      printf( "ranks sum to %d\n", all_ranks);
    MPI_Finalize();
    return 0;
  }

Regards,

Prasanna.
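For reference, a typical way to build and launch these two test programs (a sketch only: mpic++ and mpicc are Open MPI's standard wrapper compilers, but the source file names and the Boost library names below are assumptions for illustration and may differ on a given install):

  # Boost.MPI version
  mpic++ -o mpiHelloWorld mpiHelloWorld.cpp -lboost_mpi -lboost_serialization

  # plain MPI version
  mpicc -o mpiHelloWorld_c mpiHelloWorld.c

  # launch across the cluster
  mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld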
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Hi Prasanna,

Do you have any unusual ethernet interfaces on your nodes? I have seen similar problems when using IP over InfiniBand. I'm not sure exactly why, but mixing interfaces of different types (ib0 and eth0, for example) can sometimes cause these problems, possibly because they are on different subnets.

If you do have multiple interfaces on the same machine, use the btl_tcp_if_include and oob_tcp_if_include parameters to explicitly set the interfaces you want to use (an example command line is sketched after this message). I have also seen problems where eth0 on the head node is on a different subnet than eth0 on the compute nodes.

mch

2008/9/12 Prasanna Ranganathan :
> Hi,
>
> I have verified the openMPI version to be 1.2.7 on all the nodes and also
> ompi_info | grep thread is Thread support: posix (mpi: no, progress: no) on
> these machines.
>
> I get the error with and without -mca oob_tcp_listen_mode listen_thread.
> Sometimes, the startup takes too long with the listen_thread enabled and I
> have to resort to killing and restarting the program.
> [...]
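Concretely, forcing both the BTL and the OOB onto a single interface looks something like the following (a minimal sketch: btl_tcp_if_include and oob_tcp_if_include are standard Open MPI MCA parameters, but the interface name eth0 and the rest of the command line are only illustrative):

  mpirun -np 499 -bynode -hostfile nodelist \
      -mca btl_tcp_if_include eth0 \
      -mca oob_tcp_if_include eth0 \
      /main/mpiHelloWorld

Both parameters also accept a comma-separated list (e.g. eth0,eth1) if more than one interface should be used.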
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Hi,

I did make sure at the beginning that only eth0 was activated on all the nodes. Nevertheless, I am currently verifying the NIC configuration on all the nodes and making sure things are as expected.

While trying different things, I came across a peculiar error which I had detailed in one of my previous mails in this thread. I am testing the helloWorld program in the following trivial case:

  mpirun -np 1 -host localhost /main/mpiHelloWorld

which works fine. But

  mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

  Daemon [0,0,1] checking in as pid 2059 on host localhost
  [idx1:02059] [0,0,1] orted: received launch callback
  idx1 is node 0 of 1
  ranks sum to 0
  [idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
  [idx1:02059] [0,0,1] orted_recv_pls: received exit
  [idx1:02059] *** Process received signal ***
  [idx1:02059] Signal: Segmentation fault (11)
  [idx1:02059] Signal code: (128)
  [idx1:02059] Failing at address: (nil)
  [idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
  [idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2afa8be8e2a2]
  [idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2afa8be795ac]
  [idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2afa8be7675c]
  [idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
  [idx1:02059] *** End of error message ***

The same failure happens, with more verbose output, when using the -d flag. Does this point to a bug in OpenMPI, or am I missing something here?

I have attached the ompi_info output from this node.

Regards,

Prasanna.

ompi_info.txt
Description: Binary data
Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
Prasanna,

Please send me your /etc/make.conf and the contents of /var/db/pkg/sys-cluster/openmpi-1.2.7/. You can package these with the following command line:

  tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

and simply send me the data.tbz file.

Thanks,
Eric

Prasanna Ranganathan wrote:
> Hi,
>
> I did make sure at the beginning that only eth0 was activated on all the
> nodes. Nevertheless, I am currently verifying the NIC configuration on all
> the nodes and making sure things are as expected.
> [...]