Prasanna,
I opened a bug report to enable better control over the threading
options (http://bugs.gentoo.org/show_bug.cgi?id=237435). In the
meantime, if your helloWorld isn't too fluffy, could you send it
over (off list if you prefer) so I can take a look at it? The
segmentation fault is probably hinting at another problem. Also, could
you send the output of ompi_info now that you've recompiled Open MPI
with USE=-threads? I want to make sure the option went through as it
should. Simply attach the file named out.txt after running the following
command:
ompi_info > out.txt
...RTF files tend to make my eyes cross ;)
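For reference, something along these lines is all it takes (a minimal
sketch of a standard MPI hello world; your actual program may differ):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                        /* shut down cleanly */
    return 0;
}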
Thanks,
Eric
Prasanna Ranganathan wrote:
Hi,
I have tried the following to no avail.
On 499 machines running Open MPI 1.2.7:
mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...
with different combinations of the following parameters:
-mca btl_base_verbose 1
-mca btl_base_debug 2
-mca oob_base_verbose 1
-mca oob_tcp_debug 1
-mca oob_tcp_listen_mode listen_thread
-mca btl_tcp_endpoint_cache 65536
-mca oob_tcp_peer_retries 120
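For example, one such combination in full (illustrative; I tried
various subsets of the above):

mpirun -np 499 -bynode -hostfile nodelist \
  -mca oob_tcp_listen_mode listen_thread \
  -mca oob_tcp_peer_retries 120 \
  /main/mpiHelloWorld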
I still get the "No route to host" error messages.
Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did
not get any additional useful debug output other than the error messages.
I did notice one strange thing, though. The following is always
successful (at least in all my attempts):
mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
but
mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
--debug-daemons
prints these error messages at the end from each of the nodes:
[idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx2:04064] [0,0,1] orted_recv_pls: received exit
[idx2:04064] *** Process received signal ***
[idx2:04064] Signal: Segmentation fault (11)
[idx2:04064] Signal code: (128)
[idx2:04064] Failing at address: (nil)
[idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
[idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
[idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
[idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
[idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
[idx2:04064] *** End of error message ***
I am not sure if this points to the actual cause of these issues. Is it
to do with Open MPI 1.2.7 having POSIX threads enabled in its current
configuration on these nodes?
Thanks again for your continued help.
Regards,
Prasanna.
Date: Thu, 11 Sep 2008 12:16:50 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
To: Open MPI Users <us...@open-mpi.org>
On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote:
I have upgraded to 1.2.7 and am still noticing the issue.
FWIW, we didn't change anything with regard to OOB and TCP from 1.2.6
-> 1.2.7, but it's still good to be at the latest version.
Try running with this MCA parameter:
mpirun --mca oob_tcp_listen_mode listen_thread ...
Sorry; I forgot that we did not enable that option by default in the
v1.2 series.