On Mon, Jun 9, 2014 at 3:31 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> On Jun 9, 2014, at 5:41 PM, Vineet Rawat <vineetraw...@gmail.com> wrote:
>
> > We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug information is very limited as the cluster is at a remote customer site. They have a network card with which I'm not familiar (Cisco Systems Inc VIC P81E PCIe Ethernet NIC) and it seems capable of using the usNIC BTL.
>
> Unfortunately, this is the 1st generation Cisco VIC -- our usNIC BTL is only enabled starting with the 2nd generation Cisco VIC (the 12xx series, not the Pxxx series).
>
> So runs over this Ethernet NIC should be using just plain ol' TCP.

OK, that should be fine here.

> > I'm suspicious that it might be at the root of the problem. They're also bonding the 2 ports.
>
> FWIW, it's not necessary to bond the interfaces for Open MPI -- meaning that Open MPI will automatically stripe large messages across multiple IP interfaces, etc. So if they're bonding for the purposes of MPI bandwidth, you can tell them to turn off the bonding.

They said they're doing it for resilience, not bandwidth.

> Also note that, by default, Open MPI's TCP MPI transport will aggressively use *all* IP interfaces that it finds. So in your case, it will likely use bond0, eth0, *and* eth1. Meaning: OMPI can effectively oversubscribe the network coming out of each VIC. You might want to set a system-wide default MCA parameter to have OMPI not use the bond0 interface. For example, add this line to $prefix/etc/mca-params.conf:
>
>     btl_tcp_if_include = eth0,eth1
>
> This will have OMPI *only* use eth0 and eth1 -- it'll ignore lo and bond0.

OK, will do.

> > However, we're also doing a few unusual things which could be causing problems. Firstly, we built OpenMPI (I tried 1.6.4 and 1.8.1) without the ibverbs or usnic BTLs. Then, we only ship what (we think) we need: orterun, orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could there be a dependency on some other binary executable or dlopen'ed library? We also use a special plm_rsh_agent, but we've used this approach for some time without issue.
>
> All that sounds fine.
>
> Open MPI 1.8.1 is preferred; the 1.6.x series is pretty old at this point. If there's a bug in 1.8.1, it's a whole lot easier for us to fix it in the 1.8.x series.

Yes, we've been deploying 1.6.4 for a while and are wary of change. We only went to 1.8.1 to see if it changed anything related to this issue. I completely understand that any fixes, if needed, are likely to go into the latest version.

> > I tried a few different MCA settings, the most restrictive of which led to the failure of this command:
> >
> >     orted --debug --debug-daemons -mca ess env -mca orte_ess_jobid 1925054464 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn --mca orte_base_help_aggregate 1 --mca plm_rsh_agent yyy --mca btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 --mca btl tcp,self --mca btl_tcp_if_include bond0 --mca orte_create_session_dirs 0 --mca plm_rsh_assume_same_shell 0 -mca plm rsh -mca orte_debug_daemons 1 -mca orte_debug 1 -mca orte_tag_output 1
> >
> > It seems that the host is set up such that the core file is generated and immediately removed ("ulimit -c" is unlimited) but the abrt daemon is doing something weird.
>
> As Ralph mentioned, can you verify that the correct version MPI libraries are being picked up on the remote servers?
> E.g., is LD_LIBRARY_PATH being set properly in the shell startup files on the remote servers (e.g., to find the 1.8.1 shared libraries)?
>
> Also make sure that you install each version of Open MPI into a "clean" directory -- don't install OMPI 1.6.x into /foo and then install OMPI 1.8.x into /foo, too. The two versions are incompatible with each other, and have conflicting/not-wholly-overlapping libraries. Meaning: if you install OMPI 1.6.x into /foo, you should either "rm -rf /foo" before you install OMPI 1.8.x into /foo, or just install OMPI 1.8.x into /bar.

The installations are entirely separate. The LD_LIBRARY_PATH is set up by our own launch wrapper and I'm confident it's correct.

Vineet

> Make sense?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
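
P.S. For the record, this is roughly the sanity check I plan to run on one of the remote nodes the next time I get access. Treat it as a sketch only: $OUR_MPI_PREFIX stands in for wherever our launch wrapper installs Open MPI, and I'm assuming the nodes have the usual ldd and abrt tooling.

    # What LD_LIBRARY_PATH does a shell on the node actually see?
    echo $LD_LIBRARY_PATH

    # Any unresolved shared-library dependencies in the orted we ship?
    ldd $OUR_MPI_PREFIX/bin/orted | grep "not found"

    # Are core dumps enabled, and is abrt intercepting them?
    ulimit -c
    cat /proc/sys/kernel/core_pattern

If core_pattern turns out to be a pipe into abrt's hook (abrt-hook-ccpp), I'd take that to mean abrt is grabbing the core before we see it, and we'd have to pull it back out with the abrt tools (e.g. "abrt-cli list", if that's installed there) rather than look for a core file on disk.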