On Jun 9, 2014, at 2:41 PM, Vineet Rawat <vineetraw...@gmail.com> wrote:
> Hi,
>
> We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug
> information is very limited as the cluster is at a remote customer site.
> They have a network card with which I'm not familiar (Cisco Systems Inc
> VIC P81E PCIe Ethernet NIC) and it seems capable of using the usNIC BTL.
> I'm suspicious that it might be at the root of the problem. They're also
> bonding the 2 ports.

This shouldn't matter - the VIC should work fine.

> However, we're also doing a few unusual things which could be causing
> problems. Firstly, we built OpenMPI (I tried 1.6.4 and 1.8.1) without
> the ibverbs or usnic BTLs. Then, we only ship what (we think) we need:
> orterun, orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could
> there be a dependency on some other binary executable or dlopen'ed
> library? We also use a special plm_rsh_agent, but we've used this
> approach for some time without issue.

Did you remember to include all the libraries under <prefix>/lib/openmpi?
We need all of those or else the orted will fail. (A quick way to check is
sketched at the end of this mail.)

> I tried a few different MCA settings, the most restrictive of which led
> to the failure of this command:
>
> orted --debug --debug-daemons -mca ess env -mca orte_ess_jobid 1925054464
> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri
> \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn --mca
> orte_base_help_aggregate 1 --mca plm_rsh_agent yyy --mca
> btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 --mca btl
> tcp,self --mca btl_tcp_if_include bond0 --mca orte_create_session_dirs 0
> --mca plm_rsh_assume_same_shell 0 -mca plm rsh -mca orte_debug_daemons 1
> -mca orte_debug 1 -mca orte_tag_output 1
>
> It seems that the host is set up such that the core file is generated
> and immediately removed ("ulimit -c" is unlimited), but the abrt daemon
> is doing something weird. I'll be trying to get access to the system so
> I can use "--mca orte orte_daemon_spin" and attach a debugger (if that's
> how that's done). If I'm able to debug or obtain a core file, I'll
> provide more information. I've attached some information regarding the
> hardware, OpenMPI's configuration and ompi_info output. Any thoughts?
>
> Thanks,
> Vineet
>
> <orted_segv.tar.gz>
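On the "which files to ship" question: here's a rough sketch of how I'd
check the install. The <prefix> placeholder below is mine (substitute your
actual install path); nothing here is taken from your attachment.
Everything under <prefix>/lib/openmpi is an MCA plugin that the orted
dlopen's at startup, so that whole directory needs to travel along with
the four shared libraries you listed:

    # list the plugins the orted will try to open at startup
    ls <prefix>/lib/openmpi

    # look for anything else the shipped binaries pull in
    ldd <prefix>/bin/orted <prefix>/bin/orterun | grep "not found"
    ldd <prefix>/lib/openmpi/*.so | grep "not found"

If shipping the plugin directory is awkward for your packaging, another
option is to configure Open MPI with --disable-dlopen; the components are
then built directly into the main libraries and no separate plugin
directory is needed on the target nodes.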
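On the disappearing core files: on RHEL/CentOS-style systems abrt usually
registers itself as the kernel core handler and, by default, discards
dumps from binaries that don't come from a signed package, which would
match a core that appears and is immediately removed. A sketch of what I'd
check on the compute node (paths are the usual defaults; adjust for your
distribution):

    # if this starts with a pipe to abrt-hook-ccpp, abrt owns the cores
    cat /proc/sys/kernel/core_pattern

    # abrt keeps (or deletes) its dumps here on most versions
    ls /var/spool/abrt

    # either set ProcessUnpackaged = yes in
    # /etc/abrt/abrt-action-save-package-data.conf, or temporarily
    # restore plain core files in the working directory (as root):
    echo "core.%e.%p" > /proc/sys/kernel/core_pattern

For attaching a debugger: if I remember right, the parameter is the
boolean orte_daemon_spin, e.g.

    mpirun -mca orte_daemon_spin 1 ...

The orted then spins in a tight loop until you attach gdb to it and clear
the spin variable (check orted_main.c in your version for its exact name),
after which it continues normally.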