Hi,

We've deployed OpenMPI on a small cluster but are getting a SEGV in orted. Debug information is very limited because the cluster is at a remote customer site. The nodes have a network card I'm not familiar with (Cisco Systems Inc VIC P81E PCIe Ethernet NIC) which seems capable of using the usNIC BTL, and I suspect it might be at the root of the problem. They're also bonding the two ports.
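One thing I plan to confirm is whether any usnic (or openib) BTL component is even present in the OpenMPI install those nodes run, since that would make the card a much stronger suspect. From a machine with the same build that should just be something like this (the grep is only illustrative):

  # list the BTL components present in this build
  ompi_info | grep -i "MCA btl"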
However, we're also doing a few unusual things which could be causing problems. First, we built OpenMPI (I tried both 1.6.4 and 1.8.1) without the ibverbs or usnic BTLs. Second, we ship only what (we think) we need: orterun, orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could there be a dependency on some other binary executable or dlopen'ed library? We also use a special plm_rsh_agent, but we've used that approach for some time without issue.

I tried a few different MCA settings; the most restrictive of them led to the failure of this command:

  orted --debug --debug-daemons -mca ess env \
    -mca orte_ess_jobid 1925054464 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 \
    -mca orte_hnp_uri \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn \
    --mca orte_base_help_aggregate 1 --mca plm_rsh_agent yyy \
    --mca btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 \
    --mca btl tcp,self --mca btl_tcp_if_include bond0 \
    --mca orte_create_session_dirs 0 --mca plm_rsh_assume_same_shell 0 \
    -mca plm rsh -mca orte_debug_daemons 1 -mca orte_debug 1 -mca orte_tag_output 1

It seems that the host is set up such that a core file is generated and immediately removed ("ulimit -c" is unlimited), but the abrt daemon is doing something weird with it.

I'll be trying to get access to the system so I can use "--mca orte_daemon_spin 1" and attach a debugger (if that's how that's done). If I'm able to debug or obtain a core file I'll provide more information.
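On the dependency question: my understanding is that, unless OpenMPI was configured with --disable-dlopen, it also loads its MCA components at runtime as plugins from lib/openmpi under the install prefix, so besides running ldd on the binaries we ship I intend to check whether that plugin directory exists on the target nodes. Roughly this, with paths from our own layout standing in as placeholders:

  # direct shared-library dependencies of the pieces we ship
  ldd /opt/ourapp/bin/orted
  ldd /opt/ourapp/lib/libopen-rte.so

  # the plugin directory OpenMPI would normally dlopen components from
  ls /opt/ourapp/lib/openmpi/mca_*.so

If that plugin directory really is needed and we aren't shipping it, that alone might explain odd behaviour in orted, but I'm speculating.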
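Regarding the vanishing cores: I believe abrt works by pointing kernel.core_pattern at its own pipe handler, so the first thing I'll check when I get access is roughly:

  # see where core files are actually being routed
  cat /proc/sys/kernel/core_pattern

  # temporarily dump plain core files to a known location instead (as root)
  echo "/tmp/core.%e.%p" > /proc/sys/kernel/core_pattern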
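As for the debugger, assuming orte_daemon_spin does what I think (leaves orted spinning so a debugger can be attached), my plan is simply something like:

  # attach to the spinning daemon on the remote node
  gdb -p $(pgrep -n orted)

I gather I'll then need to clear whatever variable it spins on before letting it continue and waiting for the crash, but I'll work that out with the source in front of me.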
I've attached some information regarding the hardware, OpenMPI's configuration and ompi_info output. Any thoughts?

Thanks,
Vineet

Attachment: orted_segv.tar.gz