On Jun 9, 2014, at 2:41 PM, Vineet Rawat <vineetraw...@gmail.com> wrote:

> Hi,
> 
> We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug 
> information is very limited as the cluster is at a remote customer site. They 
> have a network card with which I'm not familiar (Cisco Systems Inc VIC P81E 
> PCIe Ethernet NIC) and it seems capable of using the usNIC BTL. I'm 
> suspicious that it might be at the root of the problem. They're also bonding 
> the 2 ports.

This shouldn't matter - the VIC should work fine.

> 
> However, we're also doing a few unusual things which could be causing 
> problems. Firstly, we built OpenMPI (I tried 1.6.4 and 1.8.1) without the 
> ibverbs or usnic BTLs. Then, we only ship what (we think) we need: orterun, 
> orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could there be a 
> dependency on some other binary executable or dlopen'ed library? We also use 
> a special plm_rsh_agent but we've used this approach for some time without 
> issue.

Did you remember to include all the libraries under <prefix>/lib/openmpi? We 
need all of those (they are the dlopen'ed MCA component plugins), or else the 
orted will fail.
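
A quick way to check a shipped tree is sketched below. It is only a sketch, and 
it assumes a default build in which components are separate plugin files named 
mca_<framework>_<component>.so under <prefix>/lib/openmpi (a static or 
--disable-dlopen build would not have them). The function name and the 
REQUIRED list are hypothetical; the list only covers the components named on 
the orted command line quoted below, and the orted opens more frameworks than 
those, so an empty result does not prove the package is complete.

```python
import os

# Components named on the orted command line in this thread
# (-mca ess env, -mca plm rsh, --mca btl tcp,self).
REQUIRED = [("ess", "env"), ("plm", "rsh"), ("btl", "tcp"), ("btl", "self")]


def missing_components(prefix, required=REQUIRED):
    """Return the (framework, component) pairs with no plugin file on disk."""
    plugin_dir = os.path.join(prefix, "lib", "openmpi")
    # A missing plugin directory is treated as "nothing is present".
    present = set(os.listdir(plugin_dir)) if os.path.isdir(plugin_dir) else set()
    return [
        (framework, component)
        for framework, component in required
        if "mca_%s_%s.so" % (framework, component) not in present
    ]
```

Calling missing_components() with the install prefix of the shipped package 
returns the pairs whose plugin file is absent; comparing the full directory 
listing against a complete install is the more thorough check.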

> 
> I tried a few different MCA settings, the most restrictive of which led to 
> the failure of this command:
> 
> orted --debug --debug-daemons -mca ess env -mca orte_ess_jobid 1925054464 
> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri 
> \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn --mca 
> orte_base_help_aggregate 1 --mca plm_rsh_agent yyy --mca btl_tcp_port_min_v4 
> 2000 --mca btl_tcp_port_range_v4 100 --mca btl tcp,self --mca 
> btl_tcp_if_include bond0 --mca orte_create_session_dirs 0 --mca 
> plm_rsh_assume_same_shell 0 -mca plm rsh -mca orte_debug_daemons 1 -mca 
> orte_debug 1 -mca orte_tag_output 1
> 
> It seems that the host is set up such that the core file is generated and 
> immediately removed ("ulimit -c" is unlimited) but the abrt daemon is doing 
> something weird. I'll be trying to get access to the system so I can use 
> "--mca orte orte_daemon_spin" and attach a debugger (if that's how that's 
> done). If I'm able to debug or obtain a core file I'll provide more 
> information. I've attached some information regarding the hardware, OpenMPI's 
> configuration and ompi_info output. Any thoughts?
> 
> Thanks,
> Vineet
> <orted_segv.tar.gz>
