On Mon, Jun 9, 2014 at 3:21 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> On Jun 9, 2014, at 2:41 PM, Vineet Rawat <vineetraw...@gmail.com> wrote:
>>
>> Hi,
>>
>> We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug
>> information is very limited as the cluster is at a remote customer site.
>> They have a network card with which I'm not familiar (Cisco Systems Inc VIC
>> P81E PCIe Ethernet NIC) and it seems capable of using the usNIC BTL. I'm
>> suspicious that it might be at the root of the problem. They're also
>> bonding the 2 ports.
>
> This shouldn't matter - the VIC should work fine.

Great, glad to hear that.

>> However, we're also doing a few unusual things which could be causing
>> problems. Firstly, we built OpenMPI (I tried 1.6.4 and 1.8.1) without the
>> ibverbs or usnic BTLs. Then, we only ship what (we think) we need: orterun,
>> orted, libmpi, libmpi_cxx, libopen-rte and libopen-pal. Could there be a
>> dependency on some other binary executable or dlopen'ed library? We also
>> use a special plm_rsh_agent, but we've used this approach for some time
>> without issue.
>
> Did you remember to include all the libraries under <prefix>/lib/openmpi?
> We need all of those or else the orted will fail.

No, we only included what seemed necessary (from ldd output and experience
on other clusters). The only things in my <prefix>/lib/openmpi are
libompi_dbg_msgq*. Is that what you're referring to? In <prefix>/lib for
1.8.1 (ignoring the VampirTrace libs) I could add libmpi_mpifh,
libmpi_usempi, libompitrace and/or liboshmem. Anything needed there?
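For anyone hitting the same packaging question: ldd only reports link-time dependencies, while Open MPI loads its MCA components from <prefix>/lib/openmpi via dlopen() at runtime, so those plugins never appear in ldd output. A minimal sketch of the distinction (the /opt/openmpi prefix is a hypothetical stand-in, and /bin/ls is used only as an always-present example binary):

```shell
# ldd lists only link-time (DT_NEEDED) dependencies; dlopen()'ed MCA
# plugins never show up in its output, which is why shipping just the
# ldd-visible libraries can still leave orted broken at runtime.

# Illustrate ldd output on a binary guaranteed to exist:
ldd /bin/ls

# On the cluster, the analogous checks would be (prefix is hypothetical):
#   ldd /opt/openmpi/bin/orted              # link-time deps only
#   ls /opt/openmpi/lib/openmpi/mca_*.so    # dlopen'ed components ldd misses
```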
Thanks for the help,
Vineet

>> I tried a few different MCA settings, the most restrictive of which led to
>> the failure of this command:
>>
>> orted --debug --debug-daemons -mca ess env -mca orte_ess_jobid 1925054464
>> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri
>> \"1925054464.0;tcp://10.xxx.xxx.xxx:40547\" --tree-spawn --mca
>> orte_base_help_aggregate 1 --mca plm_rsh_agent yyy --mca
>> btl_tcp_port_min_v4 2000 --mca btl_tcp_port_range_v4 100 --mca btl tcp,self
>> --mca btl_tcp_if_include bond0 --mca orte_create_session_dirs 0 --mca
>> plm_rsh_assume_same_shell 0 -mca plm rsh -mca orte_debug_daemons 1 -mca
>> orte_debug 1 -mca orte_tag_output 1
>>
>> It seems that the host is set up such that the core file is generated and
>> immediately removed ("ulimit -c" is unlimited) but the abrt daemon is doing
>> something weird. I'll be trying to get access to the system so I can use
>> "-mca orte_daemon_spin 1" and attach a debugger (if that's how that's
>> done). If I'm able to debug or obtain a core file I'll provide more
>> information. I've attached some information regarding the hardware,
>> OpenMPI's configuration and ompi_info output. Any thoughts?
>>
>> Thanks,
>> Vineet
>> <orted_segv.tar.gz>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users