It looks to me like you are picking up a different OMPI installation on the remote node. Check that PATH and LD_LIBRARY_PATH are being set correctly on the remote host.

On Jan 24, 2014, at 9:41 AM, etcamargo <etcama...@inf.ufpr.br> wrote:

> Hi, All!
>
> I have a problem running a simple "hello world" program across different hosts. The hosts are virtual machines on the same network. The program works fine only on a single host. ssh works between the machines, and NFS is working and shares the executable files between them.
>
> a) $ mpirun -hostfile machines -v -np 2 ./hello
>
> [achel:15275] [[32727,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
> [latrappe:16467] OPAL dss:unpack: got type 49 when expecting type 38
> [latrappe:16467] [[32727,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
> [latrappe:16467] [[32727,0],1] routed:binomial: Connection to lifeline [[32727,0],0] lost
>
> b) $ mpirun -mca plm_base_verbose 5 -hostfile machines -v -np 2 ./hello
>
> [achel:17020] mca:base:select:( plm) Querying component [rsh]
> [achel:17020] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [achel:17020] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [achel:17020] mca:base:select:( plm) Querying component [slurm]
> [achel:17020] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [achel:17020] mca:base:select:( plm) Selected component [rsh]
> [achel:17020] plm:base:set_hnp_name: initial bias 17020 nodename hash 2714559920
> [achel:17020] plm:base:set_hnp_name: final jobfam 1536
> [achel:17020] [[1536,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [achel:17020] [[1536,0],0] plm:base:receive start comm
> [achel:17020] released to spawn
> [achel:17020] [[1536,0],0] plm:base:setup_vm
> [achel:17020] [[1536,0],0] plm:base:setup_vm creating map
> [achel:17020] [[1536,0],0] plm:base:setup_vm add new daemon [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:setup_vm assigning new daemon [[1536,0],1] to node latrappe.c3local
> [achel:17020] [[1536,0],0] plm:rsh: launching vm
> [achel:17020] [[1536,0],0] plm:rsh: local shell: 0 (bash)
> [achel:17020] [[1536,0],0] plm:rsh: assuming same remote shell as local shell
> [achel:17020] [[1536,0],0] plm:rsh: remote shell: 0 (bash)
> [achel:17020] [[1536,0],0] plm:rsh: final template argv:
> /usr/bin/ssh <template> orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh
> [achel:17020] [[1536,0],0] plm:rsh: launching on node latrappe.c3local
> [achel:17020] [[1536,0],0] plm:rsh: recording launch of daemon [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:daemon_callback
> [achel:17020] [[1536,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh latrappe.c3local orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh]
> [latrappe:18212] mca:base:select:( plm) Querying component [rsh]
> [latrappe:18212] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [latrappe:18212] mca:base:select:( plm) Selected component [rsh]
> [achel:17020] [[1536,0],0] plm:base:orted_report_launch from daemon [[1536,0],1] via [[1536,0],1]
> [achel:17020] [[1536,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
> [achel:17020] [[1536,0],0] plm:base:orted_report_launch failed for daemon [[1536,0],1] (via [[1536,0],1]) at contact 100663296.1;tcp://10.254.222.7:33825
> [achel:17020] [[1536,0],0] plm:base:orted_cmd sending orted_exit commands
> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit abnormal term ordered
> [achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit sending cmd to [[1536,0],1]
> [achel:17020] [[1536,0],0] plm:base:orted_cmd message to [[1536,0],1] sent
> [achel:17020] [[1536,0],0] plm:base:orted_cmd all messages sent
> [achel:17020] [[1536,0],0] plm:tm: daemon launch failed on error (null)
> [latrappe:18212] OPAL dss:unpack: got type 49 when expecting type 38
> [latrappe:18212] [[1536,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
> [achel:17020] [[1536,0],0] plm:base:receive stop comm
> [latrappe:18212] [[1536,0],1] routed:binomial: Connection to lifeline [[1536,0],0] lost
>
> Thanks in advance,
>
> Edson
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
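
One way to act on the advice at the top of this thread is to compare what the local shell sees with what a non-interactive ssh session on the remote node sees. The sketch below is only an illustration and is not taken from the original posts: the function names ompi_env_report and ompi_env_report_remote are hypothetical, and it assumes bash on both ends (the log above reports bash as the local and remote shell). mpirun starts orted through ssh without a login shell, so settings that exist only in a login profile may be missing in that session.

```shell
#!/bin/bash
# Hypothetical helper: print the parts of the environment that decide
# which Open MPI installation a shell picks up.
ompi_env_report() {
    echo "node=$(uname -n)"
    echo "PATH=$PATH"
    echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}"
    # Show where mpirun and orted resolve from, if they resolve at all.
    for tool in mpirun orted; do
        echo "$tool=$(command -v "$tool" || echo 'not found')"
    done
    # Report the version line only when mpirun is actually on PATH.
    if command -v mpirun >/dev/null 2>&1; then
        echo "version=$(mpirun --version 2>&1 | head -n 1)"
    fi
}

# Hypothetical helper: run the same report on a remote host through a
# non-interactive ssh command. Assumes the remote shell is bash.
ompi_env_report_remote() {
    ssh "$1" "$(declare -f ompi_env_report); ompi_env_report"
}

# Always print the local view, then the remote view for every host
# named on the command line.
ompi_env_report
for host in "$@"; do
    ompi_env_report_remote "$host"
done
```

Saved under a name such as check_ompi.sh (also hypothetical), it could be run as "bash check_ompi.sh latrappe.c3local". If the two reports show different mpirun or orted paths, or different version lines, that would point to the kind of installation mismatch suggested above.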