[OMPI users] about using mpi-thread-multiple
Hi, I would like to know which MPI version is recommended for making MPI calls from multiple threads of the same process, i.e., requesting MPI_THREAD_MULTIPLE in MPI_Init_thread(). Thanks, Edson
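As background: the thread level is requested through MPI_Init_thread(), and the library reports the level it actually grants in the last argument, which may be lower than what was asked for (e.g., on a build compiled without full thread support). A minimal C sketch that requests MPI_THREAD_MULTIPLE and checks the result:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided;

        /* Request full multi-threaded support; 'provided' reports the
           level the library actually grants, which may be lower. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available "
                    "(provided = %d)\n", provided);

        /* Threads may call MPI concurrently only when
           provided == MPI_THREAD_MULTIPLE. */

        MPI_Finalize();
        return 0;
    }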
[OMPI users] monitoring the status of processors
Hi, All. I would like to know if there is an (MPI) tool for monitoring the status of a processor (and its cores) at runtime, i.e., while an MPI application is running. Suppose some physical processors become overloaded while the MPI application runs: I am looking for a way to find out which processors are "busy" or "slow". Thanks in advance! Edson
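There is no standard MPI call that reports host load, but one rough do-it-yourself approach is for each rank to sample its own host's load average and gather the values at rank 0. A minimal sketch, assuming a Linux/glibc (or BSD) system where getloadavg() is available; the helper name report_load is illustrative:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>   /* getloadavg(), malloc() */

    /* Gather each host's 1-minute load average at rank 0. */
    void report_load(MPI_Comm comm)
    {
        int rank, size;
        double loads[3], load1 = -1.0;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (getloadavg(loads, 3) >= 1)
            load1 = loads[0];          /* 1-minute load average */

        double *all = NULL;
        if (rank == 0)
            all = malloc(size * sizeof(double));

        MPI_Gather(&load1, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);

        if (rank == 0) {
            for (int i = 0; i < size; i++)
                printf("rank %d: load average %.2f\n", i, all[i]);
            free(all);
        }
    }

Note this only reflects OS-level load on each node, not MPI-internal imbalance; calling it periodically from the application gives a coarse picture of which hosts are overloaded.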
[OMPI users] Connection to lifeline lost
Hi, All! I have a problem running a simple "hello world" program across different hosts. The hosts are virtual machines on the same network. The program works fine only when run on a single host; SSH between the machines works, and NFS is correctly sharing the executables between the machines.

a) $ mpirun -hostfile machines -v -np 2 ./hello
[achel:15275] [[32727,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
[latrappe:16467] OPAL dss:unpack: got type 49 when expecting type 38
[latrappe:16467] [[32727,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
[latrappe:16467] [[32727,0],1] routed:binomial: Connection to lifeline [[32727,0],0] lost

b) $ mpirun -mca plm_base_verbose 5 -hostfile machines -v -np 2 ./hello
[achel:17020] mca:base:select:( plm) Querying component [rsh]
[achel:17020] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[achel:17020] mca:base:select:( plm) Query of component [rsh] set priority to 10
[achel:17020] mca:base:select:( plm) Querying component [slurm]
[achel:17020] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[achel:17020] mca:base:select:( plm) Selected component [rsh]
[achel:17020] plm:base:set_hnp_name: initial bias 17020 nodename hash 2714559920
[achel:17020] plm:base:set_hnp_name: final jobfam 1536
[achel:17020] [[1536,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[achel:17020] [[1536,0],0] plm:base:receive start comm
[achel:17020] released to spawn
[achel:17020] [[1536,0],0] plm:base:setup_vm
[achel:17020] [[1536,0],0] plm:base:setup_vm creating map
[achel:17020] [[1536,0],0] plm:base:setup_vm add new daemon [[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:setup_vm assigning new daemon [[1536,0],1] to node latrappe.c3local
[achel:17020] [[1536,0],0] plm:rsh: launching vm
[achel:17020] [[1536,0],0] plm:rsh: local shell: 0 (bash)
[achel:17020] [[1536,0],0] plm:rsh: assuming same remote shell as local shell
[achel:17020] [[1536,0],0] plm:rsh: remote shell: 0 (bash)
[achel:17020] [[1536,0],0] plm:rsh: final template argv: /usr/bin/ssh orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh
[achel:17020] [[1536,0],0] plm:rsh: launching on node latrappe.c3local
[achel:17020] [[1536,0],0] plm:rsh: recording launch of daemon [[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:daemon_callback
[achel:17020] [[1536,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh latrappe.c3local orted -mca ess env -mca orte_ess_jobid 100663296 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri "100663296.0;tcp://10.254.222.5:37564" -mca plm_base_verbose 5 -mca plm rsh]
[latrappe:18212] mca:base:select:( plm) Querying component [rsh]
[latrappe:18212] mca:base:select:( plm) Query of component [rsh] set priority to 10
[latrappe:18212] mca:base:select:( plm) Selected component [rsh]
[achel:17020] [[1536,0],0] plm:base:orted_report_launch from daemon [[1536,0],1] via [[1536,0],1]
[achel:17020] [[1536,0],0] ORTE_ERROR_LOG: Out of resource in file base/plm_base_launch_support.c at line 482
[achel:17020] [[1536,0],0] plm:base:orted_report_launch failed for daemon [[1536,0],1] (via [[1536,0],1]) at contact 100663296.1;tcp://10.254.222.7:33825
[achel:17020] [[1536,0],0] plm:base:orted_cmd sending orted_exit commands
[achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit abnormal term ordered
[achel:17020] [[1536,0],0] plm:base:orted_cmd:orted_exit sending cmd to [[1536,0],1]
[achel:17020] [[1536,0],0] plm:base:orted_cmd message to [[1536,0],1] sent
[achel:17020] [[1536,0],0] plm:base:orted_cmd all messages sent
[achel:17020] [[1536,0],0] plm:tm: daemon launch failed on error (null)
[latrappe:18212] OPAL dss:unpack: got type 49 when expecting type 38
[latrappe:18212] [[1536,0],1] ORTE_ERROR_LOG: Pack data mismatch in file ../../../orte/orted/orted_comm.c at line 235
[achel:17020] [[1536,0],0] plm:base:receive stop comm
[latrappe:18212] [[1536,0],1] routed:binomial: Connection to lifeline [[1536,0],0] lost

Thanks in advance, Edson
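The "OPAL dss:unpack: got type 49 when expecting type 38" / "Pack data mismatch" messages typically mean the two hosts are running different Open MPI builds, which is the diagnosis in the reply below. A quick way to compare what each host actually picks up (the remote hostname is taken from the logs above, and this assumes ssh delivers the same environment the launcher sees):

    $ which mpirun && mpirun --version
    $ ssh latrappe.c3local which mpirun
    $ ssh latrappe.c3local mpirun --version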
Re: [OMPI users] Connection to lifeline lost
You are right. The problem was solved by using the full path of one specific MPI installation:

$ /home/myuser/openmpi-x/bin/mpirun -hostfile machines -np 2 ./hello

Thanks, Edson

On 24-01-2014 16:00, Ralph Castain wrote:

Looks to me like you are picking up a different OMPI installation on the remote node - check that your PATH and LD_LIBRARY_PATH on the remote host are being set correctly.

On Jan 24, 2014, at 9:41 AM, etcamargo wrote:

[quoted text of the original "Connection to lifeline lost" message snipped; see the post above]
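Beyond hard-coding the full path on the command line, two common ways to make the fix stick (a sketch; the install prefix /home/myuser/openmpi-x is taken from the command above and must match the actual installation location on every node):

    # have mpirun export the matching PATH/LD_LIBRARY_PATH to remote nodes
    $ /home/myuser/openmpi-x/bin/mpirun --prefix /home/myuser/openmpi-x \
          -hostfile machines -np 2 ./hello

    # or set the environment on each node, e.g. in ~/.bashrc
    export PATH=/home/myuser/openmpi-x/bin:$PATH
    export LD_LIBRARY_PATH=/home/myuser/openmpi-x/lib:$LD_LIBRARY_PATH

Open MPI can also be configured with --enable-mpirun-prefix-by-default so that mpirun behaves as if --prefix were always given.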
[OMPI users] NAS Parallel Benchmark implementation for (open) MPI/C
Hi, All. I am looking for a NAS Parallel Benchmark (NAS-PB) reference implementation coded in MPI/C. I see that the official NAS website has an MPI/Fortran implementation. Is there a NAS-PB reference implementation in (Open) MPI/C? Thanks in advance, Edson