Thank you,

I added the parameters and figured out that the iptables firewall was
interfering with something, so I disabled it on both machines.
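(For reference, instead of disabling the firewall entirely, the traffic between the two hosts could be allowed through. This is only a sketch under assumptions: the subnet is taken from the addresses in the log, the rules are illustrative, and the actual ports Open MPI picks are dynamic unless pinned.)

```shell
# Sketch (assumption): allow ssh and MPI traffic between the two hosts
# instead of disabling iptables entirely. Run as root on both machines.
# 192.168.54.0/24 matches the addresses seen in the log; adjust as needed.
iptables -A INPUT -p tcp --dport 22 -j ACCEPT             # ssh, used by the rsh/ssh launcher
iptables -A INPUT -p tcp -s 192.168.54.0/24 -j ACCEPT     # Open MPI OOB/BTL use dynamic TCP ports
```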
But now I get another error:

[superuser@localhost ~]$ mpirun --host 192.168.54.56 --leave-session-attached 
-mca plm_base_verbose 5 -mca oob_base_verbose 5 hostname clear
[localhost.localdomain:10884] mca:base:select:(  plm) Querying component 
[isolated]
[localhost.localdomain:10884] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0
[localhost.localdomain:10884] mca:base:select:(  plm) Querying component [rsh]
[localhost.localdomain:10884] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
rsh path NULL
[localhost.localdomain:10884] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10
[localhost.localdomain:10884] mca:base:select:(  plm) Querying component [slurm]
[localhost.localdomain:10884] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[localhost.localdomain:10884] mca:base:select:(  plm) Selected component [rsh]
[localhost.localdomain:10884] plm:base:set_hnp_name: initial bias 10884 
nodename hash 724106151
[localhost.localdomain:10884] plm:base:set_hnp_name: final jobfam 64011
[localhost.localdomain:10884] mca:oob:select: checking available component tcp
[localhost.localdomain:10884] mca:oob:select: Querying component [tcp]
[localhost.localdomain:10884] oob:tcp: component_available called
[localhost.localdomain:10884] [[64011,0],0] creating OOB-TCP module for 
interface eth0
[localhost.localdomain:10884] [[64011,0],0] creating OOB-TCP module for 
interface virbr0
[localhost.localdomain:10884] [[64011,0],0] TCP STARTUP
[localhost.localdomain:10884] [[64011,0],0] attempting to bind to IPv4 port 0
[localhost.localdomain:10884] mca:oob:select: Adding component to end
[localhost.localdomain:10884] mca:oob:select: Found 1 active transports
[localhost.localdomain:10884] [[64011,0],0] plm:rsh_setup on agent ssh : rsh 
path NULL
[localhost.localdomain:10884] [[64011,0],0] plm:base:receive start comm
[localhost.localdomain:10884] [[64011,0],0] plm:base:setup_job
[localhost.localdomain:10884] [[64011,0],0] plm:base:setup_vm
[localhost.localdomain:10884] [[64011,0],0] plm:base:setup_vm creating map
[localhost.localdomain:10884] [[64011,0],0] setup:vm: working unmanaged 
allocation
[localhost.localdomain:10884] [[64011,0],0] using dash_host
[localhost.localdomain:10884] [[64011,0],0] checking node 192.168.54.56
[localhost.localdomain:10884] [[64011,0],0] plm:base:setup_vm add new daemon 
[[64011,0],1]
[localhost.localdomain:10884] [[64011,0],0] plm:base:setup_vm assigning new 
daemon [[64011,0],1] to node 192.168.54.56
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: launching vm
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: local shell: 0 (bash)
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: assuming same remote shell 
as local shell
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: remote shell: 0 (bash)
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted -mca ess env -mca orte_ess_jobid 
4195024896 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 -mca 
orte_hnp_uri "4195024896.0;tcp://192.168.54.137,192.168.122.1:45032" 
--tree-spawn -mca plm_base_verbose 5 -mca oob_base_verbose 5 -mca plm rsh -mca 
orte_leave_session_attached 1
[localhost.localdomain:10884] [[64011,0],0] plm:rsh:launch daemon 0 not a child 
of mine
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: adding node 192.168.54.56 
to launch list
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: activating launch event
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: recording launch of daemon 
[[64011,0],1]
[localhost.localdomain:10884] [[64011,0],0] plm:rsh: executing: (/usr/bin/ssh) 
[/usr/bin/ssh 192.168.54.56  orted -mca ess env -mca orte_ess_jobid 4195024896 
-mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"4195024896.0;tcp://192.168.54.137,192.168.122.1:45032" --tree-spawn -mca 
plm_base_verbose 5 -mca oob_base_verbose 5 -mca plm rsh -mca 
orte_leave_session_attached 1]
[CUDAServer:04970] mca:base:select:(  plm) Querying component [rsh]
[CUDAServer:04970] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[CUDAServer:04970] mca:base:select:(  plm) Selected component [rsh]
[CUDAServer:04970] mca:oob:select: checking available component tcp
[CUDAServer:04970] mca:oob:select: Querying component [tcp]
[CUDAServer:04970] oob:tcp: component_available called
[CUDAServer:04970] [[64011,0],1] TCP STARTUP
[CUDAServer:04970] [[64011,0],1] attempting to bind to IPv4 port 0
[CUDAServer:04970] mca:oob:select: Adding component to end
[CUDAServer:04970] mca:oob:select: Found 1 active transports
[CUDAServer:04970] [[64011,0],1]: set_addr to uri 
4195024896.0;tcp://192.168.54.137,192.168.122.1:45032
[CUDAServer:04970] [[64011,0],1]:set_addr checking if peer [[64011,0],0] is 
reachable via component tcp
[CUDAServer:04970] [[64011,0],1] oob:tcp: working peer [[64011,0],0] address 
tcp://192.168.54.137,192.168.122.1:45032
[CUDAServer:04970] [[64011,0],1]:tcp set addr for peer [[64011,0],0]
[CUDAServer:04970] [[64011,0],1]: peer [[64011,0],0] is reachable via component 
tcp
[CUDAServer:04970] [[64011,0],1] OOB_SEND: rml_oob_send.c:199
[CUDAServer:04970] [[64011,0],1] oob:base:send to target [[64011,0],0]
[CUDAServer:04970] [[64011,0],1] oob:tcp:send_nb to peer [[64011,0],0]:10
[CUDAServer:04970] [[64011,0],1] tcp:send_nb to peer [[64011,0],0]
[CUDAServer:04970] [[64011,0],1]:[oob_tcp.c:508] post send to [[64011,0],0]
[CUDAServer:04970] [[64011,0],1]:[oob_tcp.c:442] processing send to peer 
[[64011,0],0]:10
[CUDAServer:04970] [[64011,0],1]:[oob_tcp.c:476] queue pending to [[64011,0],0]
[CUDAServer:04970] [[64011,0],1] tcp:send_nb: initiating connection to 
[[64011,0],0]
[CUDAServer:04970] [[64011,0],1]:[oob_tcp.c:490] connect to [[64011,0],0]
[localhost.localdomain:10884] [[64011,0],0] connection_handler: working 
connection (12, 0) 192.168.54.56:38362
[CUDAServer:04970] [[64011,0],1] MESSAGE SEND COMPLETE TO [[64011,0],0] OF 
12963 BYTES ON SOCKET 9
[localhost.localdomain:10884] [[64011,0],0] ORTE_ERROR_LOG: Data unpack failed 
in file base/plm_base_launch_support.c at line 964  
<==ERROR=============================================
[localhost.localdomain:10884] [[64011,0],0] plm:base:orted_cmd sending 
orted_exit commands
[localhost.localdomain:10884] [[64011,0],0] plm:base:receive stop comm
[localhost.localdomain:10884] [[64011,0],0] TCP SHUTDOWN
[localhost.localdomain:10884] [[64011,0],0] RELEASING PEER OBJ [[64011,0],1]
[localhost.localdomain:10884] [[64011,0],0] CLOSING SOCKET 12
[CUDAServer:04970] [[64011,0],1]-[[64011,0],0] mca_oob_tcp_msg_recv: peer 
closed connection
[CUDAServer:04970] [[64011,0],1] TCP SHUTDOWN
[CUDAServer:04970] [[64011,0],1] RELEASING PEER OBJ [[64011,0],0]
[CUDAServer:04970] [[64011,0],1] CLOSING SOCKET 9


Regards

Benjamin Giehle
