On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:

-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5

I'm afraid the output will be a tad verbose, but I would appreciate seeing it. Might also tell us something about the lib issue.


Command line was:

/usr/local/openmpi/bin/mpirun -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5 -n 16 --host xserve03,xserve04 ../build/mitgcmuv


Starting: ../results//TasGaussRestart16
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [rsh]
[saturna.cluster:07360] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [slurm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [tm]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component [tm]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:(  plm) Querying component [xgrid]
[saturna.cluster:07360] mca:base:select:(  plm) Skipping component [xgrid]. Query failed to return a module
[saturna.cluster:07360] mca:base:select:(  plm) Selected component [rsh]
[saturna.cluster:07360] plm:base:set_hnp_name: initial bias 7360 nodename hash 1656374957
[saturna.cluster:07360] plm:base:set_hnp_name: final jobfam 14551
[saturna.cluster:07360] [[14551,0],0] plm:base:receive start comm
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_dash_host" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_hostfile" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_localhost" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: ras "mca_ras_xgrid" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca:base:select:( odls) Querying component [default]
[saturna.cluster:07360] mca:base:select:( odls) Query of component [default] set priority to 1
[saturna.cluster:07360] mca:base:select:( odls) Selected component [default]
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] [[14551,0],0] plm:rsh: setting up job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:setup_job for job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: local shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: assuming same remote shell as local shell
[saturna.cluster:07360] [[14551,0],0] plm:rsh: remote shell: 0 (bash)
[saturna.cluster:07360] [[14551,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp://142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose 5 -mca odls_base_verbose 5
[saturna.cluster:07360] [[14551,0],0] plm:rsh: launching on node xserve03
[saturna.cluster:07360] [[14551,0],0] plm:rsh: recording launch of daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve03 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp://142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
Daemon was launched on xserve03.local - beginning to initialize
[xserve03.local:40708] mca:base:select:( odls) Querying component [default]
[xserve03.local:40708] mca:base:select:( odls) Query of component [default] set priority to 1
[xserve03.local:40708] mca:base:select:( odls) Selected component [default]
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve03.local:40708] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
Daemon [[14551,0],1] checking in as pid 40708 on host xserve03.local
Daemon [[14551,0],1] not using static ports
[saturna.cluster:07360] [[14551,0],0] plm:rsh: launching on node xserve04
[saturna.cluster:07360] [[14551,0],0] plm:rsh: recording launch of daemon [[14551,0],2]
[saturna.cluster:07360] [[14551,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh xserve04 PATH=/usr/local/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /usr/local/openmpi/bin/orted --debug-daemons -mca ess env -mca orte_ess_jobid 953614336 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "953614336.0;tcp://142.104.154.96:49622;tcp://192.168.2.254:49622" -mca plm_base_verbose 5 -mca odls_base_verbose 5]
Daemon was launched on xserve04.local - beginning to initialize
[xserve04.local:40450] mca:base:select:( odls) Querying component [default]
[xserve04.local:40450] mca:base:select:( odls) Query of component [default] set priority to 1
[xserve04.local:40450] mca:base:select:( odls) Selected component [default]
[xserve04.local:40450] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve04.local:40450] mca: base: component_find: iof "mca_iof_svc" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
Daemon [[14551,0],2] checking in as pid 40450 on host xserve04.local
Daemon [[14551,0],2] not using static ports
[saturna.cluster:07360] [[14551,0],0] plm:base:daemon_callback
[saturna.cluster:07360] progressed_wait: base/plm_base_launch_support.c 459
[xserve04.local:40450] [[14551,0],2] orted: up and running - waiting for commands!
[saturna.cluster:07360] defining message event: base/plm_base_launch_support.c 423
[saturna.cluster:07360] defining message event: base/plm_base_launch_support.c 423
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_report_launch from daemon [[14551,0],1]
[xserve03.local:40708] [[14551,0],1] orted: up and running - waiting for commands!
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_report_launch completed for daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_report_launch from daemon [[14551,0],2]
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_report_launch completed for daemon [[14551,0],2]
[saturna.cluster:07360] [[14551,0],0] plm:base:daemon_callback completed
[saturna.cluster:07360] [[14551,0],0] plm:base:launch_apps for job [14551,1]
[saturna.cluster:07360] defining message event: grpcomm_bad_module.c 183
[saturna.cluster:07360] [[14551,0],0] plm:base:report_launched for job [14551,1]
[saturna.cluster:07360] progressed_wait: base/plm_base_launch_support.c 712
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor called by [[14551,0],0] for tag 1
[saturna.cluster:07360] [[14551,0],0] node[0].name saturna daemon 0 arch ffc90200
[saturna.cluster:07360] [[14551,0],0] node[1].name xserve03 daemon 1 arch ffc90200
[saturna.cluster:07360] [[14551,0],0] node[2].name xserve04 daemon 2 arch ffc90200
[saturna.cluster:07360] [[14551,0],0] orted_cmd: received add_local_procs
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list
[saturna.cluster:07360] [[14551,0],0] odls:construct_child_list unpacking data to launch job [14551,1]
[saturna.cluster:07360] [[14551,0],0] odls:construct_child_list adding new jobdat for job [14551,1]
[saturna.cluster:07360] [[14551,0],0] odls:construct_child_list unpacking 1 app_contexts
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 1 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 2 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 3 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 4 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 5 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 6 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 7 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 8 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 9 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 10 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 11 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 12 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 13 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 14 on node 1 with daemon 1
[saturna.cluster:07360] [[14551,0],0] odls:constructing child list - checking proc 15 on node 2 with daemon 2
[saturna.cluster:07360] [[14551,0],0] odls:construct:child: num_participating 2
[saturna.cluster:07360] [[14551,0],0] odls:launch found 4 processors for 0 children and set oversubscribed to false
[saturna.cluster:07360] [[14551,0],0] odls:launch reporting job [14551,1] launch status
[saturna.cluster:07360] defining message event: base/odls_base_default_fns.c 1219
[saturna.cluster:07360] [[14551,0],0] odls:launch setting waitpids
[saturna.cluster:07360] [[14551,0],0] orte:daemon:send_relay
[saturna.cluster:07360] [[14551,0],0] orte:daemon:send_relay sending relay msg to 1
[saturna.cluster:07360] [[14551,0],0] orte:daemon:send_relay sending relay msg to 2
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch from daemon [[14551,0],0]
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch completed processing
[xserve04.local:40450] [[14551,0],2] node[0].name saturna daemon 0 arch ffc90200
[xserve04.local:40450] [[14551,0],2] node[1].name xserve03 daemon 1 arch ffc90200
[xserve04.local:40450] [[14551,0],2] node[2].name xserve04 daemon 2 arch ffc90200
[xserve04.local:40450] [[14551,0],2] orted_cmd: received add_local_procs
[xserve03.local:40708] [[14551,0],1] node[0].name saturna daemon 0 arch ffc90200
[xserve03.local:40708] [[14551,0],1] node[1].name xserve03 daemon 1 arch ffc90200
[xserve03.local:40708] [[14551,0],1] node[2].name xserve04 daemon 2 arch ffc90200
[xserve03.local:40708] [[14551,0],1] orted_cmd: received add_local_procs
[saturna.cluster:07360] defining message event: base/plm_base_launch_support.c 668
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch reissuing non-blocking recv
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch from daemon [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],0] from daemon [[14551,0],1]: pid 40710 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],2] from daemon [[14551,0],1]: pid 40711 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],4] from daemon [[14551,0],1]: pid 40712 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],6] from daemon [[14551,0],1]: pid 40713 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],8] from daemon [[14551,0],1]: pid 40714 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],10] from daemon [[14551,0],1]: pid 40715 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],12] from daemon [[14551,0],1]: pid 40716 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],14] from daemon [[14551,0],1]: pid 40717 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch completed processing
[saturna.cluster:07360] defining message event: base/plm_base_launch_support.c 668
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch reissuing non-blocking recv
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch from daemon [[14551,0],2]
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],1] from daemon [[14551,0],2]: pid 40452 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],3] from daemon [[14551,0],2]: pid 40453 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],5] from daemon [[14551,0],2]: pid 40454 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],7] from daemon [[14551,0],2]: pid 40455 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],9] from daemon [[14551,0],2]: pid 40456 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],11] from daemon [[14551,0],2]: pid 40457 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],13] from daemon [[14551,0],2]: pid 40458 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launched for proc [[14551,1],15] from daemon [[14551,0],2]: pid 40459 state 2 exit 0
[saturna.cluster:07360] [[14551,0],0] plm:base:app_report_launch completed processing
[saturna.cluster:07360] [[14551,0],0] plm:base:report_launched all apps reported
[saturna.cluster:07360] [[14551,0],0] plm:base:launch wiring up iof
[saturna.cluster:07360] [[14551,0],0] plm:base:launch completed for job [14551,1]
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],0]
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],2]
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],4]
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],3]
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],1]
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40710] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40711] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40712] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40453] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],7]
[xserve04.local:40452] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],6]
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],5]
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],8]
[xserve04.local:40455] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40713] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40454] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40714] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],9]
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],10]
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],12]
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[saturna.cluster:07360] defining message event: base/routed_base_receive.c 153
[xserve03.local:40708] [[14551,0],1] orted_recv: received sync+nidmap from local proc [[14551,1],14]
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],11]
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],15]
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[xserve04.local:40450] [[14551,0],2] orted_recv: received sync+nidmap from local proc [[14551,1],13]
[saturna.cluster:07360] defining message event: base/routed_base_receive.c 153
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40715] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40716] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve03.local:40717] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40456] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve04.local:40457] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40459] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[saturna.cluster:07360] defining message event: iof_hnp_receive.c 227
[xserve04.local:40458] mca: base: component_find: rcache "mca_rcache_rb" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[saturna.cluster:07360] [[14551,0],0] orted_recv_cmd: received message from [[14551,0],1]
[saturna.cluster:07360] defining message event: orted/orted_comm.c 159
[xserve03.local:40708] [[14551,0],1] orted_cmd: received collective data cmd
[saturna.cluster:07360] [[14551,0],0] orted_recv_cmd: reissued recv
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor called by [[14551,0],1] for tag 1
[saturna.cluster:07360] [[14551,0],0] orted_cmd: received collective data cmd
[saturna.cluster:07360] [[14551,0],0] odls: daemon collective called
[saturna.cluster:07360] [[14551,0],0] odls: daemon collective for job [14551,1] from [[14551,0],1] type 2 num_collected 1 num_participating 2 num_contributors 8
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor: processing commands completed
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[saturna.cluster:07360] [[14551,0],0] orted_recv_cmd: received message from [[14551,0],2]
[xserve04.local:40450] [[14551,0],2] orted_cmd: received collective data cmd
[saturna.cluster:07360] defining message event: orted/orted_comm.c 159
[saturna.cluster:07360] [[14551,0],0] orted_recv_cmd: reissued recv
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor called by [[14551,0],2] for tag 1
[saturna.cluster:07360] [[14551,0],0] orted_cmd: received collective data cmd
[saturna.cluster:07360] [[14551,0],0] odls: daemon collective called
[saturna.cluster:07360] [[14551,0],0] odls: daemon collective for job [14551,1] from [[14551,0],2] type 2 num_collected 2 num_participating 2 num_contributors 16
[saturna.cluster:07360] [[14551,0],0] odls: daemon collective HNP - xcasting to job [14551,1]
[saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2475
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor: processing commands completed

^C[saturna.cluster:07360] defining timer event: 0 sec 0 usec at orterun.c:1128
Killed by signal 2.
mpirun: killing job...

Killed by signal 2.
[saturna.cluster:07360] [[14551,0],0]:orterun.c(1031) updating exit status to 1
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd sending kill_local_procs cmds
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd:kill_local_procs abnormal term ordered
[saturna.cluster:07360] defining message event: base/plm_base_orted_cmds.c 276
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd:kill_local_procs sending cmd to [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd message to [[14551,0],1] sent
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd:kill_local_procs sending cmd to [[14551,0],2]
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd message to [[14551,0],2] sent
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd all messages sent
[saturna.cluster:07360] defining timeout: 0 sec 2000 usec at base/plm_base_orted_cmds.c:321
[saturna.cluster:07360] progressed_wait: base/plm_base_orted_cmds.c 324
[saturna.cluster:07360] defining timeout: 0 sec 16000 usec at orterun.c:1066
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor called by [[14551,0],0] for tag 1
[saturna.cluster:07360] [[14551,0],0] odls:kill_local_proc working on job [WILDCARD]
[saturna.cluster:07360] defining message event: base/odls_base_default_fns.c 2267
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor: processing commands completed
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed called with NULL pointer
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed job [14551,1] is not terminated
[saturna.cluster:07360] [[14551,0],0] daemon 2 failed with status 255
[saturna.cluster:07360] [[14551,0],0] plm:base:launch_failed abort in progress, ignoring report
[saturna.cluster:07360] [[14551,0],0] daemon 1 failed with status 255
[saturna.cluster:07360] [[14551,0],0] plm:base:launch_failed abort in progress, ignoring report
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got message from [[14551,0],1]
[saturna.cluster:07360] defining message event: base/plm_base_receive.c 327
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got message from [[14551,0],2]
[saturna.cluster:07360] defining message event: base/plm_base_receive.c 327
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 0 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 2 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 4 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 6 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 8 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 10 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 12 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 14 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed for job [14551,1] - num_terminated 8 num_procs 16
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed declared job [14551,1] aborted by proc [[14551,1],0] with code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed job [14551,1] is not terminated
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for job [14551,1]
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 1 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 3 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 5 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 7 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 9 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 11 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 13 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:receive got update_proc_state for vpid 15 state 400 exit_code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed for job [14551,1] - num_terminated 16 num_procs 16
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed declared job [14551,1] aborted by proc [[14551,1],0] with code 0
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed all jobs terminated - waking up
[saturna.cluster:07360] [[14551,0],0] calling job_complete trigger
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 40710 on node xserve03 exited on signal 0 (Signal 0).
--------------------------------------------------------------------------
16 total processes killed (some possibly by mpirun during cleanup)
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd sending orted_exit commands
[saturna.cluster:07360] [[14551,0],0] plm:base:orted_cmd:orted_exit abnormal term ordered
[saturna.cluster:07360] defining message event: base/plm_base_orted_cmds.c 142
[saturna.cluster:07360] defining timeout: 0 sec 0 usec at base/plm_base_orted_cmds.c:186
[saturna.cluster:07360] progressed_wait: base/plm_base_orted_cmds.c 189
[saturna.cluster:07360] defining timeout: 0 sec 3000 usec at orterun.c:752
[saturna.cluster:07360] [[14551,0],0] orte:daemon:cmd:processor called by [[14551,0],0] for tag 1
[saturna.cluster:07360] [[14551,0],0] orted_cmd: received exit
[saturna.cluster:07360] [[14551,0],0] odls:kill_local_proc working on job [WILDCARD]
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed for job [14551,0] - num_terminated 3 num_procs 3
[saturna.cluster:07360] [[14551,0],0] plm:base:check_job_completed declared job [14551,0] failed to start by proc [[14551,0],1]
[saturna.cluster:07360] [[14551,0],0] calling orted_exit trigger
mpirun: clean termination accomplished

