On 11/06/07, Adrian Knoth <a...@drcomp.erfurt.thur.de> wrote:

What's the exact problem? compute-node -> frontend? I don't think you
have two processes on the frontend node, and even if you do, they should
use shared memory.

I limited the frontend node to a single process, but this had no effect
on the problem. The problem is that the processes seem unable to
communicate data to each other, although I can ssh between the machines
with no problem (I have set up passphraseless keys).
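Since tcpdump needs root, one userspace sanity check is a plain socket probe to see whether a given TCP port on the other node is reachable at all. This is only a generic sketch; the host and port below are placeholders (the 192.168.1.1:37256 endpoint is just an example taken from the debug output further down):

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # ETIMEDOUT, ECONNREFUSED, EHOSTUNREACH etc. all land here
        print(f"connect to {host}:{port} failed: {exc}")
        return False

# Example: probe the port an orted reported in its --nsreplica URI, e.g.
# can_connect("192.168.1.1", 37256)
```

If this times out in the same way the btl_tcp connect does, the problem is at the network/firewall level rather than inside Open MPI.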


Use tcpdump and/or recompile with debug enabled. In addition, set
WANT_PEER_DUMP in ompi/mca/btl/tcp/btl_tcp_endpoint.c to 1 (line 120)
and recompile, thus giving you more debug output.

Depending on your OMPI version, you can also add

mpi_preconnect_all=1

to your ~/.openmpi/mca-params.conf, by this establishing all connections
during MPI_Init().
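For reference, setting that parameter is a one-line addition to the per-user MCA parameter file (creating the directory first if it does not exist yet):

```shell
mkdir -p ~/.openmpi
echo "mpi_preconnect_all = 1" >> ~/.openmpi/mca-params.conf
```

With preconnect enabled, any connect() failures surface during MPI_Init() rather than at the first send, which makes them easier to pin down.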

I can't use tcpdump as I don't have root access, but I have made the
change to btl_tcp_endpoint.c that you mention, rebuilt Open MPI (make
distclean... ./configure --enable-debug), rebuilt the
application against the new version of Open MPI and re-ran the program.
This is the output I see (with -np 3, and only 1 slot on the
frontend):

[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08475]        universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475]        user jgu
[steinbeck.phys.ucl.ac.uk:08475]        host steinbeck.phys.ucl.ac.uk
[steinbeck.phys.ucl.ac.uk:08475]        jobid 0
[steinbeck.phys.ucl.ac.uk:08475]        procid 0
[steinbeck.phys.ucl.ac.uk:08475] procdir:
/tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/0/0
[steinbeck.phys.ucl.ac.uk:08475] jobdir:
/tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/0
[steinbeck.phys.ucl.ac.uk:08475] unidir:
/tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475] top:
openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0
[steinbeck.phys.ucl.ac.uk:08475] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] contact_file
/tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/universe-s
etup.txt
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] wrote setup file
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: local csh: 0, local sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: assuming same remote shell
as local shell
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: remote csh: 0, remote sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: final template argv:
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh:     /usr/bin/ssh <template>
orted --debug --debug-daemons --bootproxy 1 --name <template> --num_p
rocs 3 --vpid_start 0 --nodename <template> --universe
j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica
"0.0.0;tcp://128.40.5
.39:37256;tcp://192.168.1.1:37256" --gprreplica
"0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256"
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node frontend
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: frontend is a LOCAL node
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: changing to directory /homes/jgu
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing:
(/cluster/data/jgu/bin/orted) orted --debug --debug-daemons
--bootproxy 1 --name 0.0.1
--num_procs 3 --vpid_start 0 --nodename frontend --universe
j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica
"0.0.0;tcp://12
8.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica
"0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --set-sid
[BIBINPUTS=.::/amp/
tex// NNTPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473
HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen
SHELL=/bin/
bash HISTSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22
QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu
LD_LIBRARY_PATH=:/clust
er/data/jgu/lib:/cluster/data/jgu/lib
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=0
1;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:
*.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*
.gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT 1
00/ANSI X3.64 virtual terminal:\
       :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
       :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
       :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
       :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
       :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
       :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
       :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
       :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
       :ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\
       :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\
       :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\
       :vb=\Eg:G0:as=\E(0:ae=\E(B:\
       
:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\
       :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\
       :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\
       :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\
       :F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\
       :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\
       :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\
       :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\
       :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\
       :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km:
KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins
MAIL=/var/spool/mail/jgu PATH
=/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin
STY=1936.pts-
0.steinbeck INPUTRC=/etc/inputrc
PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8
LM_LICENSE_FILE=/homes/jgu/licenses:2600@hadry
a.phys.ucl.ac.uk:27...@steinbeck.phys.ucl.ac.uk:1...@zuserver1.star.ucl.ac.uk
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS=
.::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0
SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22
LESSOPEN=|/usr/bin/lessp
ipe.sh %s PROMPT_COMMAND=echo -ne
"\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip
G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had
rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun
OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile
OMPI_MCA_orte_debug=1 OMPI
_MCA_orte_debug_daemons=1 OMPI_MCA_seed=0]
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node node0
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: node0 is a REMOTE node
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing: (//usr/bin/ssh)
/usr/bin/ssh node0 orted --debug --debug-daemons --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename node0 --universe
j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica
"0.0.0;tcp://
128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica
"0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256"
[BIBINPUTS=.::/amp/tex// NN
TPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473
HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen
SHELL=/bin/bash HIS
TSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22
QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu
LD_LIBRARY_PATH=:/cluster/data/
jgu/lib:/cluster/data/jgu/lib
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;
41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01
;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01;
35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35:
SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT
100/ANSI
X3.64 virtual terminal:\
       :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
       :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
       :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
       :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
       :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
       :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
       :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
       :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
       :ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\
       :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\
       :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\
       :vb=\Eg:G0:as=\E(0:ae=\E(B:\
       
:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\
       :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\
       :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\
       :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\
       :F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\
       :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\
       :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\
       :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\
       :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\
       :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km:
KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins
MAIL=/var/spool/mail/jgu PATH
=/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin
STY=1936.pts-
0.steinbeck INPUTRC=/etc/inputrc
PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8
LM_LICENSE_FILE=/homes/jgu/licenses:2600@hadry
a.phys.ucl.ac.uk:27...@steinbeck.phys.ucl.ac.uk:1...@zuserver1.star.ucl.ac.uk
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS=
.::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0
SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22
LESSOPEN=|/usr/bin/lessp
ipe.sh %s PROMPT_COMMAND=echo -ne
"\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip
G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had
rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun
OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile
OMPI_MCA_orte_debug=1 OMPI
_MCA_orte_debug_daemons=1 OMPI_MCA_seed=0]
[steinbeck.phys.ucl.ac.uk:08476] [0,0,1] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08476]        universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08476]        user jgu
[steinbeck.phys.ucl.ac.uk:08476]        host frontend
[steinbeck.phys.ucl.ac.uk:08476]        jobid 0
[steinbeck.phys.ucl.ac.uk:08476]        procid 1
[steinbeck.phys.ucl.ac.uk:08476] procdir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/0/1
[steinbeck.phys.ucl.ac.uk:08476] jobdir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/0
[steinbeck.phys.ucl.ac.uk:08476] unidir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08476] top: openmpi-sessions-jgu@frontend_0
[steinbeck.phys.ucl.ac.uk:08476] tmp: /tmp
Daemon [0,0,1] checking in as pid 8476 on host frontend
[node0.cluster:08628] [0,0,2] setting up session dir with
[node0.cluster:08628]   universe default-universe-8475
[node0.cluster:08628]   user jgu
[node0.cluster:08628]   host node0
[node0.cluster:08628]   jobid 0
[node0.cluster:08628]   procid 2
[node0.cluster:08628] procdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/0/2
[node0.cluster:08628] jobdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/0
[node0.cluster:08628] unidir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08628] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08628] tmp: /tmp
Daemon [0,0,2] checking in as pid 8628 on host node0
[steinbeck.phys.ucl.ac.uk:08476] [0,0,1] orted: received launch callback
[node0.cluster:08628] [0,0,2] orted: received launch callback
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08478]        universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478]        user jgu
[steinbeck.phys.ucl.ac.uk:08478]        host frontend
[steinbeck.phys.ucl.ac.uk:08478]        jobid 1
[steinbeck.phys.ucl.ac.uk:08478]        procid 0
[steinbeck.phys.ucl.ac.uk:08478] procdir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/1/0
[steinbeck.phys.ucl.ac.uk:08478] jobdir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/1
[steinbeck.phys.ucl.ac.uk:08478] unidir:
/tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] top: openmpi-sessions-jgu@frontend_0
[steinbeck.phys.ucl.ac.uk:08478] tmp: /tmp
[node0.cluster:08650] [0,1,1] setting up session dir with
[node0.cluster:08650]   universe default-universe-8475
[node0.cluster:08650]   user jgu
[node0.cluster:08650]   host node0
[node0.cluster:08650]   jobid 1
[node0.cluster:08650]   procid 1
[node0.cluster:08650] procdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1/1
[node0.cluster:08650] jobdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1
[node0.cluster:08650] unidir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08650] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08650] tmp: /tmp
[node0.cluster:08651] [0,1,2] setting up session dir with
[node0.cluster:08651]   universe default-universe-8475
[node0.cluster:08651]   user jgu
[node0.cluster:08651]   host node0
[node0.cluster:08651]   jobid 1
[node0.cluster:08651]   procid 2
[node0.cluster:08651] procdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1/2
[node0.cluster:08651] jobdir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1
[node0.cluster:08651] unidir:
/tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08651] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08651] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] spawn: in job_state_callback(jobid =
1, state = 0x4)
[steinbeck.phys.ucl.ac.uk:08475] Info: Setting up debugger process
table for applications
 MPIR_being_debugged = 0
 MPIR_debug_gate = 0
 MPIR_debug_state = 1
 MPIR_acquired_pre_main = 0
 MPIR_i_am_starter = 0
 MPIR_proctable_size = 3
 MPIR_proctable:
   (i, host, exe, pid) = (0, frontend,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8478)
   (i, host, exe, pid) = (1, node0,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8650)
   (i, host, exe, pid) = (2, node0,
/export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8651)
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] ompi_mpi_init completed
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=110
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=110
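For what it's worth, errno 110 on Linux is ETIMEDOUT ("Connection timed out"), which usually points at a firewall or an unreachable interface/route between the nodes rather than an authentication problem. A quick way to decode it:

```python
import errno
import os

# Decode the errno value reported by btl_tcp_endpoint.c (Linux values).
print(errno.errorcode[110], "-", os.strerror(110))
# → ETIMEDOUT - Connection timed out
```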
