On 11/06/07, Adrian Knoth <a...@drcomp.erfurt.thur.de> wrote:
What's the exact problem? compute-node -> frontend? I don't think you have two processes on the frontend node, and even if you do, they should use shared memory.
I made sure that no more than a single process runs on the frontend node, and this had no effect on the problem. The problem is that the processes seem unable to communicate data to each other, although I can ssh between the machines without any trouble (I have set up passphraseless keys).
Use tcpdump and/or recompile with debug enabled. In addition, set WANT_PEER_DUMP in ompi/mca/btl/tcp/btl_tcp_endpoint.c to 1 (line 120) and recompile, which gives you more debug output. Depending on your OMPI version, you can also add mpi_preconnect_all=1 to your ~/.openmpi/mca-params.conf, which establishes all connections during MPI_Init().
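The mca-params.conf change suggested above can be made by hand, but a small helper avoids duplicate entries when the setting is toggled repeatedly. This is only a sketch, assuming the file uses plain `name = value` lines with `#` comments; `set_mca_param` is a hypothetical helper name and not part of Open MPI.

```python
from pathlib import Path


def set_mca_param(conf_path, name, value):
    """Write `name = value` to an MCA parameter file, replacing any old setting."""
    path = Path(conf_path).expanduser()
    lines = path.read_text().splitlines() if path.exists() else []
    kept = []
    for line in lines:
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        # Drop an earlier, uncommented assignment of the same parameter.
        if stripped and not stripped.startswith("#") and key == name:
            continue
        kept.append(line)
    kept.append(f"{name} = {value}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(kept) + "\n")
    return path


# Example (not executed here):
# set_mca_param("~/.openmpi/mca-params.conf", "mpi_preconnect_all", 1)
```

Comments and unrelated parameters in the file are left untouched; only the named parameter is rewritten.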
I can't use tcpdump as I don't have root access, but I have made the change to btl_tcp_endpoint.c that you mention, rebuilt Open MPI (make distclean... ./configure --enable-debug), rebuilt the application against the new version of Open MPI and re-ran the program. This is the output I see (with -np 3, and only 1 slot on the frontend):

[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08475] universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475] user jgu
[steinbeck.phys.ucl.ac.uk:08475] host steinbeck.phys.ucl.ac.uk
[steinbeck.phys.ucl.ac.uk:08475] jobid 0
[steinbeck.phys.ucl.ac.uk:08475] procid 0
[steinbeck.phys.ucl.ac.uk:08475] procdir: /tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/0/0
[steinbeck.phys.ucl.ac.uk:08475] jobdir: /tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/0
[steinbeck.phys.ucl.ac.uk:08475] unidir: /tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08475] top: openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0
[steinbeck.phys.ucl.ac.uk:08475] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] contact_file /tmp/openmpi-sessions-...@steinbeck.phys.ucl.ac.uk_0/default-universe-8475/universe-setup.txt
[steinbeck.phys.ucl.ac.uk:08475] [0,0,0] wrote setup file
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: local csh: 0, local sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: assuming same remote shell as local shell
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: remote csh: 0, remote sh: 1
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: final template argv:
[steinbeck.phys.ucl.ac.uk:08475] pls:rsh: /usr/bin/ssh <template> orted --debug --debug-daemons --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica
"0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node frontend [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: frontend is a LOCAL node [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: changing to directory /homes/jgu [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing: (/cluster/data/jgu/bin/orted) orted --debug --debug-daemons --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename frontend --universe j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica "0.0.0;tcp://12 8.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" --set-sid [BIBINPUTS=.::/amp/ tex// NNTPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473 HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen SHELL=/bin/ bash HISTSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22 QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu LD_LIBRARY_PATH=:/clust er/data/jgu/lib:/cluster/data/jgu/lib LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=0 1;05;37;41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31: *.taz=01;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:* .gif=01;35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35: SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT 1 00/ANSI X3.64 virtual terminal:\ :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\ :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\ :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\ :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\ :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\ :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\ :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\ :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\ 
:ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\ :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\ :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\ :vb=\Eg:G0:as=\E(0:ae=\E(B:\ :ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\ :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\ :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\ :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\ :F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\ :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\ :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\ :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\ :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\ :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km: KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins MAIL=/var/spool/mail/jgu PATH =/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin STY=1936.pts- 0.steinbeck INPUTRC=/etc/inputrc PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8 LM_LICENSE_FILE=/homes/jgu/licenses:2600@hadry a.phys.ucl.ac.uk:27...@steinbeck.phys.ucl.ac.uk:1...@zuserver1.star.ucl.ac.uk SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS= .::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0 SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22 LESSOPEN=|/usr/bin/lessp ipe.sh %s PROMPT_COMMAND=echo -ne "\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile OMPI_MCA_orte_debug=1 OMPI _MCA_orte_debug_daemons=1 OMPI_MCA_seed=0] [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: launching on node node0 [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: node0 is a REMOTE node [steinbeck.phys.ucl.ac.uk:08475] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh node0 orted --debug --debug-daemons --bootproxy 1 --name 0.0.2 
--num_procs 3 --vpid_start 0 --nodename node0 --universe j...@steinbeck.phys.ucl.ac.uk:default-universe-8475 --nsreplica "0.0.0;tcp:// 128.40.5.39:37256;tcp://192.168.1.1:37256" --gprreplica "0.0.0;tcp://128.40.5.39:37256;tcp://192.168.1.1:37256" [BIBINPUTS=.::/amp/tex// NN TPSERVER=nntp-server.ucl.ac.uk SSH_AGENT_PID=8473 HOSTNAME=steinbeck.phys.ucl.ac.uk BSTINPUTS=.::/amp/tex// TERM=screen SHELL=/bin/bash HIS TSIZE=1000 TMPDIR=/tmp SSH_CLIENT=128.40.5.249 55312 22 QTDIR=/usr/lib64/qt-3.3 SSH_TTY=/dev/pts/0 USER=jgu LD_LIBRARY_PATH=:/cluster/data/ jgu/lib:/cluster/data/jgu/lib LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37; 41:ex=01;32:*.cmd=01;32:*.exe=01;32:*.com=01;32:*.btm=01;32:*.bat=01;32:*.sh=01;32:*.csh=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01 ;31:*.lzh=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tz=01;31:*.rpm=01;31:*.cpio=01;31:*.jpg=01;35:*.gif=01; 35:*.bmp=01;35:*.xbm=01;35:*.xpm=01;35:*.png=01;35:*.tif=01;35: SSH_AUTH_SOCK=/tmp/ssh-KjHUoC8472/agent.8472 TERMCAP=SC|screen|VT 100/ANSI X3.64 virtual terminal:\ :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\ :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\ :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\ :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\ :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\ :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\ :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\ :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\ :ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\ :se=\E[23m:mb=\E[5m:md=\E[1m:mr=\E[7m:me=\E[m:ms:\ :Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\ :vb=\Eg:G0:as=\E(0:ae=\E(B:\ :ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~..--++,,hhII00:\ :po=\E[5i:pf=\E[4i:Z0=\E[?3h:Z1=\E[?3l:k0=\E[10~:\ :k1=\EOP:k2=\EOQ:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:\ :k7=\E[18~:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:\ 
:F2=\E[24~:F3=\EO2P:F4=\EO2Q:F5=\EO2R:F6=\EO2S:\ :F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=^H:\ :K2=\EOE:kB=\E[Z:*4=\E[3;2~:*7=\E[1;2F:#2=\E[1;2H:\ :#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:%e=\E[5;2~:%i=\E[1;2C:\ :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\ :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:km: KDEDIR=/usr MOZ_PLUGIN_PATH=/usr/local/plugins MAIL=/var/spool/mail/jgu PATH =/usr/kerberos/bin:/usr/local/bin64:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/cluster/data/jgu/bin:/cluster/data/jgu/bin STY=1936.pts- 0.steinbeck INPUTRC=/etc/inputrc PWD=/cluster/data/jgu/wrk/ethene_hhg_align LANG=en_GB.UTF-8 LM_LICENSE_FILE=/homes/jgu/licenses:2600@hadry a.phys.ucl.ac.uk:27...@steinbeck.phys.ucl.ac.uk:1...@zuserver1.star.ucl.ac.uk SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass TEXINPUTS= .::/amp/tex/styles// SHLVL=3 HOME=/homes/jgu LOGNAME=jgu WINDOW=0 SSH_CONNECTION=128.40.5.249 55312 128.40.5.39 22 LESSOPEN=|/usr/bin/lessp ipe.sh %s PROMPT_COMMAND=echo -ne "\033_${USER}@${HOSTNAME%%.*}:${PWD/#$HOME/~}\033\\" GLOBAL=skip G_BROKEN_FILENAMES=1 NAG_KUSARI_FILE=had rya.phys.ucl.ac.uk:7733 _=/cluster/data/jgu/bin/mpirun OMPI_MCA_rds_hostfile_path=/cluster/data/jgu/etc/hostfile OMPI_MCA_orte_debug=1 OMPI _MCA_orte_debug_daemons=1 OMPI_MCA_seed=0] [steinbeck.phys.ucl.ac.uk:08476] [0,0,1] setting up session dir with [steinbeck.phys.ucl.ac.uk:08476] universe default-universe-8475 [steinbeck.phys.ucl.ac.uk:08476] user jgu [steinbeck.phys.ucl.ac.uk:08476] host frontend [steinbeck.phys.ucl.ac.uk:08476] jobid 0 [steinbeck.phys.ucl.ac.uk:08476] procid 1 [steinbeck.phys.ucl.ac.uk:08476] procdir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/0/1 [steinbeck.phys.ucl.ac.uk:08476] jobdir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/0 [steinbeck.phys.ucl.ac.uk:08476] unidir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475 [steinbeck.phys.ucl.ac.uk:08476] top: openmpi-sessions-jgu@frontend_0 
[steinbeck.phys.ucl.ac.uk:08476] tmp: /tmp
Daemon [0,0,1] checking in as pid 8476 on host frontend
[node0.cluster:08628] [0,0,2] setting up session dir with
[node0.cluster:08628] universe default-universe-8475
[node0.cluster:08628] user jgu
[node0.cluster:08628] host node0
[node0.cluster:08628] jobid 0
[node0.cluster:08628] procid 2
[node0.cluster:08628] procdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/0/2
[node0.cluster:08628] jobdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/0
[node0.cluster:08628] unidir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08628] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08628] tmp: /tmp
Daemon [0,0,2] checking in as pid 8628 on host node0
[steinbeck.phys.ucl.ac.uk:08476] [0,0,1] orted: received launch callback
[node0.cluster:08628] [0,0,2] orted: received launch callback
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] setting up session dir with
[steinbeck.phys.ucl.ac.uk:08478] universe default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] user jgu
[steinbeck.phys.ucl.ac.uk:08478] host frontend
[steinbeck.phys.ucl.ac.uk:08478] jobid 1
[steinbeck.phys.ucl.ac.uk:08478] procid 0
[steinbeck.phys.ucl.ac.uk:08478] procdir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/1/0
[steinbeck.phys.ucl.ac.uk:08478] jobdir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475/1
[steinbeck.phys.ucl.ac.uk:08478] unidir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] top: openmpi-sessions-jgu@frontend_0
[steinbeck.phys.ucl.ac.uk:08478] tmp: /tmp
[node0.cluster:08650] [0,1,1] setting up session dir with
[node0.cluster:08650] universe default-universe-8475
[node0.cluster:08650] user jgu
[node0.cluster:08650] host node0
[node0.cluster:08650] jobid 1
[steinbeck.phys.ucl.ac.uk:08478] unidir: /tmp/openmpi-sessions-jgu@frontend_0/default-universe-8475
[steinbeck.phys.ucl.ac.uk:08478] top: openmpi-sessions-jgu@frontend_0
[steinbeck.phys.ucl.ac.uk:08478] tmp: /tmp
[node0.cluster:08650] [0,1,1] setting up session dir with
[node0.cluster:08650] universe default-universe-8475
[node0.cluster:08650] user jgu
[node0.cluster:08650] host node0
[node0.cluster:08650] jobid 1
[node0.cluster:08650] procid 1
[node0.cluster:08650] procdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1/1
[node0.cluster:08650] jobdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1
[node0.cluster:08650] unidir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08650] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08650] tmp: /tmp
[node0.cluster:08651] [0,1,2] setting up session dir with
[node0.cluster:08651] universe default-universe-8475
[node0.cluster:08651] user jgu
[node0.cluster:08651] host node0
[node0.cluster:08651] jobid 1
[node0.cluster:08651] procid 2
[node0.cluster:08651] procdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1/2
[node0.cluster:08651] jobdir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475/1
[node0.cluster:08651] unidir: /tmp/openmpi-sessions-jgu@node0_0/default-universe-8475
[node0.cluster:08651] top: openmpi-sessions-jgu@node0_0
[node0.cluster:08651] tmp: /tmp
[steinbeck.phys.ucl.ac.uk:08475] spawn: in job_state_callback(jobid = 1, state = 0x4)
[steinbeck.phys.ucl.ac.uk:08475] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 3
  MPIR_proctable:
    (i, host, exe, pid) = (0, frontend, /export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8478)
    (i, host, exe, pid) = (1, node0, /export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8650)
    (i, host, exe, pid) = (2, node0, /export/data/jgu/wrk/ethene_hhg_align/align-cls-mpi, 8651)
[steinbeck.phys.ucl.ac.uk:08478] [0,1,0] ompi_mpi_init completed
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110
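
On Linux, errno 110 is ETIMEDOUT, so the connect() attempts got no answer at all rather than being actively refused. That usually means packets are being dropped (for example by a firewall) or that the address being tried is not routable from the peer. Since tcpdump needs root, one way to narrow this down from user space is a plain TCP connect test between the nodes. The following is a minimal sketch, not anything Open MPI provides: `probe_tcp` is a hypothetical helper, and the host name and port in the usage line are placeholders for a listener you start yourself on the other machine.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try a TCP connect; return 'ok' or the symbolic errno name of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No reply within `timeout` seconds: same symptom as errno 110.
        return "ETIMEDOUT"
    except OSError as exc:
        # For example ECONNREFUSED (nothing listening) or EHOSTUNREACH.
        return errno.errorcode.get(exc.errno, str(exc))


# Example (not executed here), run on the frontend while something
# listens on port 5000 of node0:
# print(probe_tcp("node0", 5000))
```

A result of ECONNREFUSED would at least show that packets reach the other machine, whereas ETIMEDOUT on an unprivileged port while ssh on port 22 works would suggest filtering of the other ports or a problem with the interface or address being used.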