I am having a problem on my Linux desktop where MPI_Init hangs for
approximately 64 seconds if my VPN client is connected but returns immediately
if I disconnect the VPN. I've picked through the FAQ and Google but have failed
to come up with a solution.

Some potentially relevant information: I am using Open MPI 1.4.3 under Ubuntu
12.04.1 with the Cisco AnyConnect VPN Client. (I have also downloaded Open MPI
1.6.4 and built it from source, but I believe it behaves the same way.)

Some potentially irrelevant information: I believe SSH tunneling is disabled by
the VPN. While the VPN is connected, ifconfig shows an extra interface
(cscotun0, with inet addr:10.248.17.27), and that address shows up in the
contact.txt file:

wt217:~/wrk/mpi> cat /tmp/openmpi-sessions-dab143@wt217_0/29142/contact.txt
1909850112.0;tcp://192.168.1.3:48237;tcp://10.248.17.27:48237
22001
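
To check programmatically which addresses end up in contact.txt, something like the following could be used. This is only a minimal sketch: the helper name contact_tcp_endpoint is hypothetical (it is not part of Open MPI), and the line format (";"-separated "tcp://host:port" entries) is assumed purely from the output shown above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): copy the host part of the
 * n-th "tcp://host:port" entry of a contact.txt URI line into out.
 * Returns the port on success, or -1 if there is no such entry or the
 * entry does not look like host:port. The format is an assumption based
 * only on the sample line shown in this message. */
int contact_tcp_endpoint(const char *line, int n, char *out, size_t outlen)
{
    const char *p = line;
    int idx = 0;

    while ((p = strstr(p, "tcp://")) != NULL) {
        p += 6;                                  /* skip "tcp://" */
        if (idx++ == n) {
            size_t hostlen = strcspn(p, ":;\n"); /* host ends at ':' */
            int port = -1;

            if (hostlen == 0 || hostlen >= outlen || p[hostlen] != ':')
                return -1;
            memcpy(out, p, hostlen);
            out[hostlen] = '\0';
            if (sscanf(p + hostlen + 1, "%d", &port) != 1)
                return -1;
            return port;
        }
    }
    return -1;
}
```

Calling it with n = 0 and n = 1 on the line above should give the 192.168.1.3 and 10.248.17.27 endpoints respectively, which makes it easy to see whether the cscotun0 address is still being advertised.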

The code is simply

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

I compile it using "mpicc -g mpi_hello.c -o mpi_hello" and execute it using
"mpirun -d -v ./mpi_hello". (The problem occurs whether or not I ask for more
than one process.) With verbosity on, I get the following output:

wt217:~/wrk/mpi> mpirun -d -v ./mpi_hello
[wt217:22015] procdir: /tmp/openmpi-sessions-dab143@wt217_0/29144/0/0
[wt217:22015] jobdir: /tmp/openmpi-sessions-dab143@wt217_0/29144/0
[wt217:22015] top: openmpi-sessions-dab143@wt217_0
[wt217:22015] tmp: /tmp
[wt217:22015] [[29144,0],0] node[0].name wt217 daemon 0 arch ffc91200
[wt217:22015] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 1
  MPIR_proctable:
    (i, host, exe, pid) = (0, wt217, /home/dab143/wrk/mpi/./mpi_hello, 22016)
[wt217:22016] procdir: /tmp/openmpi-sessions-dab143@wt217_0/29144/1/0
[wt217:22016] jobdir: /tmp/openmpi-sessions-dab143@wt217_0/29144/1
[wt217:22016] top: openmpi-sessions-dab143@wt217_0
[wt217:22016] tmp: /tmp
<hangs for approximately 64 seconds>
[wt217:22016] [[29144,1],0] node[0].name wt217 daemon 0 arch ffc91200
[wt217:22016] sess_dir_finalize: proc session dir not empty - leaving
[wt217:22015] sess_dir_finalize: proc session dir not empty - leaving
[wt217:22015] sess_dir_finalize: job session dir not empty - leaving
[wt217:22015] sess_dir_finalize: proc session dir not empty - leaving
orterun: exiting with status 0

The code hangs for approximately 64 seconds after the line that reads "tmp:
/tmp".
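
To put a number on the pause from inside the program, a wall-clock timer around MPI_Init could be used. The sketch below is an assumption-laden illustration, not something from the Open MPI tooling: elapsed_seconds and time_call are hypothetical helper names, and it relies on POSIX clock_gettime because, as far as I know, the MPI standard does not guarantee that MPI_Wtime is usable before MPI_Init has returned.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime on strict compilers */
#endif
#include <time.h>

/* Hypothetical helper: seconds elapsed between two clock samples. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: run fn(arg) and return how long it blocked, in
 * seconds, using the monotonic clock so that wall-clock adjustments
 * during the call do not distort the measurement. */
double time_call(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_seconds(t0, t1);
}
```

In the test program, fn would be a small wrapper that calls MPI_Init(&argc, &argv), and the returned value could be printed with and without the VPN connected to compare the two cases.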

If I attach gdb to the process during this time, the stack trace (attached)
shows that the pause is in __GI___poll in /sysdeps/unix/sysv/linux/poll.c:83.

If I add "-mca oob_tcp_if_exclude cscotun0", the address for the VPN interface
no longer shows up in contact.txt, but the problem remains. I also added
"-mca btl ^cscotun0 -mca btl_tcp_if_exclude cscotun0", with no effect.

Any idea what is hanging this up, or how I can get more information about what
is going on during the pause? I assume connecting the VPN has caused MPI_Init
to look for something that isn't available and that eventually times out, but I
don't know what.

Output from ompi_info and the gdb stack trace are attached.

Thanks,
David

Attachment: stack.txt.bz2
Description: Binary data

Attachment: ompi_info.txt.bz2
Description: Binary data
