Hello. I'm trying to use Open MPI 1.2.3 on a cluster of dual-processor AMD64 nodes. The nodes are all connected via gigabit ethernet on a private, self-contained IP network. The OS is GNU/Linux, gcc 4.1.2, kernel 2.6.21. Open MPI was configured with --prefix=/usr/local and installed via make install; compilation and installation both completed successfully.

I have verified that non-interactive logins have /usr/local/bin in the PATH, and that ld.so.conf has an entry for the Open MPI lib directory (and ld.so.cache is up-to-date). This is a more-or-less "vanilla" installation, without any external schedulers or resource managers.
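For concreteness, the verification above was done with checks along these lines (node1.x86-64 is one of the compute nodes; the orted line is an extra check I'd expect to pass given the PATH result, since mpirun starts that daemon on each remote node via ssh):

headnode $ ssh node1.x86-64 'echo $PATH'              # non-interactive PATH on the remote node
headnode $ ssh node1.x86-64 which orted               # Open MPI's launch daemon must be on that PATH
headnode $ ssh node1.x86-64 ldconfig -p | grep libmpi # remote ld.so.cache knows the Open MPI libs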
I am simply trying to test Open MPI for the first time (we previously used LAM), and trying to do so via trivial system executables like "env". The problem is this: if I invoke mpirun such that it needs to launch on nodes other than the one I'm invoking it on, it seems to launch and then hang. Ctrl+C yields an "mpirun: killing job..." message, but the job never dies; I have to suspend it and use kill -9. If I invoke mpirun on the host I'm logged into (any node in the hostfile), without any host specification or hostfile, it works fine, i.e. the job runs on the local machine.

My mpirun hostfile contains entries like:

node1.x86-64 slots=2 max_slots=2

So, for example, if I do:

headnode $ mpirun -hostfile runnodes.txt -np 1 env

where runnodes.txt does not contain any entry for the headnode, then mpirun hangs as described above. I have verified that the following works fine:

headnode $ ssh node1.x86-64 env

Even using mpirun -v, I can't seem to find a command-line option that would give me the diagnostic information to figure out where mpirun gets stuck, what it has done up to that point, etc. How can I figure out what's going wrong? Is there a way to verbosely report the actions taken so far? (The closest things I've found so far are sketched at the end of this message.)

These machines have multiple onboard ethernet interfaces, with only one configured to communicate with the cluster, but even adding the "--mca btl_tcp_if_include eth1" argument to mpirun makes no difference.

The only potential culprit I could come up with is our name resolution setup. All name resolution is done via /etc/hosts; no DNS server is present. However, the cluster actually contains machines of two different architectures, and we wanted nodes to be named node<#>.<arch>, where # goes from 1 to N and example archs would be x86-64 or alpha. To make this work, the init scripts on the machines set the hostname to the fully-qualified node name, e.g. node1.x86-64.cluster, rather than the typical practice of using just the name preceding the first dot. In /etc/resolv.conf, the "domain" keyword is set to <arch>.<TLD>, e.g. x86-64.cluster. The /etc/hosts entries do contain the node names in the node<#>.<arch> format as well as the fully-qualified versions. So, other than setting the hostname to the fully-qualified value, this is a fairly typical GNU/Linux setup. We used the same practice with LAM and it never posed a problem, but I thought I'd mention it just in case.
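For reference, below are the diagnostic invocations I've been experimenting with. The -d (--debug-devel) and --debug-daemons flags come from mpirun --help; the ompi_info lines are how I've been hunting for relevant parameters; and oob_tcp_include is a guess on my part, by analogy with btl_tcp_if_include, on the theory that mpirun's own daemon wire-up also has to pick one of the several interfaces:

# run with developer debugging and keep the remote daemons' output attached
headnode $ mpirun -d --debug-daemons -hostfile runnodes.txt -np 1 env

# list the parameters of the rsh/ssh launcher and of the TCP out-of-band channel
headnode $ ompi_info --param pls rsh
headnode $ ompi_info --param oob tcp

# restrict both MPI traffic and the out-of-band channel to the cluster interface
headnode $ mpirun --mca btl_tcp_if_include eth1 --mca oob_tcp_include eth1 \
      -hostfile runnodes.txt -np 1 env

If any of these is the wrong tool for the job, pointers would be appreciated.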