tachyon.a04.aist.go.jp
tachyon.a04.aist.go.jp
gehirn.local
gehirn.local
(the .local name uses zeroconf to find the address of gehirn, and it works). When I run a parallel job on my own machine (-np 2), everything is fine: the job runs in parallel, it is faster, and the output is correct. When I try running with -np 4 to use an additional dual-CPU G5 machine, my job hangs while consuming large amounts of CPU (runaway processes). This continues without output until I interrupt the job with ^C, which terminates the processes on all machines. I am running the job via ssh using an ssh-agent.

Does anyone have any idea what could be wrong? I have attached my config.log and ompi_info files (bzip2'ed) to this mail, as specified in the mailing list instructions. I am guessing this should be a simple thing, but it is taking too much time to figure out on my own (e.g. I couldn't find a FAQ entry or a user question/reply that answered this).

Paul Fons
Attachments: config.log.bz2, ompi_info.log.bz2

Script started on Tue Sep 5 16:01:18 2006
[tachyon:exafs/feff85/zno] paulfons% mpiexec -machinefile machinefile -np 2 hostname
tachyon.a04.aist.go.jp
tachyon.a04.aist.go.jp
[tachyon:exafs/feff85/zno] paulfons% mpiexec -machinefile machinefile -np 2 /opt/feff/feff85/rdinp
Number of processors = 2
Feff 8.40
XANES:
name: zincite ZnO
formula: ZnO
sites: Zn1,O1
refer1: wyckoff, vol 1, ch III, p 111
refer2:
schoen:
notes1:
[tachyon:exafs/feff85/zno] paulfons% mpiexec -machinefile machinefile -np 2 hostname
tachyon.a04.aist.go.jp
tachyon.a04.aist.go.jp
dhcp054092.a04.aist.go.jp
dhcp054092.a04.aist.go.jp
[tachyon:exafs/feff85/zno] paulfons% mpiexec -machinefile machinefile -np 4 /opt/feff/feff85/rdinp
Number of processors = 4
Feff 8.40
XANES:
name: zincite ZnO
formula: ZnO
sites: Zn1,O1
refer1: wyckoff, vol 1, ch III, p 111
refer2:
schoen:
notes1:
^Cmpiexec: killing job...