Greetings, I have been testing OpenMPI 1.0rc3 on a rack of eight two-processor (single-core) Opteron systems connected via both Gigabit Ethernet and Myrinet. My testing has been mostly successful, although I have run into a recurring issue with a few MPI applications. The symptom is that the computation progresses nearly to completion and then hangs without ever terminating. One code that demonstrates this is the Tachyon parallel raytracer, available at:
http://jedi.ks.uiuc.edu/~johns/raytracer/

I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the root cause of this particular problem. I have attached the output of config.log to this message.

Here is the output from ompi_info:

Open MPI: 1.0rc3r7730
Open MPI SVN revision: r7730
Open RTE: 1.0rc3r7730
Open RTE SVN revision: r7730
OPAL: 1.0rc3r7730
OPAL SVN revision: r7730
Prefix: /opt/openmpi-1.0rc3-pgi-6.0
Configured architecture: x86_64-unknown-linux-gnu
Configured by: root
Configured on: Mon Oct 17 10:10:28 PDT 2005
Configure host: castor00
Built by: root
Built on: Mon Oct 17 10:29:20 PDT 2005
Built host: castor00
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
C compiler: pgcc
C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
C++ compiler: pgCC
C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
Fortran77 compiler: pgf77
Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
Fortran90 compiler: pgf90
Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: 1
MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)

Here is the command line I am using to invoke OpenMPI for my build of Tachyon:

/opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat

Attaching gdb to one of the hung processes, I get the following stack trace:

(gdb) bt
#0  0x0000002a95d6b87d in opal_sys_timer_get_cycles () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#1  0x0000002a95d83509 in opal_timer_base_get_cycles () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#2  0x0000002a95d8370c in opal_progress () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#3  0x0000002a95a6d8a5 in opal_condition_wait () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#4  0x0000002a95a6de49 in ompi_request_wait_all () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#5  0x0000002a95937602 in PMPI_Waitall () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60) at parallel.c:229
#7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)

Based on this stack trace, it appears that the application is hanging in an MPI_Waitall call. Does anyone have any ideas as to why this might be happening? If this is covered in the FAQ somewhere, then please accept my apologies in advance.

Many thanks,

+chris

--
Chris Parrott                          5204 E. Ben White Blvd., M/S 628
Product Development Engineer           Austin, TX 78741
Computational Products Group           (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices                 chris.parr...@amd.com
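
P.S. In case it helps to frame the question: the rt_waitscanlines() frame in the backtrace suggests the usual pattern of posting nonblocking receives for rendered scanlines and then blocking in a single MPI_Waitall. The sketch below is purely hypothetical (the function and variable names are mine, not Tachyon's); it is only meant to show the shape of the call that is blocking. If any matching message is never delivered, the MPI_Waitall would spin in opal_progress() exactly as the trace shows.

/* Hypothetical sketch only -- not Tachyon's actual code.  It just
 * illustrates the post-nonblocking-receives-then-MPI_Waitall pattern
 * that the rt_waitscanlines() frame suggests: rank 0 posts one
 * MPI_Irecv per worker and then blocks until every request completes. */
#include <mpi.h>
#include <stdlib.h>

/* Stand-in for something like rt_waitscanlines(): block until all
 * outstanding receives have completed. */
static void wait_for_scanlines(int nrequests, MPI_Request *reqs)
{
    MPI_Waitall(nrequests, reqs, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, nprocs, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        int nworkers = nprocs - 1;
        MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
        int *results = malloc(nworkers * sizeof(int));

        /* Post one nonblocking receive per worker. */
        for (i = 0; i < nworkers; i++)
            MPI_Irecv(&results[i], 1, MPI_INT, i + 1, 0,
                      MPI_COMM_WORLD, &reqs[i]);

        /* The real application hangs in a call like this one: if any
         * worker's message is never delivered, MPI_Waitall never returns. */
        wait_for_scanlines(nworkers, reqs);

        free(reqs);
        free(results);
    } else {
        int scanline = rank;  /* placeholder for real scanline data */
        MPI_Send(&scanline, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}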
Attachment: config.log.bz2