Tim,

I just tried this same code again with 1.0rc4, and I still see the same symptom. The gdb stack trace for a hung process looks a bit different this time, however:
(gdb) bt
#0  0x0000002a98a085e1 in mca_bml_r2_progress () from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
#1  0x0000002a986c3080 in mca_pml_ob1_progress () from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_pml_ob1.so
#2  0x0000002a95d8378c in opal_progress () from /opt/openmpi-1.0rc4-pgi-6.0/lib/libopal.so.0
#3  0x0000002a95a6d8a5 in opal_condition_wait () from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#4  0x0000002a95a6de49 in ompi_request_wait_all () from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#5  0x0000002a95937602 in PMPI_Waitall () from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10) at parallel.c:229
#7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)

It still appears to be stuck in the MPI_Waitall call. Any ideas? If you
need any additional information from me, please let me know. (A minimal
sketch of the Irecv/Waitall collection pattern in question is included at
the end of this post.)

Thanks in advance,

+chris

--
Chris Parrott                 5204 E. Ben White Blvd., M/S 628
Product Development Engineer  Austin, TX 78741
Computational Products Group  (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices        chris.parr...@amd.com

> -----Original Message-----
> From: Tim S. Woodall [mailto:twood...@lanl.gov]
> Sent: Monday, October 17, 2005 2:01 PM
> To: Open MPI Users
> Cc: Parrott, Chris
> Subject: Re: [O-MPI users] OpenMPI hang issue
>
> Hello Chris,
>
> Please give the next release candidate a try. There was an issue w/ the
> GM port that was likely causing this.
>
> Thanks,
> Tim
>
>
> Parrott, Chris wrote:
> > Greetings,
> >
> > I have been testing OpenMPI 1.0rc3 on a rack of 8 2-processor
> > (single-core) Opteron systems connected via both Gigabit Ethernet and
> > Myrinet. My testing has been mostly successful, although I have run
> > into a recurring issue with a few MPI applications. The symptom is
> > that the computation seems to progress nearly to completion and then
> > suddenly hangs without terminating. One code that demonstrates this
> > is the Tachyon parallel raytracer, available at:
> >
> > http://jedi.ks.uiuc.edu/~johns/raytracer/
> >
> > I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the
> > root cause of this particular problem.
> >
> > I have attached the output of config.log to this message.
> > Here is the output from ompi_info:
> >
> > Open MPI: 1.0rc3r7730
> > Open MPI SVN revision: r7730
> > Open RTE: 1.0rc3r7730
> > Open RTE SVN revision: r7730
> > OPAL: 1.0rc3r7730
> > OPAL SVN revision: r7730
> > Prefix: /opt/openmpi-1.0rc3-pgi-6.0
> > Configured architecture: x86_64-unknown-linux-gnu
> > Configured by: root
> > Configured on: Mon Oct 17 10:10:28 PDT 2005
> > Configure host: castor00
> > Built by: root
> > Built on: Mon Oct 17 10:29:20 PDT 2005
> > Built host: castor00
> > C bindings: yes
> > C++ bindings: yes
> > Fortran77 bindings: yes (all)
> > Fortran90 bindings: yes
> > C compiler: pgcc
> > C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
> > C++ compiler: pgCC
> > C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
> > Fortran77 compiler: pgf77
> > Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
> > Fortran90 compiler: pgf90
> > Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
> > C profiling: yes
> > C++ profiling: yes
> > Fortran77 profiling: yes
> > Fortran90 profiling: yes
> > C++ exceptions: no
> > Thread support: posix (mpi: no, progress: no)
> > Internal debug support: no
> > MPI parameter check: runtime
> > Memory profiling support: no
> > Memory debugging support: no
> > libltdl support: 1
> > MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
> > MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
> > MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
> > MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
> > MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
> > MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> > MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
> > MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
> > MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
> > MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
> > MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
> > MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
> > MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
> >
> > Here is the command-line I am using to invoke OpenMPI for my build of
> > Tachyon:
> >
> > /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
> >
> > Attaching gdb to one of the hung processes, I get the following stack
> > trace:
> >
> > (gdb) bt
> > #0  0x0000002a95d6b87d in opal_sys_timer_get_cycles () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #1  0x0000002a95d83509 in opal_timer_base_get_cycles () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #2  0x0000002a95d8370c in opal_progress () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #3  0x0000002a95a6d8a5 in opal_condition_wait () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #4  0x0000002a95a6de49 in ompi_request_wait_all () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #5  0x0000002a95937602 in PMPI_Waitall () from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60) at parallel.c:229
> > #7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
> > #8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
> > #9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
> > (gdb)
> >
> > So based on this stack trace, it appears that the application is
> > hanging on an MPI_Waitall call for some reason.
> >
> > Does anyone have any ideas as to why this might be happening? If this
> > is covered in the FAQ somewhere, then please accept my apologies in
> > advance.
> >
> > Many thanks,
> >
> > +chris
> >
> > --
> > Chris Parrott                 5204 E. Ben White Blvd., M/S 628
> > Product Development Engineer  Austin, TX 78741
> > Computational Products Group  (512) 602-8710 / (512) 602-7745 (fax)
> > Advanced Micro Devices        chris.parr...@amd.com
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
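
Both backtraces bottom out in rt_waitscanlines() (parallel.c:229) blocked
inside MPI_Waitall while opal_progress() polls for completions. Below is a
minimal sketch of that kind of nonblocking-receive/MPI_Waitall collection
pattern, for reference only: the buffer size, tag, and message layout here
are illustrative assumptions and are not taken from Tachyon's parallel.c.

/*
 * Minimal sketch of an MPI_Irecv/MPI_Waitall collection pattern like the
 * one implicated by the traces above: rank 0 posts one nonblocking
 * receive per worker and then blocks in MPI_Waitall until every request
 * completes.  Illustrative only; not Tachyon source.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define SCANLINE_LEN 1024   /* hypothetical scanline size */
#define TAG_SCANLINE 42     /* hypothetical message tag   */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Post one nonblocking receive per worker, then wait on all of
         * them.  If any worker's message never arrives, this call spins
         * in opal_progress() beneath PMPI_Waitall, as in the traces. */
        int nworkers = nprocs - 1;
        float *lines = malloc((size_t)nworkers * SCANLINE_LEN * sizeof(float));
        MPI_Request *reqs = malloc((size_t)nworkers * sizeof(MPI_Request));

        for (int i = 0; i < nworkers; i++)
            MPI_Irecv(&lines[i * SCANLINE_LEN], SCANLINE_LEN, MPI_FLOAT,
                      i + 1, TAG_SCANLINE, MPI_COMM_WORLD, &reqs[i]);

        MPI_Waitall(nworkers, reqs, MPI_STATUSES_IGNORE);
        printf("received %d scanlines\n", nworkers);

        free(reqs);
        free(lines);
    } else {
        /* Each worker computes its scanline and sends it to rank 0. */
        float line[SCANLINE_LEN];
        for (int i = 0; i < SCANLINE_LEN; i++)
            line[i] = (float)rank;
        MPI_Send(line, SCANLINE_LEN, MPI_FLOAT, 0, TAG_SCANLINE,
                 MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

If any one of those receives never completes (for example, because of the
GM-port issue Tim mentioned), rank 0 sits in opal_progress() beneath
MPI_Waitall indefinitely, which is exactly the shape of both backtraces.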