Sorry to take so long to reply -- as a token of my apology, please accept this patch to your Make-arch that fixes up a few LAM/MPI entries and adds entries for Open MPI (yay open source!). :-)

(note that the LAM/MPI and Open MPI entries are identical except for the ARCH strings)

We have committed a bunch of fixes post-rc4 that seem to resolve the problems in your raytracer app -- I know we still have some bugs left, but I am able to run tachyon with the 2balls.dat sample file over Myrinet with 16 processes.

I just kicked off creation of a snapshot tarball; it should be up on the web site under the "nightly snapshots" downloads section in ~30 minutes: http://www.open-mpi.org/nightly/v1.0/. Look for r7924.

Can you give it a whirl again with this tarball (or svn checkout)?
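Something like the following should pull down and rebuild against the snapshot (a sketch only -- the tarball name, install prefix, and configure flags below are assumptions; adjust them to your setup):

```shell
# Sketch only: the tarball name and install prefix are hypothetical.
SNAP=openmpi-1.0rc5r7924.tar.gz        # substitute the actual r7924 tarball name
PREFIX=/opt/openmpi-r7924-pgi-6.0      # hypothetical install prefix

wget http://www.open-mpi.org/nightly/v1.0/"$SNAP"
tar xzf "$SNAP"
cd "${SNAP%.tar.gz}"                   # unpack dir assumed to match the tarball name
./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
    --prefix="$PREFIX" --with-gm       # or --with-gm=DIR if GM is in a non-default spot
make all install
```

Then rerun the same mpirun command line as before, pointing --prefix at the new install.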

Thanks!

Attachment: make.patch
Description: Binary data




On Oct 18, 2005, at 2:24 PM, Parrott, Chris wrote:


Tim,

I just tried this same code again with 1.0rc4, and I still see the same
symptom.  The gdb stack trace for a hung process looks a bit different
this time, however:

(gdb) bt
#0  0x0000002a98a085e1 in mca_bml_r2_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
#1  0x0000002a986c3080 in mca_pml_ob1_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_pml_ob1.so
#2  0x0000002a95d8378c in opal_progress ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libopal.so.0
#3  0x0000002a95a6d8a5 in opal_condition_wait ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#4  0x0000002a95a6de49 in ompi_request_wait_all ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#5  0x0000002a95937602 in PMPI_Waitall ()
   from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10)
    at parallel.c:229
#7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)


It still seems to be stuck in the MPI_Waitall call, for some reason.

Any ideas?  If you need any additional information from me, please let
me know.

Thanks in advance,

+chris

--
Chris Parrott                    5204 E. Ben White Blvd., M/S 628
Product Development Engineer     Austin, TX 78741
Computational Products Group     (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices           chris.parr...@amd.com

-----Original Message-----
From: Tim S. Woodall [mailto:twood...@lanl.gov]
Sent: Monday, October 17, 2005 2:01 PM
To: Open MPI Users
Cc: Parrott, Chris
Subject: Re: [O-MPI users] OpenMPI hang issue


Hello Chris,

Please give the next release candidate a try. There was an
issue with the GM port that was likely causing this.

Thanks,
Tim


Parrott, Chris wrote:
Greetings,

I have been testing Open MPI 1.0rc3 on a rack of 8 2-processor (single
core) Opteron systems connected via both Gigabit Ethernet and Myrinet.
My testing has been mostly successful, although I have run into a
recurring issue on a few MPI applications.  The symptom is that the
computation seems to progress nearly to completion, and then suddenly
just hangs without terminating.  One code that demonstrates this is
the Tachyon parallel raytracer, available at:

  http://jedi.ks.uiuc.edu/~johns/raytracer/

I am using PGI 6.0-5 to compile Open MPI, so that may be part of the
root cause of this particular problem.

I have attached the output of config.log to this message.  Here is the
output from ompi_info:

                Open MPI: 1.0rc3r7730
   Open MPI SVN revision: r7730
                Open RTE: 1.0rc3r7730
   Open RTE SVN revision: r7730
                    OPAL: 1.0rc3r7730
       OPAL SVN revision: r7730
                  Prefix: /opt/openmpi-1.0rc3-pgi-6.0
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: root
           Configured on: Mon Oct 17 10:10:28 PDT 2005
          Configure host: castor00
                Built by: root
                Built on: Mon Oct 17 10:29:20 PDT 2005
              Built host: castor00
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
              C compiler: pgcc
     C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
            C++ compiler: pgCC
   C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
      Fortran77 compiler: pgf77
  Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
      Fortran90 compiler: pgf90
  Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: 1
              MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
               MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
                 MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
                 MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
                 MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
                 MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
                 MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
                 MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
                  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
                  MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
                 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
                 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
               MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
                MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
                MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
                 MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
                 MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
                 MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
                 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
                 MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)


Here is the command line I am using to invoke Open MPI for my build of
Tachyon:

/opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix /opt/openmpi-1.0rc3-pgi-6.0 \
    --mca pls_rsh_agent rsh --hostfile hostfile.gigeth -np 16 \
    tachyon_base.mpi -o scene.tga scene.dat

Attaching gdb to one of the hung processes, I get the following stack
trace:

(gdb) bt
#0  0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#1  0x0000002a95d83509 in opal_timer_base_get_cycles ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#2  0x0000002a95d8370c in opal_progress ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
#3  0x0000002a95a6d8a5 in opal_condition_wait ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#4  0x0000002a95a6de49 in ompi_request_wait_all ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#5  0x0000002a95937602 in PMPI_Waitall ()
   from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
#6  0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60)
    at parallel.c:229
#7  0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8  0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9  0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)
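The trace above came from attaching interactively; when several ranks hang at once, a batch-mode sweep can grab the same backtrace from every rank on a node (a sketch only -- the pgrep pattern assumes the binary name from the mpirun command line):

```shell
# Batch-mode backtraces from all hung ranks on this node.
# Assumption: the process name matches the tachyon binary invoked above.
for pid in $(pgrep -f tachyon_base.mpi); do
    echo "=== pid $pid ==="
    gdb -batch -p "$pid" -ex bt 2>/dev/null
done
```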

So based on this stack trace, it appears that the application is
hanging on an MPI_Waitall call for some reason.

Does anyone have any ideas as to why this might be happening?  If this
is covered in the FAQ somewhere, then please accept my apologies in
advance.

Many thanks,

+chris

--
Chris Parrott                    5204 E. Ben White Blvd., M/S 628
Product Development Engineer     Austin, TX 78741
Computational Products Group     (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices           chris.parr...@amd.com




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
