Can you try this with the current trunk (r23587 or later)? I just added a number of new features and bug fixes, and I would be interested to see if it fixes the problem. In particular I suspect that this might be related to the Init/Finalize bounding of the checkpoint region.
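For reference, this is the kind of structure I have in mind when I talk about the Init/Finalize bounding, seen from the application side -- a minimal, untested mpi4py sketch (not your attached reproducer), assuming an mpi4py new enough to expose the mpi4py.rc options object:

    # Minimal sketch: bound the checkpointable region explicitly with
    # MPI_Init/MPI_Finalize instead of relying on mpi4py's import-time
    # initialization.  Assumes mpi4py exposes the mpi4py.rc options object.
    import mpi4py
    mpi4py.rc.initialize = False   # don't call MPI_Init at import time
    mpi4py.rc.finalize = False     # don't call MPI_Finalize at interpreter exit

    from mpi4py import MPI

    MPI.Init()                     # checkpoint region is bounded from here...
    try:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        for step in range(1000):   # long-running loop: take checkpoints here
            if rank == 0:
                total = sum(comm.recv(source=s, tag=0) for s in range(1, size))
            else:
                comm.send(rank * step, dest=0, tag=0)
    finally:
        MPI.Finalize()             # ...to here

The try/finally only guarantees that Finalize runs even if the loop raises; the loop body is a stand-in for whatever the real program does.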
-- Josh

On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com> wrote:

> Josh
>
> Attached is the Python program that reproduces the hang I described. The initial part of the file describes the prerequisite modules and the steps to reproduce the problem. Please let me know if you have any questions about reproducing the hang.
>
> Please note that if I add the following lines at the end of the program (for the case where sleep_time is True), the problem disappears, i.e., the program resumes successfully after the checkpoint completes:
>
>     # Add following lines at the end for sleep_time is True
>     else:
>         time.sleep(0.1)
>     # End of added lines
>
> Thanks a lot for your time in looking into this issue.
>
> Regards
> Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
>
> -----Original Message-----
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users <us...@open-mpi.org>
>
> I have not tried to checkpoint an mpi4py application, so I cannot say for sure whether it works or not. You might be hitting something with the Python runtime interacting in an odd way with either Open MPI or BLCR.
>
> Can you attach a debugger and get a backtrace on a stuck checkpoint? That might show us where things are held up.
>
> -- Josh
>
> On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com> wrote:
>
>> Hi
>>
>> I have integrated mpi4py with Open MPI 1.4.2 built with BLCR 0.8.2. When I run ompi-checkpoint on a program written using mpi4py, the program sometimes does not resume after a successful checkpoint is created. This does not happen every time: most of the time the program resumes after a successful checkpoint and completes successfully. Has anyone tested the checkpoint/restart functionality with mpi4py programs? Are there any best practices that I should keep in mind while checkpointing mpi4py programs?
>>
>> Thanks for your time
>> - Ananda
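(For anyone following the thread: as I read it, the workaround Ananda describes boils down to sleeping briefly at the end of the run so the Python interpreter yields while the checkpoint completes. A rough, untested sketch of that shape -- the sleep_time flag and the loop are hypothetical; the real logic is in the attached mpi4py-ompi-bug.py, which is not reproduced here:)

    # Rough sketch of the reported workaround; not the attached reproducer.
    import time
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    sleep_time = True   # hypothetical flag mirroring the reproducer

    # ... the reproducer's send/recv work would go here ...

    if sleep_time:
        # The added lines: a short sleep before the program finishes,
        # which reportedly lets the job resume after the checkpoint.
        time.sleep(0.1)

Whether the sleep matters because of timing in the Python runtime or in the checkpoint region itself is exactly the kind of thing the trunk changes should help narrow down.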
> ------------------------------
>
> Message: 8
> Date: Mon, 9 Aug 2010 13:50:03 -0700
> From: John Hsu <john...@willowgarage.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <us...@open-mpi.org>
>
> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with -npernode 11), so I proceeded to bump up -npernode to 12:
>
> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
>
> and the same error occurs,
>
> (gdb) bt
> #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007fcca7e5ea4b in epoll_dispatch ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2  0x00007fcca7e665fa in opal_event_base_loop ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3  0x00007fcca7e37e69 in opal_progress ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007fcca7dd635c in PMPI_Recv ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00, count=1, datatype=..., source=23, tag=100, status=...)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> (gdb)
>
> (gdb) bt
> #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007f5dc454ba4b in epoll_dispatch ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #2  0x00007f5dc45535fa in opal_event_base_loop ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #3  0x00007f5dc4524e69 in opal_progress ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007f5dc44c574f in PMPI_Send ()
>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790, count=1, datatype=..., dest=0, tag=100)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> (gdb)
>
> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>
>> In your first mail, you mentioned that you are testing the new knem support.
>>
>> Can you try disabling knem and see if that fixes the problem? (i.e., run with "--mca btl_sm_use_knem 0") If it fixes the issue, that might mean we have a knem-based bug.
>>
>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>>
>>> Hi,
>>>
>>> sorry for the confusion, that was indeed the trunk version of things I was running.
>>>
>>> Here's the same problem using
>>>
>>> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
>>>
>>> command-line:
>>>
>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
>>>
>>> back trace on sender:
>>>
>>> (gdb) bt
>>> #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fa004f43a4b in epoll_dispatch ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fa004f4b5fa in opal_event_base_loop ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fa004f1ce69 in opal_progress ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fa004ebb35c in PMPI_Recv ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29, tag=100, status=...)
>>>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>>
>>> back trace on receiver:
>>>
>>> (gdb) bt
>>> #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fcce2f1ea4b in epoll_dispatch ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fcce2f265fa in opal_event_base_loop ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fcce2ef7e69 in opal_progress ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007fccdc677b1d in mca_pml_ob1_send ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fcce2e9874f in PMPI_Send ()
>>>    from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>>     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>>
>>> and attached is my mpi_test file for reference.
>>>
>>> thanks,
>>> John
>>>
>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> You clearly have an issue with version confusion. The file cited in your warning:
>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
>>>
>>> does not exist in 1.5rc5. It only exists in the developer's trunk at this time. Check to ensure you have the right paths set, blow away the install area (in case you have multiple versions installed on top of each other), etc.
>>>
>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>>
>>>> Hi All,
>>>> I am new to openmpi and have encountered an issue using pre-release 1.5rc5, for a simple mpi code (see attached). In this test, nodes 1 to n send out a random number to node 0, and node 0 sums all numbers received.
>>>>
>>>> This code works fine on 1 machine with any number of nodes, and on 3 machines running 10 nodes per machine, but when we try to run 11 nodes per machine this warning appears:
>>>>
>>>> [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
>>>>
>>>> And node 0 (master summing node) hangs on receiving plus another random node hangs on sending indefinitely. Below are the back traces:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>> #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
>>>> #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
>>>> #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
>>>> #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>> #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at precv.c:78
>>>> #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>>
>>>> and for sender is:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>> #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
>>>> #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
>>>> #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740) at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
>>>> #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
>>>> #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80) at psend.c:75
>>>> #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
>>>> (gdb)
>>>>
>>>> The "deadlock" appears to be a machine dependent race condition, different machines fails with different combinations of nodes / machine.
>>>> Below is my command line for reference:
>>>>
>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
>>>>
>>>> The problem does not exist in release 1.4.2 or earlier. We are testing unreleased code for potential knem benefits, but can fall back to 1.4.2 if necessary.
>>>>
>>>> My apologies in advance if I've missed something basic, thanks for any help :)
>>>>
>>>> regards,
>>>> John
>>>> <test.cpp>
>>>
>>> <mpi_test.cpp>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>
> ------------------------------
>
> Message: 9
> Date: Mon, 9 Aug 2010 23:02:51 +0200
> From: Riccardo Murri <riccardo.mu...@gmail.com>
> Subject: Re: [OMPI users] MPI Template Datatype?
> To: Open MPI Users <us...@open-mpi.org>
>
> Hi Alexandru,
>
> you can read all about Boost.MPI at:
>
>     http://www.boost.org/doc/libs/1_43_0/doc/html/mpi.html
>
> On Mon, Aug 9, 2010 at 10:27 PM, Alexandru Blidaru <alexs...@gmail.com> wrote:
>> I basically have to implement a 4D vector. An additional goal of my project
>> is to support char, int, float and double datatypes in the vector.
>
> If your "vector" is fixed-size (i.e., every vector has exactly 4 elements), then you can likely dispense with std::vector and use C-style arrays with templated send/receive calls that are just thin wrappers around MPI_Send/MPI_Recv:
>
>     // BEWARE: untested code!!!
>     #include <mpi.h>
>     #include <stdexcept>
>
>     template <typename T>
>     int send(T* vector, int dest, int tag, MPI_Comm comm) {
>       throw std::logic_error("called generic MyVector::send");
>     }
>
>     template <typename T>
>     int recv(T* vector, int source, int tag, MPI_Comm comm) {
>       throw std::logic_error("called generic MyVector::recv");
>     }
>
> and then you specialize the template for the types you actually use:
>
>     template <>
>     int send<double>(double* vector, int dest, int tag, MPI_Comm comm)
>     {
>       return MPI_Send(vector, 4, MPI_DOUBLE, dest, tag, comm);
>     }
>
>     template <>
>     int recv<double>(double* vector, int source, int tag, MPI_Comm comm)
>     {
>       return MPI_Recv(vector, 4, MPI_DOUBLE, source, tag, comm, MPI_STATUS_IGNORE);
>     }
>
>     // etc.
>
> However, let me warn you that it would likely take more time and effort to write all the template specializations and get them working than to just use Boost.MPI.
>
> Best regards,
> Riccardo
>
> ------------------------------
>
> Message: 10
> Date: Mon, 9 Aug 2010 17:42:26 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: "Open MPI Users" <us...@open-mpi.org>
> Cc: Brice Goglin <brice.gog...@inria.fr>
>
> I've opened a ticket about this -- if it's an actual problem, it's a 1.5 blocker:
>
>     https://svn.open-mpi.org/trac/ompi/ticket/2530
>
> What version of knem and Linux are you using?
>
> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
> [...]
>
> ------------------------------
>
> Message: 11
> Date: Mon, 9 Aug 2010 14:48:04 -0700
> From: John Hsu <john...@willowgarage.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <us...@open-mpi.org>
> Cc: Brice Goglin <brice.gog...@inria.fr>
>
> I've replied in the ticket:
>
>     https://svn.open-mpi.org/trac/ompi/ticket/2530#comment:2
>
> thanks!
> John
>
> On Mon, Aug 9, 2010 at 2:42 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> [...]
>
> ------------------------------
>
> End of users Digest, Vol 1655, Issue 3
> **************************************
>
> <mpi4py-ompi-bug.py>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users