Can you try this with the current trunk (r23587 or later)?

I just added a number of new features and bug fixes, and I would be interested 
to see if it fixes the problem. In particular I suspect that this might be 
related to the Init/Finalize bounding of the checkpoint region.

-- Josh

On Aug 10, 2010, at 2:18 PM, <ananda.mu...@wipro.com> <ananda.mu...@wipro.com> 
wrote:

> Josh
> 
> Please find attached the Python program that reproduces the hang that
> I described. The initial part of this file describes the prerequisite
> modules and the steps to reproduce the problem. Please let me know if
> you have any questions about reproducing the hang.
> 
> Please note that if I add the following lines at the end of the program
> (in the case where sleep_time is True), the problem disappears, i.e., the
> program resumes successfully after the checkpoint completes.
> # Add the following lines at the end, for the case where sleep_time is True
> else:
>       time.sleep(0.1)
> # End of added lines
> 
> 
> Thanks a lot for your time in looking into this issue.
> 
> Regards
> Ananda
> 
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies
> Ph: 972 765 8093
> ananda.mu...@wipro.com
> 
> 
> -----Original Message-----
> Date: Mon, 9 Aug 2010 16:37:58 -0400
> From: Joshua Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] Checkpointing mpi4py program
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <270bd450-743a-4662-9568-1fedfcc6f...@open-mpi.org>
> Content-Type: text/plain; charset=windows-1252
> 
> I have not tried to checkpoint an mpi4py application, so I cannot say
> for sure if it works or not. You might be hitting something with the
> Python runtime interacting in an odd way with either Open MPI or BLCR.
> 
> Can you attach a debugger and get a backtrace on a stuck checkpoint?
> That might show us where things are held up.
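> 
> For example, something along these lines (attach to the PID of the stuck
> process; "<pid>" here is just a placeholder):
> 
>   $ gdb -p <pid>
>   (gdb) thread apply all bt
>   (gdb) detach
>   (gdb) quit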
> 
> -- Josh
> 
> 
> On Aug 9, 2010, at 4:04 PM, <ananda.mu...@wipro.com>
> <ananda.mu...@wipro.com> wrote:
> 
>> Hi
>> 
>> I have integrated mpi4py with Open MPI 1.4.2 built with BLCR 0.8.2.
>> When I run ompi-checkpoint on the program written using mpi4py, I see
>> that the program sometimes does not resume after a successful checkpoint
>> has been created. This does not happen every time: most of the time the
>> program resumes after the checkpoint and completes successfully. Has
>> anyone tested the checkpoint/restart functionality with mpi4py programs?
>> Are there any best practices that I should keep in mind while
>> checkpointing mpi4py programs?
>> 
>> Thanks for your time
>> -          Ananda
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> 
> 
> ------------------------------
> 
> Message: 8
> Date: Mon, 9 Aug 2010 13:50:03 -0700
> From: John Hsu <john...@willowgarage.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>       <AANLkTim63t=wQMeWfHWNnvnVKajOe92e7NG3X=war...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
> -npernode
> 11), so I proceeded to bump up -npernode to 12:
> 
> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode
> 12 --mca btl_sm_use_knem 0  ./bin/mpi_test
> 
> and the same error occurs,
> 
> (gdb) bt
> #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007fcca7e5ea4b in epoll_dispatch ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #2  0x00007fcca7e665fa in opal_event_base_loop ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #3  0x00007fcca7e37e69 in opal_progress ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007fcca7dd635c in PMPI_Recv ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff2a0d7e00,
>    count=1, datatype=..., source=23, tag=100, status=...)
>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
> (gdb)
> 
> 
> (gdb) bt
> #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> #1  0x00007f5dc454ba4b in epoll_dispatch ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #2  0x00007f5dc45535fa in opal_event_base_loop ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #3  0x00007f5dc4524e69 in opal_progress ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> #5  0x00007f5dc44c574f in PMPI_Send ()
>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
> buf=0x7fff6e0c0790,
>    count=1, datatype=..., dest=0, tag=100)
>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
> (gdb)
> 
> 
> 
> 
> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> 
>> In your first mail, you mentioned that you are testing the new knem
>> support.
>> 
>> Can you try disabling knem and see if that fixes the problem?  (i.e.,
> run
>> with --mca btl_sm_use_knem 0).  If it fixes the issue, that might mean
> we
>> have a knem-based bug.
>> 
>> 
>> 
>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>> 
>>> Hi,
>>> 
>>> sorry for the confusion, that was indeed the trunk version of things
> I
>> was running.
>>> 
>>> Here's the same problem using
>>> 
>>> 
>> 
> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.
> bz2
>>> 
>>> command-line:
>>> 
>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode
>> 11 ./bin/mpi_test
>>> 
>>> back trace on sender:
>>> 
>>> (gdb) bt
>>> #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fa004f43a4b in epoll_dispatch ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fa004f4b5fa in opal_event_base_loop ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fa004f1ce69 in opal_progress ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fa004ebb35c in PMPI_Recv ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
>>>    tag=100, status=...)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>> 
>>> back trace on receiver:
>>> 
>>> (gdb) bt
>>> #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fcce2f1ea4b in epoll_dispatch ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fcce2f265fa in opal_event_base_loop ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fcce2ef7e69 in opal_progress ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007fccdc677b1d in mca_pml_ob1_send ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fcce2e9874f in PMPI_Send ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
>> buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>> 
>>> and attached is my mpi_test file for reference.
>>> 
>>> thanks,
>>> John
>>> 
>>> 
>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org>
> wrote:
>>> You clearly have an issue with version confusion. The file cited in
> your
>> warning:
>>> 
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>> 
>>> does not exist in 1.5rc5. It only exists in the developer's trunk at
> this
>> time. Check to ensure you have the right paths set, blow away the
> install
>> area (in case you have multiple versions installed on top of each
> other),
>> etc.
>>> 
>>> 
>>> 
>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>> 
>>>> Hi All,
>>>> I am new to openmpi and have encountered an issue using
> pre-release
>> 1.5rc5, for a simple mpi code (see attached).  In this test, nodes 1
> to n
>> send out a random number to node 0, and node 0 sums all numbers received.
>>>> 
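>>>> (The attached mpi_test.cpp is not reproduced in this digest; the sketch
>>>> below is only a rough, untested approximation of the described pattern,
>>>> written with the MPI C++ bindings that show up in the back traces.)
>>>> 
>>>>   #include <mpi.h>
>>>>   #include <cstdlib>
>>>>   #include <iostream>
>>>> 
>>>>   int main(int argc, char* argv[])
>>>>   {
>>>>     MPI::Init(argc, argv);
>>>>     const int rank = MPI::COMM_WORLD.Get_rank();
>>>>     const int size = MPI::COMM_WORLD.Get_size();
>>>>     const int tag  = 100;
>>>> 
>>>>     if (rank == 0) {
>>>>       // rank 0 collects one number from every other rank and sums them
>>>>       long sum = 0;
>>>>       for (int src = 1; src < size; ++src) {
>>>>         int value = 0;
>>>>         MPI::COMM_WORLD.Recv(&value, 1, MPI::INT, src, tag);
>>>>         sum += value;
>>>>       }
>>>>       std::cout << "sum = " << sum << std::endl;
>>>>     } else {
>>>>       // every other rank sends one random number to rank 0
>>>>       int value = std::rand() % 100;
>>>>       MPI::COMM_WORLD.Send(&value, 1, MPI::INT, 0, tag);
>>>>     }
>>>> 
>>>>     MPI::Finalize();
>>>>     return 0;
>>>>   }
>>>> 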
>>>> This code works fine on 1 machine with any number of nodes, and on
> 3
>> machines running 10 nodes per machine, but when we try to run 11 nodes
> per
>> machine this warning appears:
>>>> 
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>> 
>>>> And node 0 (master summing node) hangs on receiving plus another
> random
>> node hangs on sending indefinitely.  Below are the back traces:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0,
>> arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>> #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0,
>> flags=2) at event.c:838
>>>> #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007f0c604ebb5a in opal_progress () at
>> runtime/opal_progress.c:189
>>>> #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
>> m=0x7f0c60800400) at ../../../../opal/threads/
>>>> condition.h:99
>>>> #6  0x00007f0c59b79dff in ompi_request_wait_completion
> (req=0x2538d80)
>> at ../../../../ompi/request/request.h:377
>>>> #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0,
>> count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
>>>>    status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>> #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1,
>> type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
>> status=0x7fff90f62668)
>>>>    at precv.c:78
>>>> #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100,
> status=...)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>> 
>>>> and for sender is:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880,
>> arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>> #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880,
>> flags=2) at event.c:838
>>>> #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007fedba59c43a in opal_progress () at
>> runtime/opal_progress.c:189
>>>> #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
>> m=0x7fedba8ba740)
>>>>    at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007fedb279742e in ompi_request_wait_completion
> (req=0x2392d80)
>> at ../../../../ompi/request/request.h:377
>>>> #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210,
> count=100,
>> datatype=0x612600, dst=0, tag=100,
>>>>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at
>> pml_ob1_isend.c:125
>>>> #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
>> type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
>>>>    at psend.c:75
>>>> #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800,
>> buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:42
>>>> (gdb)
>>>> 
>>>> The "deadlock" appears to be a machine-dependent race condition;
>> different machines fail with different combinations of nodes /
> machine.
>>>> 
>>>> Below is my command line for reference:
>>>> 
>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile
>> hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca
>> orte_base_help_aggregate 0 -mca opal_debug_locks 1  ./bin/mpi_test
>>>> 
>>>> The problem does not exist in release 1.4.2 or earlier.  We are
> testing
>> unreleased codes for potential knem benefits, but can fall back to
> 1.4.2 if
>> necessary.
>>>> 
>>>> My apologies in advance if I've missed something basic, thanks for
> any
>> help :)
>>>> 
>>>> regards,
>>>> John
>>>> <test.cpp>_______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> <mpi_test.cpp>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> Message: 9
> Date: Mon, 9 Aug 2010 23:02:51 +0200
> From: Riccardo Murri <riccardo.mu...@gmail.com>
> Subject: Re: [OMPI users] MPI Template Datatype?
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>       <AANLkTi=Peq+CQ6t+EXaKhwOT=wd0b8vjwc88shxqr...@mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
> 
> Hi Alexandru,
> 
> you can read all about Boost.MPI at:
> 
>  http://www.boost.org/doc/libs/1_43_0/doc/html/mpi.html
> 
> 
> On Mon, Aug 9, 2010 at 10:27 PM, Alexandru Blidaru <alexs...@gmail.com>
> wrote:
>> I basically have to implement a 4D vector. An additional goal of my
> project
>> is to support char, int, float and double datatypes in the vector.
> 
> If your "vector" is fixed-size (i.e., every vector consists of
> 4 elements), then you can likely do without std::vector and use
> C-style arrays with templated send/receive calls (which would
> just be thin wrappers around MPI_Send/MPI_Recv):
> 
>   // BEWARE: untested code!!!
>   // (requires <mpi.h> and <stdexcept>)
> 
>   // Generic fallbacks: calling send/recv for an unsupported T is an error.
>   template <typename T>
>   int send(T* vector, int dest, int tag, MPI_Comm comm) {
>       throw std::logic_error("called generic MyVector::send");
>   }
> 
>   template <typename T>
>   int recv(T* vector, int source, int tag, MPI_Comm comm) {
>       throw std::logic_error("called generic MyVector::recv");
>   }
> 
> and then you specialize the template for the types you actually use:
> 
>  template <>
>  int send<double>(double* vector, int dest, int tag, MPI_Comm comm)
>  {
>    return MPI_Send(vector, 4, MPI_DOUBLE, dest, tag, comm);
>  }
> 
>  template <>
>  int recv<double>(double* vector, int src, int tag, MPI_Comm comm)
>  {
>    return MPI_Recv(vector, 4, MPI_DOUBLE, src, tag, comm, MPI_STATUS_IGNORE);
>  }
> 
>  // etc.
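> 
> A call site then picks the right specialization automatically, e.g.
> (with made-up values):
> 
>   double v[4] = {1.0, 2.0, 3.0, 4.0};
>   send(v, /* dest = */ 1, /* tag = */ 0, MPI_COMM_WORLD);  // resolves to send<double>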
> 
> However, let me warn you that it would likely take more time and
> effort to write all the template specializations and get them working
> than it would to just use Boost.MPI.
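> 
> For comparison, here is a minimal, untested Boost.MPI sketch (it assumes
> Boost.MPI and Boost.Serialization are available and that at least two
> ranks are running) that ships a std::vector<double> directly:
> 
>   #include <boost/mpi.hpp>
>   #include <boost/serialization/vector.hpp>  // serialization support for std::vector
>   #include <iostream>
>   #include <vector>
> 
>   int main(int argc, char* argv[])
>   {
>     boost::mpi::environment env(argc, argv);
>     boost::mpi::communicator world;
> 
>     if (world.rank() == 0) {
>       std::vector<double> v(4, 1.0);
>       world.send(1, /* tag = */ 0, v);       // dest, tag, payload
>     } else if (world.rank() == 1) {
>       std::vector<double> v;
>       world.recv(0, /* tag = */ 0, v);       // source, tag, payload
>       std::cout << "received " << v.size() << " doubles" << std::endl;
>     }
>     return 0;
>   }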
> 
> Best regards,
> Riccardo
> 
> 
> ------------------------------
> 
> Message: 10
> Date: Mon, 9 Aug 2010 17:42:26 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: "Open MPI Users" <us...@open-mpi.org>
> Cc: Brice Goglin <brice.gog...@inria.fr>
> Message-ID: <7283451e-8c4a-4f15-b8e5-649349abb...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
> 
> I've opened a ticket about this -- if it's an actual problem, it's a 1.5
> blocker:
> 
>    https://svn.open-mpi.org/trac/ompi/ticket/2530
> 
> What version of knem and Linux are you using?
> 
> 
> 
> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
> 
>> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
> -npernode 11), so I proceeded to bump up -npernode to 12:
>> 
>> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode 12 --mca btl_sm_use_knem 0  ./bin/mpi_test
>> 
>> and the same error occurs,
>> 
>> (gdb) bt
>> #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
>> #1  0x00007fcca7e5ea4b in epoll_dispatch ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #2  0x00007fcca7e665fa in opal_event_base_loop ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #3  0x00007fcca7e37e69 in opal_progress ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>> #5  0x00007fcca7dd635c in PMPI_Recv ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff2a0d7e00,
>>    count=1, datatype=..., source=23, tag=100, status=...)
>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>> (gdb)
>> 
>> 
>> (gdb) bt
>> #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
>> #1  0x00007f5dc454ba4b in epoll_dispatch ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #2  0x00007f5dc45535fa in opal_event_base_loop ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #3  0x00007f5dc4524e69 in opal_progress ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>> #5  0x00007f5dc44c574f in PMPI_Send ()
>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
> buf=0x7fff6e0c0790,
>>    count=1, datatype=..., dest=0, tag=100)
>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>> (gdb)
>> 
>> 
>> 
>> 
>> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquy...@cisco.com>
> wrote:
>> In your first mail, you mentioned that you are testing the new knem
> support.
>> 
>> Can you try disabling knem and see if that fixes the problem?  (i.e.,
> run with --mca btl_sm_use_knem 0).  If it fixes the issue, that might
> mean we have a knem-based bug.
>> 
>> 
>> 
>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>> 
>>> Hi,
>>> 
>>> sorry for the confusion, that was indeed the trunk version of things
> I was running.
>>> 
>>> Here's the same problem using
>>> 
>>> 
> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.
> bz2
>>> 
>>> command-line:
>>> 
>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
> -npernode 11 ./bin/mpi_test
>>> 
>>> back trace on sender:
>>> 
>>> (gdb) bt
>>> #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fa004f43a4b in epoll_dispatch ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fa004f4b5fa in opal_event_base_loop ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fa004f1ce69 in opal_progress ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fa004ebb35c in PMPI_Recv ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
>>>    tag=100, status=...)
>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>> 
>>> back trace on receiver:
>>> 
>>> (gdb) bt
>>> #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fcce2f1ea4b in epoll_dispatch ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fcce2f265fa in opal_event_base_loop ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fcce2ef7e69 in opal_progress ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007fccdc677b1d in mca_pml_ob1_send ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fcce2e9874f in PMPI_Send ()
>>>   from
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
> buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>> 
>>> and attached is my mpi_test file for reference.
>>> 
>>> thanks,
>>> John
>>> 
>>> 
>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org>
> wrote:
>>> You clearly have an issue with version confusion. The file cited in
> your warning:
>>> 
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>> 
>>> does not exist in 1.5rc5. It only exists in the developer's trunk at
> this time. Check to ensure you have the right paths set, blow away the
> install area (in case you have multiple versions installed on top of
> each other), etc.
>>> 
>>> 
>>> 
>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>> 
>>>> Hi All,
>>>> I am new to openmpi and have encountered an issue using
> pre-release 1.5rc5, for a simple mpi code (see attached).  In this test,
> nodes 1 to n send out a random number to node 0, and node 0 sums all
> numbers received.
>>>> 
>>>> This code works fine on 1 machine with any number of nodes, and on
> 3 machines running 10 nodes per machine, but when we try to run 11 nodes
> per machine this warning appears:
>>>> 
>>>> [wgsg0:29074] Warning -- mutex was double locked from
> errmgr_hnp.c:772
>>>> 
>>>> And node 0 (master summing node) hangs on receiving plus another
> random node hangs on sending indefinitely.  Below are the back traces:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0,
> arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>> #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0,
> flags=2) at event.c:838
>>>> #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007f0c604ebb5a in opal_progress () at
> runtime/opal_progress.c:189
>>>> #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
> m=0x7f0c60800400) at ../../../../opal/threads/
>>>> condition.h:99
>>>> #6  0x00007f0c59b79dff in ompi_request_wait_completion
> (req=0x2538d80) at ../../../../ompi/request/request.h:377
>>>> #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0,
> count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
>>>>    status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>> #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1,
> type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
> status=0x7fff90f62668)
>>>>    at precv.c:78
>>>> #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
> buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100,
> status=...)
>>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>> 
>>>> and for sender is:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880,
> arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>> #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880,
> flags=2) at event.c:838
>>>> #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
>>>> #4  0x00007fedba59c43a in opal_progress () at
> runtime/opal_progress.c:189
>>>> #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
> m=0x7fedba8ba740)
>>>>    at ../../../../opal/threads/condition.h:99
>>>> #6  0x00007fedb279742e in ompi_request_wait_completion
> (req=0x2392d80) at ../../../../ompi/request/request.h:377
>>>> #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210,
> count=100, datatype=0x612600, dst=0, tag=100,
>>>>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at
> pml_ob1_isend.c:125
>>>> #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
> type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
>>>>    at psend.c:75
>>>> #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800,
> buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>    at
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:42
>>>> (gdb)
>>>> 
>>>> The "deadlock" appears to be a machine-dependent race condition;
> different machines fail with different combinations of nodes / machine.
>>>> 
>>>> Below is my command line for reference:
>>>> 
>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile
> hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca
> orte_base_help_aggregate 0 -mca opal_debug_locks 1  ./bin/mpi_test
>>>> 
>>>> The problem does not exist in release 1.4.2 or earlier.  We are
> testing unreleased codes for potential knem benefits, but can fall back
> to 1.4.2 if necessary.
>>>> 
>>>> My apologies in advance if I've missed something basic, thanks for
> any help :)
>>>> 
>>>> regards,
>>>> John
>>>> <test.cpp>_______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> <mpi_test.cpp>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> 
> ------------------------------
> 
> Message: 11
> Date: Mon, 9 Aug 2010 14:48:04 -0700
> From: John Hsu <john...@willowgarage.com>
> Subject: Re: [OMPI users] deadlock in openmpi 1.5rc5
> To: Open MPI Users <us...@open-mpi.org>
> Cc: Brice Goglin <brice.gog...@inria.fr>
> Message-ID:
>       <aanlktimpmgtuzmsdmgafreonzzdx9krpz+wtxrgah...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> I've replied in the ticket.
> https://svn.open-mpi.org/trac/ompi/ticket/2530#comment:2
> thanks!
> John
> 
> On Mon, Aug 9, 2010 at 2:42 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> 
>> I've opened a ticket about this -- if it's an actual problem, it's a
> 1.5
>> blocker:
>> 
>>   https://svn.open-mpi.org/trac/ompi/ticket/2530
>> 
>> What version of knem and Linux are you using?
>> 
>> 
>> 
>> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
>> 
>>> problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with
>> -npernode 11), so I proceeded to bump up -npernode to 12:
>>> 
>>> $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
>> -npernode 12 --mca btl_sm_use_knem 0  ./bin/mpi_test
>>> 
>>> and the same error occurs,
>>> 
>>> (gdb) bt
>>> #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007fcca7e5ea4b in epoll_dispatch ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007fcca7e665fa in opal_event_base_loop ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007fcca7e37e69 in opal_progress ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007fcca15b6e95 in mca_pml_ob1_recv ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007fcca7dd635c in PMPI_Recv ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff2a0d7e00,
>>>    count=1, datatype=..., source=23, tag=100, status=...)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>> (gdb)
>>> 
>>> 
>>> (gdb) bt
>>> #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
>>> #1  0x00007f5dc454ba4b in epoll_dispatch ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #2  0x00007f5dc45535fa in opal_event_base_loop ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #3  0x00007f5dc4524e69 in opal_progress ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #4  0x00007f5dbdca4b1d in mca_pml_ob1_send ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007f5dc44c574f in PMPI_Send ()
>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
>> buf=0x7fff6e0c0790,
>>>    count=1, datatype=..., dest=0, tag=100)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>> (gdb)
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquy...@cisco.com>
> wrote:
>>> In your first mail, you mentioned that you are testing the new knem
>> support.
>>> 
>>> Can you try disabling knem and see if that fixes the problem?
> (i.e., run
>> with --mca btl_sm_use_knem 0).  If it fixes the issue, that might mean
> we
>> have a knem-based bug.
>>> 
>>> 
>>> 
>>> On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
>>> 
>>>> Hi,
>>>> 
>>>> sorry for the confusion, that was indeed the trunk version of
> things I
>> was running.
>>>> 
>>>> Here's the same problem using
>>>> 
>>>> 
>> 
> http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.
> bz2
>>>> 
>>>> command-line:
>>>> 
>>>> ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX
>> -npernode 11 ./bin/mpi_test
>>>> 
>>>> back trace on sender:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fa004f43a4b in epoll_dispatch ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #2  0x00007fa004f4b5fa in opal_event_base_loop ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #3  0x00007fa004f1ce69 in opal_progress ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #4  0x00007f9ffe69be95 in mca_pml_ob1_recv ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>>> #5  0x00007fa004ebb35c in PMPI_Recv ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff8f5cbb50, count=1, datatype=..., source=29,
>>>>    tag=100, status=...)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>> #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:30
>>>> (gdb)
>>>> 
>>>> back trace on receiver:
>>>> 
>>>> (gdb) bt
>>>> #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
>>>> #1  0x00007fcce2f1ea4b in epoll_dispatch ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #2  0x00007fcce2f265fa in opal_event_base_loop ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #3  0x00007fcce2ef7e69 in opal_progress ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #4  0x00007fccdc677b1d in mca_pml_ob1_send ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/openmpi/mca_pml_ob1.so
>>>> #5  0x00007fcce2e9874f in PMPI_Send ()
>>>>   from
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/lib/libmpi.so.0
>>>> #6  0x000000000040adda in MPI::Comm::Send (this=0x612800,
>> buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>> #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mp
> i/mpi_test/src/mpi_test.cpp:38
>>>> (gdb)
>>>> 
>>>> and attached is my mpi_test file for reference.
>>>> 
>>>> thanks,
>>>> John
>>>> 
>>>> 
>>>> On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org>
>> wrote:
>>>> You clearly have an issue with version confusion. The file cited
> in
>> your warning:
>>>> 
>>>>> [wgsg0:29074] Warning -- mutex was double locked from
>> errmgr_hnp.c:772
>>>> 
>>>> does not exist in 1.5rc5. It only exists in the developer's trunk
> at
>> this time. Check to ensure you have the right paths set, blow away the
>> install area (in case you have multiple versions installed on top of
> each
>> other), etc.
>>>> 
>>>> 
>>>> 
>>>> On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
>>>> 
>>>>> Hi All,
>>>>> I am new to openmpi and have encountered an issue using
> pre-release
>> 1.5rc5, for a simple mpi code (see attached).  In this test, nodes 1
> to n
>> send out a random number to node 0, and node 0 sums all numbers received.
>>>>> 
>>>>> This code works fine on 1 machine with any number of nodes, and
> on 3
>> machines running 10 nodes per machine, but when we try to run 11 nodes
> per
>> machine this warning appears:
>>>>> 
>>>>> [wgsg0:29074] Warning -- mutex was double locked from
>> errmgr_hnp.c:772
>>>>> 
>>>>> And node 0 (master summing node) hangs on receiving plus another
>> random node hangs on sending indefinitely.  Below are the back traces:
>>>>> 
>>>>> (gdb) bt
>>>>> #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
>>>>> #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0,
>> arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
>>>>> #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0,
>> flags=2) at event.c:838
>>>>> #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at
> event.c:766
>>>>> #4  0x00007f0c604ebb5a in opal_progress () at
>> runtime/opal_progress.c:189
>>>>> #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0,
>> m=0x7f0c60800400) at ../../../../opal/threads/
>>>>> condition.h:99
>>>>> #6  0x00007f0c59b79dff in ompi_request_wait_completion
>> (req=0x2538d80) at ../../../../ompi/request/request.h:377
>>>>> #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0,
>> count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40,
>>>>>    status=0x7fff90f62668) at pml_ob1_irecv.c:104
>>>>> #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0,
> count=1,
>> type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40,
>> status=0x7fff90f62668)
>>>>>    at precv.c:78
>>>>> #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800,
>> buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100,
> status=...)
>>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
>>>>> #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
>>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:30
>>>>> (gdb)
>>>>> 
>>>>> and for sender is:
>>>>> 
>>>>> (gdb) bt
>>>>> #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
>>>>> #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880,
>> arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
>>>>> #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880,
>> flags=2) at event.c:838
>>>>> #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at
> event.c:766
>>>>> #4  0x00007fedba59c43a in opal_progress () at
>> runtime/opal_progress.c:189
>>>>> #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0,
>> m=0x7fedba8ba740)
>>>>>    at ../../../../opal/threads/condition.h:99
>>>>> #6  0x00007fedb279742e in ompi_request_wait_completion
>> (req=0x2392d80) at ../../../../ompi/request/request.h:377
>>>>> #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210,
> count=100,
>> datatype=0x612600, dst=0, tag=100,
>>>>>    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at
>> pml_ob1_isend.c:125
>>>>> #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100,
>> type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80)
>>>>>    at psend.c:75
>>>>> #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800,
>> buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
>>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/op
> enmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
>>>>> #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
>>>>>    at
>> 
> /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mp
> i_test/src/mpi_test.cpp:42
>>>>> (gdb)
>>>>> 
>>>>> The "deadlock" appears to be a machine-dependent race condition;
>> different machines fail with different combinations of nodes /
> machine.
>>>>> 
>>>>> Below is my command line for reference:
>>>>> 
>>>>> $ ../openmpi_devel/bin/mpirun -x PATH -hostfile
>> hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca
>> orte_base_help_aggregate 0 -mca opal_debug_locks 1  ./bin/mpi_test
>>>>> 
>>>>> The problem does not exist in release 1.4.2 or earlier.  We are
>> testing unreleased codes for potential knem benefits, but can fall
> back to
>> 1.4.2 if necessary.
>>>>> 
>>>>> My apologies in advance if I've missed something basic, thanks
> for
>> any help :)
>>>>> 
>>>>> regards,
>>>>> John
>>>>> <test.cpp>_______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>>> <mpi_test.cpp>_______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> -------------- next part --------------
> HTML attachment scrubbed and removed
> 
> ------------------------------
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> End of users Digest, Vol 1655, Issue 3
> **************************************
> 
> <mpi4py-ompi-bug.py>_______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

