Hi,

I'd just like to give you an update on this issue.
Since we switched to OpenMPI-1.4.4, we have not been able to reproduce it anymore.

Regards,
Eloi



On 09/29/2010 06:01 AM, Nysal Jan wrote:
Hi Eloi,
We discussed this issue during the weekly developer meeting & there were no further suggestions, apart from checking the driver and firmware levels. The consensus was that it would be better if you could take this up directly with your IB vendor.

Regards
--Nysal

On Mon, Sep 27, 2010 at 8:14 PM, Eloi Gaudry <e...@fft.be> wrote:

    Terry,

    Please find enclosed the requested check outputs (using the
    -output-filename stdout.tag.null option).
    I'm displaying frag->hdr->tag here.

    Eloi

    On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
    > Eloi, sorry can you print out frag->hdr->tag?
    >
    > Unfortunately from your last email I think it will still all have
    > non-zero values.
    > If that ends up being the case then there must be something odd
    with the
    > descriptor pointer to the fragment.
    >
    > --td
    >
    > Eloi Gaudry wrote:
    > > Terry,
    > >
    > > Please find enclosed the requested check outputs (using
    -output-filename
    > > stdout.tag.null option).
    > >
    > > For information, Nysal in his first message referred to
    > > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was
    > > wrong on the receiving side:
    > > #define MCA_PML_OB1_HDR_TYPE_MATCH     (MCA_BTL_TAG_PML + 1)
    > > #define MCA_PML_OB1_HDR_TYPE_RNDV      (MCA_BTL_TAG_PML + 2)
    > > #define MCA_PML_OB1_HDR_TYPE_RGET      (MCA_BTL_TAG_PML + 3)
    > > #define MCA_PML_OB1_HDR_TYPE_ACK       (MCA_BTL_TAG_PML + 4)
    > > #define MCA_PML_OB1_HDR_TYPE_NACK      (MCA_BTL_TAG_PML + 5)
    > > #define MCA_PML_OB1_HDR_TYPE_FRAG      (MCA_BTL_TAG_PML + 6)
    > > #define MCA_PML_OB1_HDR_TYPE_GET       (MCA_BTL_TAG_PML + 7)
    > > #define MCA_PML_OB1_HDR_TYPE_PUT       (MCA_BTL_TAG_PML + 8)
    > > #define MCA_PML_OB1_HDR_TYPE_FIN       (MCA_BTL_TAG_PML + 9)
    > > and in ompi/mca/btl/btl.h
    > > #define MCA_BTL_TAG_PML                0x40
    > >
    > > Eloi
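
For reference, the defines above mean that any legitimate PML-level tag
reaching the openib btl falls in the range MCA_BTL_TAG_PML+1 .. MCA_BTL_TAG_PML+9
(0x41..0x49), so a received hdr->tag of 0 is never valid. Below is a minimal,
self-contained sketch of that range check, based only on the constants quoted
above; is_valid_ob1_tag() is a made-up helper, not an Open MPI function.

/* tag_check.c -- illustrative only, based on the defines quoted above
 * (ompi/mca/pml/ob1/pml_ob1_hdr.h and ompi/mca/btl/btl.h in 1.4.x).
 * A received hdr->tag of 0 is outside this range, which is why the
 * dispatch through the active-message table goes wrong. */
#include <stdio.h>
#include <stdint.h>

#define MCA_BTL_TAG_PML             0x40
#define MCA_PML_OB1_HDR_TYPE_MATCH  (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_FIN    (MCA_BTL_TAG_PML + 9)

/* Return 1 if 'tag' is one of the PML OB1 header types listed above. */
static int is_valid_ob1_tag(uint8_t tag)
{
    return tag >= MCA_PML_OB1_HDR_TYPE_MATCH &&
           tag <= MCA_PML_OB1_HDR_TYPE_FIN;
}

int main(void)
{
    uint8_t tags[] = { 0x00, 0x41, 0x49, 0x4a };
    for (unsigned i = 0; i < sizeof(tags) / sizeof(tags[0]); ++i)
        printf("tag 0x%02x -> %s\n", tags[i],
               is_valid_ob1_tag(tags[i]) ? "valid PML tag" : "INVALID");
    return 0;
}
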
    > >
    > > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
    > >> I am thinking of checking the value of *frag->hdr right before the
    > >> return in the post_send function in
    > >> ompi/mca/btl/openib/btl_openib_endpoint.h.  It is line 548 in the trunk:
    > >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
    > >>
    > >> --td
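
For readers following along in the archive, here is a minimal, self-contained
illustration of the kind of send-side instrumentation discussed here: trap a
zero tag just before the fragment would be posted and dump a stack trace.
The check_tag_before_send() and fake_post_send() helpers are made up for the
sketch; in the real BTL the test would sit right before the return in the
post_send() mentioned above, using the frag->hdr->tag in scope there. The
backtrace calls are glibc-specific; compile with -g -rdynamic.

/* trap_zero_tag.c -- illustrative only (glibc-specific). */
#include <execinfo.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook: call this with the fragment's hdr->tag just
 * before the descriptor would be handed to the HCA. */
static void check_tag_before_send(uint8_t tag)
{
    if (tag != 0)
        return;                       /* normal case: nothing to do */

    void *frames[64];
    int n = backtrace(frames, 64);    /* capture the current call stack */

    fprintf(stderr, "about to send a fragment with hdr->tag == 0\n");
    backtrace_symbols_fd(frames, n, fileno(stderr));
    abort();                          /* stop here so a core file is left */
}

static void fake_post_send(uint8_t tag)
{
    check_tag_before_send(tag);
    /* ... the real code would now post the fragment with ibv_post_send() ... */
}

int main(void)
{
    fake_post_send(0x41);   /* fine */
    fake_post_send(0);      /* triggers the trace and aborts */
    return 0;
}
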
    > >>
    > >> Eloi Gaudry wrote:
    > >>> Hi Terry,
    > >>>
    > >>> Do you have any patch that I could apply to be able to do so? I'm
    > >>> remotely working on a cluster (with a terminal) and I cannot use any
    > >>> parallel or sequential debugger (with a call to xterm...). I can
    > >>> track the frag->hdr->tag value in
    > >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
    > >>> SEND/RDMA_WRITE case, but this is all I can think of on my own.
    > >>>
    > >>> You'll find a stacktrace (receive side) in this thread (10th
    or 11th
    > >>> message) but it might be pointless.
    > >>>
    > >>> Regards,
    > >>> Eloi
    > >>>
    > >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
    > >>>> So it sounds like coalescing is not your issue and that the
    problem
    > >>>> has something to do with the queue sizes.  It would be
    helpful if we
    > >>>> could detect the hdr->tag == 0 issue on the sending side
    and get at
    > >>>> least a stack trace.  There is something really odd going
    on here.
    > >>>>
    > >>>> --td
    > >>>>
    > >>>> Eloi Gaudry wrote:
    > >>>>> Hi Terry,
    > >>>>>
    > >>>>> I'm sorry to say that I might have missed a point here.
    > >>>>>
    > >>>>> I've lately been relaunching all previously failing
    computations with
    > >>>>> the message coalescing feature being switched off, and I
    saw the same
    > >>>>> hdr->tag=0 error several times, always during a collective
    call
    > >>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so
    far). And as
    > >>>>> soon as I switched to the peer queue option I was
    previously using
    > >>>>> (--mca btl_openib_receive_queues P,65536,256,192,128
    instead of using
    > >>>>> --mca btl_openib_use_message_coalescing 0), all
    computations ran
    > >>>>> flawlessly.
    > >>>>>
    > >>>>> As for the reproducer, I've already tried to write
    something but I
    > >>>>> haven't succeeded so far at reproducing the hdr->tag=0
    issue with it.
    > >>>>>
    > >>>>> Eloi
    > >>>>>
    > >>>>> On 24/09/2010 18:37, Terry Dontje wrote:
    > >>>>>> Eloi Gaudry wrote:
    > >>>>>>> Terry,
    > >>>>>>>
    > >>>>>>> You were right, the error indeed seems to come from the
    message
    > >>>>>>> coalescing feature. If I turn it off using the "--mca
    > >>>>>>> btl_openib_use_message_coalescing 0", I'm not able to
    observe the
    > >>>>>>> "hdr->tag=0" error.
    > >>>>>>>
    > >>>>>>> There are some trac requests associated with very similar errors
    > >>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing), but they
    > >>>>>>> are all closed (except
    > >>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be
    > >>>>>>> related), aren't they? What would you suggest, Terry?
    > >>>>>>
    > >>>>>> Interesting, though it looks to me like the segv in
    ticket 2352
    > >>>>>> would have happened on the send side instead of the
    receive side
    > >>>>>> like you have.  As to what to do next it would be really
    nice to
    > >>>>>> have some sort of reproducer that we can try and debug
    what is
    > >>>>>> really going on.  The only other thing to do without a
    reproducer
    > >>>>>> is to inspect the code on the send side to figure out
    what might
    > >>>>>> make it generate a 0 hdr->tag.  Or maybe instrument the
    send side
    > >>>>>> to stop when it is about ready to send a 0 hdr->tag and
    see if we
    > >>>>>> can see how the code got there.
    > >>>>>>
    > >>>>>> I might have some cycles to look at this Monday.
    > >>>>>>
    > >>>>>> --td
    > >>>>>>
    > >>>>>>> Eloi
    > >>>>>>>
    > >>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
    > >>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>> Terry,
    > >>>>>>>>>
    > >>>>>>>>> No, I haven't tried any other values than
    P,65536,256,192,128
    > >>>>>>>>> yet.
    > >>>>>>>>>
    > >>>>>>>>> The reason why is quite simple. I've been reading and reading
    > >>>>>>>>> again this thread to understand the btl_openib_receive_queues
    > >>>>>>>>> meaning and I can't figure out why the default values seem to
    > >>>>>>>>> induce the hdr->tag=0 issue
    > >>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
    > >>>>>>>>
    > >>>>>>>> Yeah, the size of the fragments and number of them
    really should
    > >>>>>>>> not cause this issue.  So I too am a little perplexed
    about it.
    > >>>>>>>>
    > >>>>>>>>> Do you think that the default shared receive queue parameters
    > >>>>>>>>> are erroneous for this specific Mellanox card? Any
    help on
    > >>>>>>>>> finding the proper parameters would actually be much
    > >>>>>>>>> appreciated.
    > >>>>>>>>
    > >>>>>>>> I don't necessarily think it is the queue size for a
    specific card
    > >>>>>>>> but more so the handling of the queues by the BTL when
    using
    > >>>>>>>> certain sizes. At least that is one gut feel I have.
    > >>>>>>>>
    > >>>>>>>> In my mind the tag being 0 is either something below
    OMPI is
    > >>>>>>>> polluting the data fragment or OMPI's internal protocol is
    > >>>>>>>> somehow getting messed up.  I can imagine (no empirical
    data here)
    > >>>>>>>> the queue sizes could change how the OMPI protocol sets
    things
    > >>>>>>>> up. Another thing may be the coalescing feature in the
    openib BTL
    > >>>>>>>> which tries to gang multiple messages into one packet when
    > >>>>>>>> resources are running low.   I can see where changing
    the queue
    > >>>>>>>> sizes might affect the coalescing. So, it might be
    interesting to
    > >>>>>>>> turn off the coalescing.  You can do that by setting "--mca
    > >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
    > >>>>>>>>
    > >>>>>>>> If that doesn't solve the issue then obviously there
    must be
    > >>>>>>>> something else going on :-).
    > >>>>>>>>
    > >>>>>>>> Note, the reason I am interested in this is I am seeing
    a similar
    > >>>>>>>> error condition (hdr->tag == 0) on a development
    system.  Though
    > >>>>>>>> my failing case fails with np=8 using the connectivity test
    > >>>>>>>> program which is mainly point to point and there are not a
    > >>>>>>>> significant amount of data transfers going on either.
    > >>>>>>>>
    > >>>>>>>> --td
    > >>>>>>>>
    > >>>>>>>>> Eloi
    > >>>>>>>>>
    > >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
    > >>>>>>>>>> That is interesting.  So does the number of processes affect
    > >>>>>>>>>> your runs at all?  The times I've seen hdr->tag be 0 have
    > >>>>>>>>>> usually been due to protocol issues.  The tag should never be
    > >>>>>>>>>> 0.  Have you tried receive_queue settings other than the
    > >>>>>>>>>> default and the one you mention?
    > >>>>>>>>>>
    > >>>>>>>>>> I wonder whether a combination of the two receive queues
    > >>>>>>>>>> causes a failure or not.  Something like
    > >>>>>>>>>>
    > >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
    > >>>>>>>>>>
    > >>>>>>>>>> I am wondering if it is the first queuing definition
    causing the
    > >>>>>>>>>> issue or possibly the SRQ defined in the default.
    > >>>>>>>>>>
    > >>>>>>>>>> --td
    > >>>>>>>>>>
    > >>>>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>>>> Hi Terry,
    > >>>>>>>>>>>
    > >>>>>>>>>>> The messages being sent/received can be of any size, but the
    > >>>>>>>>>>> error seems to happen more often with small messages (such as
    > >>>>>>>>>>> an int being broadcast or allreduced). The failing
    > >>>>>>>>>>> communication differs from one run to another, but some spots
    > >>>>>>>>>>> are more likely to fail than others. And as far as I know,
    > >>>>>>>>>>> they are always located next to a small-message communication
    > >>>>>>>>>>> (an int being broadcast, for instance). Other typical message
    > >>>>>>>>>>> sizes are > 10k but can be very much larger.
    > >>>>>>>>>>>
    > >>>>>>>>>>> I've been checking the HCA being used; it's from Mellanox
    > >>>>>>>>>>> (with vendor_part_id=26428). There are no receive_queues
    > >>>>>>>>>>> parameters associated with it.
    > >>>>>>>>>>>
    > >>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
    > >>>>>>>>>>> [...]
    > >>>>>>>>>>>   # A.k.a. ConnectX
    > >>>>>>>>>>>   [Mellanox Hermon]
    > >>>>>>>>>>>   vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
    > >>>>>>>>>>>   vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
    > >>>>>>>>>>>   use_eager_rdma = 1
    > >>>>>>>>>>>   mtu = 2048
    > >>>>>>>>>>>   max_inline_data = 128
    > >>>>>>>>>>> [...]
    > >>>>>>>>>>>
    > >>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:status:writable
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
    > >>>>>>>>>>>
    > >>>>>>>>>>> I was wondering if these parameters (automatically computed
    > >>>>>>>>>>> at openib btl init, from what I understood) were not incorrect
    > >>>>>>>>>>> in some way, so I plugged in some other values:
    > >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
    > >>>>>>>>>>> when encountering a different issue). Since then, I haven't
    > >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0
    > >>>>>>>>>>> in btl_openib_component.c:2881) yet.
    > >>>>>>>>>>>
    > >>>>>>>>>>> Eloi
    > >>>>>>>>>>>
    > >>>>>>>>>>>
    > >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
    > >>>>>>>>>>>
    > >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje
    wrote:
    > >>>>>>>>>>>> Eloi, I am curious about your problem.  Can you
    tell me what
    > >>>>>>>>>>>> size of job it is?  Does it always fail on the same
    bcast,  or
    > >>>>>>>>>>>> same process?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> Thanks for your suggestions.
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> I'm now able to get the checksum computed and
    redirected to
    > >>>>>>>>>>>>> stdout, thanks (I forgot the  "-mca
    pml_base_verbose 5"
    > >>>>>>>>>>>>> option, you were right). I haven't been able to observe the
    > >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
    > >>>>>>>>>>>>> csum) but I'll let you know if it happens.
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> I've got two others question, which may be related
    to the
    > >>>>>>>>>>>>> error observed:
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled
    > >>>>>>>>>>>>> by OpenMPI somehow depend on the btl being used (i.e. if I'm
    > >>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects
    > >>>>>>>>>>>>> as with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> 2/ the segfaults only appear during an MPI collective call,
    > >>>>>>>>>>>>> with very small messages (one int being broadcast, for
    > >>>>>>>>>>>>> instance); I followed the guidelines given at
    > >>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
    > >>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
    > >>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the
    > >>>>>>>>>>>>> segfaults remain. Does the openib btl handle very small
    > >>>>>>>>>>>>> messages differently (even with eager_rdma deactivated)
    > >>>>>>>>>>>>> than tcp?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma?
    > >>>>>>>>>>>> If so, then that would possibly be one difference between the
    > >>>>>>>>>>>> openib btl and tcp, aside from the actual protocol used.
    > >>>>>>>>>>>>
    > >>>>>>>>>>>>>  is there a way to make sure that large messages
    and small
    > >>>>>>>>>>>>>  messages are handled the same way ?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Do you mean so they all look like eager messages?  How large
    > >>>>>>>>>>>> are the messages we are talking about here: 1K, 1M or 10M?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> --td
    > >>>>>>>>>>>>
    > >>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
    > >>>>>>>>>>>>>> Hi Eloi,
    > >>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug)
    and while
    > >>>>>>>>>>>>>> running with the csum PML add "-mca
    pml_base_verbose 5" to
    > >>>>>>>>>>>>>> the command line. This will print the checksum
    details for
    > >>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didn't
    > >>>>>>>>>>>>>> catch anything because the BTL failed. The checksum
    > >>>>>>>>>>>>>> verification is done in the PML, which the BTL
    calls via a
    > >>>>>>>>>>>>>> callback function. In your case the PML callback
    is never
    > >>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling
    > >>>>>>>>>>>>>> checksum tracing also might not be of much use.
    Is it the
    > >>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what
    is the
    > >>>>>>>>>>>>>> message size? I'm not sure what could be the
    problem at
    > >>>>>>>>>>>>>> this moment. I'm afraid you will have to debug
    the BTL to
    > >>>>>>>>>>>>>> find out more.
    > >>>>>>>>>>>>>>
    > >>>>>>>>>>>>>> --Nysal
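
To make the checksum idea above concrete: the csum PML computes a checksum on
the sending side and verifies it in the receive callback, so corrupted payloads
get flagged -- but an invalid hdr->tag means that callback never runs at all.
The following self-contained sketch only illustrates the general
checksum-and-verify pattern with a trivial additive checksum; it is not the
algorithm or API the csum PML actually uses.

/* csum_concept.c -- conceptual illustration only; not the csum PML. */
#include <stdint.h>
#include <stdio.h>

/* Trivial additive checksum over a buffer. */
static uint32_t simple_csum(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum += p[i];
    return sum;
}

int main(void)
{
    char payload[] = "one int being broadcast";
    uint32_t sent_csum = simple_csum(payload, sizeof(payload));

    /* Simulate corruption somewhere between sender and receiver. */
    payload[3] ^= 0x01;

    uint32_t recv_csum = simple_csum(payload, sizeof(payload));
    if (recv_csum != sent_csum)
        fprintf(stderr, "checksum mismatch: data corrupted in transit\n");
    else
        printf("checksum ok\n");
    return 0;
}
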
    > >>>>>>>>>>>>>>
    > >>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry
    <e...@fft.be> wrote:
    > >>>>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> thanks for your response.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I've been unable so far to write a test case
    that could
    > >>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
    > >>>>>>>>>>>>>>> Actually, I'm only observing this issue when
    running an
    > >>>>>>>>>>>>>>> internode computation involving infiniband
    hardware from
    > >>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
    > >>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I checked, double-checked, and rechecked again
    every MPI
    > >>>>>>>>>>>>>>> use performed during a parallel computation and
    I couldn't
    > >>>>>>>>>>>>>>> find any error so far. The fact that the very same
    > >>>>>>>>>>>>>>> parallel computation runs flawlessly when using tcp
    > >>>>>>>>>>>>>>> (and disabling openib support) might seem to indicate
    > >>>>>>>>>>>>>>> that the issue is located somewhere inside the openib
    > >>>>>>>>>>>>>>> btl or at the hardware/driver level.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't
    > >>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the
    > >>>>>>>>>>>>>>> segfault occurs). Any suggestions?
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan
    wrote:
    > >>>>>>>>>>>>>>>> Hi Eloi,
    > >>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read
    the entire
    > >>>>>>>>>>>>>>>> email thread, but do you have a test case which can
    > >>>>>>>>>>>>>>>> reproduce this error? Without that it will be
    difficult to
    > >>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not
    work for an
    > >>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce
    it on an IB
    > >>>>>>>>>>>>>>>> system. There is also a PML called csum, you
    can use it
    > >>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the
    MPI messages
    > >>>>>>>>>>>>>>>> and verify it at the receiver side for any data
    > >>>>>>>>>>>>>>>> corruption. You can try using it to see if it
    is able
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> catch anything.
    > >>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> Regards
    > >>>>>>>>>>>>>>>> --Nysal
    > >>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry
    <e...@fft.be> wrote:
    > >>>>>>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
    you had a
    > >>>>>>>>>>>>>>>>> chance to look
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> at
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> this error.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> --
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Eloi Gaudry
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Free Field Technologies
    > >>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
    > >>>>>>>>>>>>>>>>> Company Phone:   +32 10 487 959
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
    > >>>>>>>>>>>>>>>>> From: Eloi Gaudry <e...@fft.be>
    > >>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
    > >>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
    > >>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using
    > >>>>>>>>>>>>>>>>> openib btl
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to
    have a look at
    > >>>>>>>>>>>>>>>>> this issue.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>> Hi Jeff,
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from:
    > >>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
    > >>>>>>>>>>>>>>>>>> pbn11,pbn10 --mca btl openib,self --display-map --verbose
    > >>>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
    > >>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
    > >>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
    > >>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
    > >>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff
    Squyres wrote:
    > >>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
    > >>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch
    > >>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using 1.4.2 with the
    > >>>>>>>>>>>>>>>>>>>>> suppression file), "Use of uninitialized bytes" and
    > >>>>>>>>>>>>>>>>>>>>> "Conditional jump depending on uninitialized bytes"
    > >>>>>>>>>>>>>>>>>>>>> in different ompi routines. Some of them are located
    > >>>>>>>>>>>>>>>>>>>>> in btl_openib_component.c. I'll send you an output
    > >>>>>>>>>>>>>>>>>>>>> of valgrind shortly.
    > >>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be
    expected --
    > >>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some
    of its
    > >>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware
    of them (and
    > >>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
    > >>>>>>>>>>>>>>>>>>>> uninitialized).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> would it help if I used the upcoming 1.5 version of
    > >>>>>>>>>>>>>>>>>>> openmpi? I read that a huge effort has been done to
    > >>>>>>>>>>>>>>>>>>> clean up the valgrind output, but maybe this doesn't
    > >>>>>>>>>>>>>>>>>>> concern this btl (for the reasons you mentioned).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> Another question: you said that the callback
    > >>>>>>>>>>>>>>>>>>>>> function pointer should never be 0. But can the
    > >>>>>>>>>>>>>>>>>>>>> tag be null (hdr->tag)?
    > >>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an
    integer.
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> I was worrying that its value could not be null.
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to
    build
    > >>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> Thanks for your help,
    > >>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
    > >>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function
    pointer
    > >>>>>>>>>>>>>>>>>>>>>> should never
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> be
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory
    corruption
    > >>>>>>>>>>>>>>>>>>>>>> is occurring.
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because
    the stack
    > >>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through
    python, but
    > >>>>>>>>>>>>>>>>>>>>>> can you run this application through
    valgrind, or
    > >>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger?
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values
    of the
    > >>>>>>>>>>>>>>>>>>>>>>> function
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> parameters:
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
    > >>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
    > >>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rdma_pipeline_send_length = 1048576,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rdma_pipeline_frag_size = 1048576,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_add_procs = 0x2b341eb8ee47<mca_btl_openib_add_procs>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_del_procs = 0x2b341eb90156<mca_btl_openib_del_procs>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_register = 0, btl_finalize = 0x2b341eb93186<mca_btl_openib_finalize>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_alloc = 0x2b341eb90a3e<mca_btl_openib_alloc>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_free = 0x2b341eb91400<mca_btl_openib_free>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_prepare_src = 0x2b341eb91813<mca_btl_openib_prepare_src>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_prepare_dst = 0x2b341eb91f2e<mca_btl_openib_prepare_dst>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_send = 0x2b341eb94517<mca_btl_openib_send>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_sendi = 0x2b341eb9340d<mca_btl_openib_sendi>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_put = 0x2b341eb94660<mca_btl_openib_put>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_get = 0x2b341eb94c4e<mca_btl_openib_get>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_dump = 0x2b341acd45cb<mca_btl_base_dump>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_mpool = 0xf3f4110,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_register_error = 0x2b341eb90565<mca_btl_openib_register_error_cb>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_ft_event = 0x2b341eb952e7<mca_btl_openib_ft_event>}
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
    > >>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
    > >>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
    > >>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file
    generated during
    > >>>>>>>>>>>>>>>>>>>>>>>> a
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> segmentation
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call
    (using
    > >>>>>>>>>>>>>>>>>>>>>>>> openib):
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
    > >>>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
    > >>>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
    > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2
    0x00002aedbc4e25e2
    > >>>>>>>>>>>>>>>>>>>>>>>> in handle_wc (device=0x19024ac0, cq=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3
     0x00002aedbc4e2e9d
    > >>>>>>>>>>>>>>>>>>>>>>>> in
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> poll_device
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> #4
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (device=0x19024ac0)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component_progress () at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> #6
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at
    > >>>>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7
    0x00002aedb859f497
    > >>>>>>>>>>>>>>>>>>>>>>>> in opal_condition_wait (c=0x2aedb888ccc0,
    > >>>>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at
    > >>>>>>>>>>>>>>>>>>>>>>>> ../opal/threads/condition.h:99 #8
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in
    > >>>>>>>>>>>>>>>>>>>>>>>> ompi_request_default_wait_all
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (count=2,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9
    0x00002aedbd7559ad in
    > >>>>>>>>>>>>>>>>>>>>>>>>
    ompi_coll_tuned_allreduce_intra_recursivedoubling
    > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
    > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> coll_tuned_allreduce.c:223
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in
    > >>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed
    > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
    > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
    > >>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63
    > >>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> op=0x6787a20,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> in
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce
    (sendbuf=0x7ffff279d444,
    > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> op=0x6787a20,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004058be8 in
    FEMTown::Domain::align (itf=
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>>
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr =
    0x7ffff279d620, ptr_ =
    > >>>>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ =
    0x6}}},<No data
    > >>>>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in
    > >>>>>>>>>>>>>>>>>>>>>>>>
    FEMTown::Field::detail::align_itfs_and_neighbhors
    > >>>>>>>>>>>>>>>>>>>>>>>> (dim=2,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> set={px
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ =
    0x2f279d640}},
    > >>>>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at
    check.cpp:63 #15
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0x00000000040cbfa8
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements
    (set={px =
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> =
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}},
    check_info=@0x7ffff279d7f0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0,
    > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object*
    (*)(_object*,
    > >>>>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object>
    > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0,
    po1=0x2aaab0765050,
    > >>>>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> #18
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in
    PyField_align_elements_ewrap
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (self=0x0,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90,
    throwflag=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value
    optimized out>,
    > >>>>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3,
    > >>>>>>>>>>>>>>>>>>>>>>>> argcount=1, kws=0x19ace4a0, kwcount=2,
    > >>>>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2,
    closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x7, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19acc418, kwcount=3,
    defs=0x2aaab759e958,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=6, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #24
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x6, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19abd328, kwcount=5,
    defs=0x2aaab891b7e8,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=3, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0xb, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89df0, kwcount=10, defs=0x0,
    defcount=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27
    0x0000000004408f58 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x1, argcount=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89330, kwcount=0,
    defs=0x2aaab8b66668,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=1, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x0, argcount=0,
    kws=0x0,
    > >>>>>>>>>>>>>>>>>>>>>>>> kwcount=0, defs=0x0, defcount=0,
    closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode
    > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0,
    locals=0x190d9700) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32
    0x000000000442853c in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8
    > >>>>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value
    optimized
    > >>>>>>>>>>>>>>>>>>>>>>>> out>, globals=0x192213d0,
    locals=0x192213d0,
    > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at Python/pythonrun.c:1335 #33
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004429690 in
    PyRun_SimpleStringFlags
    > >>>>>>>>>>>>>>>>>>>>>>>> (command=0x192fd3d8 "DIRECT.Actran.main()",
    > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34
    0x0000000001fa1cf9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (this=0x7ffff279f650)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> FEMTown::Main::Batch::run
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> 0x0000000001f9aa25
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at
    > >>>>>>>>>>>>>>>>>>>>>>>> main.cpp:10 (gdb) f 1 #1
     0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    (openib_btl=0x1902f9b0,
    > >>>>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700,
    byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
    reg->cbfunc(
    > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
    reg->cbdata
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> );
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb)
    > >>>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
    > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
    reg->cbfunc(
    > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
    reg->cbdata
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> );
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876
    > >>>>>>>>>>>>>>>>>>>>>>>> 2877     if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
    > >>>>>>>>>>>>>>>>>>>>>>>> 2878         /* call registered callback */
    > >>>>>>>>>>>>>>>>>>>>>>>> 2879         mca_btl_active_message_callback_t* reg;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2880         reg = mca_btl_base_active_message_trigger + hdr->tag;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2881         reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
    > >>>>>>>>>>>>>>>>>>>>>>>> 2882         if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
    > >>>>>>>>>>>>>>>>>>>>>>>> 2883             cqp = (hdr->credits >> 11) & 0x0f;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2884             hdr->credits &= 0x87ff;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2885         } else {
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>> Eloi
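
The listing above also shows why the crash lands at address 0x0:
reg = mca_btl_base_active_message_trigger + hdr->tag indexes a table of
registered callbacks, and with hdr->tag == 0 it selects slot 0, for which
nothing was ever registered (PML tags start at 0x41, per the defines quoted
earlier in the thread), so reg->cbfunc is NULL -- exactly what the gdb prints
above show. Below is a self-contained mock of that dispatch with the kind of
guard one could add while debugging; the types and table are simplified
stand-ins, not Open MPI code, and the guard is not present in stock 1.4.x.

/* dispatch_mock.c -- self-contained mock of the active-message dispatch
 * shown in the gdb listing above (btl_openib_component.c:2880-2881). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef void (*recv_cb_fn_t)(uint8_t tag, void *cbdata);

typedef struct {
    recv_cb_fn_t cbfunc;   /* NULL until somebody registers this tag */
    void        *cbdata;
} active_message_callback_t;

#define TAG_MAX      256
#define TAG_PML_BASE 0x40            /* MCA_BTL_TAG_PML in 1.4.x */

static active_message_callback_t trigger_table[TAG_MAX];   /* all zero */

static void pml_match_cb(uint8_t tag, void *cbdata)
{
    (void)cbdata;
    printf("PML callback invoked for tag 0x%02x\n", tag);
}

/* Mirrors the failing dispatch: table base + hdr->tag, then cbfunc(...). */
static void handle_incoming(uint8_t tag)
{
    active_message_callback_t *reg = trigger_table + tag;

    /* Debugging guard discussed in this thread (not in the stock code):
     * a zero/unregistered tag would otherwise call a NULL pointer,
     * i.e. jump to address 0x0 exactly as in the backtrace above. */
    if (NULL == reg->cbfunc) {
        fprintf(stderr, "invalid hdr->tag 0x%02x: no callback registered\n", tag);
        abort();
    }
    reg->cbfunc(tag, reg->cbdata);
}

int main(void)
{
    /* Only the PML MATCH tag (0x41) gets a callback, as in a real run. */
    trigger_table[TAG_PML_BASE + 1].cbfunc = pml_match_cb;

    handle_incoming(TAG_PML_BASE + 1);   /* works */
    handle_incoming(0);                  /* reproduces the failure mode */
    return 0;
}
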
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observe was that the
    > >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes
    later
    > >>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation.
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could
    > >>>>>>>>>>>>>>>>>>>>>>>>> use "--mca coll tuned" with "--mca btl self,sm,tcp"
    > >>>>>>>>>>>>>>>>>>>>>>>>> so that I could check that the issue is not
    > >>>>>>>>>>>>>>>>>>>>>>>>> somehow limited to the tuned collective routines.
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24
    Edgar Gabriel wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> as well. The segmentation fault I'm observing
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> always happened during a collective
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> communication indeed... it basically switches
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> all collective communications to basic mode,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> right?
    mode, right ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a
    NCA ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand
    > >>>>>>>>>>>>>>>>>>>>>>>>>> networking card)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
    > >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54
    Edgar Gabriel wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic module, e.g.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> used to observe sometimes a (similar ?) problem
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> in the openib btl triggered from the tuned
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> collective component, in cases where the ofed
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> libraries were installed but no NCA was found
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> on a node. It used to work however with the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> basic component.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I couldn't get rid of that
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. I'm now going to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if that helps.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
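
For completeness, a minimal version of the naive MPI_Bcast replacement
mentioned above: the root sends to every other rank with plain
MPI_Send/MPI_Recv. It is illustrative only (linear, no error handling),
just enough to take the tuned/openib collective path out of the picture.

/* naive_bcast.c -- linear broadcast built from MPI_Send/MPI_Recv,
 * as a stand-in for MPI_Bcast while debugging.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static int naive_bcast(void *buf, int count, MPI_Datatype dtype,
                       int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one by one. */
        for (int dst = 0; dst < size; ++dst)
            if (dst != root)
                MPI_Send(buf, count, dtype, dst, /*tag=*/1234, comm);
    } else {
        /* Everyone else receives it straight from the root. */
        MPI_Recv(buf, count, dtype, root, 1234, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 42;                 /* the "one int being broadcast" */

    naive_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, value);

    MPI_Finalize();
    return 0;
}
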
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53
    Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if the segmentation fault
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when using the basic linear
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> bcast algorithm with the proper command
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line you provided.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast
    algorithms,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca
    parameter
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use
    dynamic
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are
    typing this in
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info.  Do the
    following:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca
    coll_tuned_use_dynamic_rules
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 --param
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> coll
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output
    with all the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be
    selected
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives.
    Therefore,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca
    coll_tuned_use_dynamic_rules 1 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1"
    allowed to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear
    algorithm.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway whatever the algorithm
    used, the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does anyone could give some
    advice on ways
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> diagnose
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> the
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue I'm facing ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58
    Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly segfault when
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd like to know
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if there is any way to make OpenMPI
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to a different algorithm than
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the default one being selected for
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Bcast.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010
    11:06:52 Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during an internode parallel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation involving the openib btl
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be observed with OpenMPI-1.3.3).
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    mpirun (Open MPI) 1.4.2
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    Report bugs to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    http://www.open-mpi.org/community/hel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    p/ [pbn08:02624] ***
    Process received
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    signal *** [pbn08:02624]
    Signal:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    Segmentation fault (11)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] Signal code:
    Address
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    not mapped
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (1)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] Failing at
    address:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    (nil) [pbn08:02624] [ 0]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    /lib64/libpthread.so.0
    [0x349540e4c0]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] *** End of error
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> message
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    ***
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    sh: line 1:  2624 Segmentation fault
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the
    openib btl
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl
    self,sm,tcp on the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance),
    I don't
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the
    parallel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help
    to be
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> able: - to diagnose the issue I'm
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> facing with the openib btl -
    understand
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why this issue is observed
    only when
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> using
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much
    appreciated.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> information
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as
    well.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used
    when
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> parallel
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    $NPROCESS --hostfile
    host.list --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp
     --display-map
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
    0 [...]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if
    not using infiniband:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    $NPROCESS --hostfile
    host.list --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp
     --display-map --verbose
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> --mca
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    --


    Eloi Gaudry

    Free Field Technologies
    Company Website: http://www.fft.be
    Company Phone:   +32 10 487 959







--


*Eloi Gaudry*
Senior Product and Development Engineer -- HPC & IT Manager
Company phone:  +32 10 45 12 26         Direct line:    +32 10 49 51 47
Company fax:    +32 10 45 46 26         Email:  eloi.gau...@fft.be
Website:        www.fft.be

