Hi,

I'd just like to give you an update on this issue.
Since we switched to OpenMPI-1.4.4, we have not been able to reproduce it anymore.

Regards,
Eloi



On 09/29/2010 06:01 AM, Nysal Jan wrote:
Hi Eloi,
We discussed this issue during the weekly developer meeting & there were no further suggestions, apart from checking the driver and firmware levels. The consensus was that it would be better if you could take this up directly with your IB vendor.

Regards
--Nysal

On Mon, Sep 27, 2010 at 8:14 PM, Eloi Gaudry <e...@fft.be> wrote:

    Terry,

    Please find enclosed the requested check outputs (using the
    -output-filename stdout.tag.null option).
    I'm displaying frag->hdr->tag here.

    Eloi

    On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
    > Eloi, sorry can you print out frag->hdr->tag?
    >
    > Unfortunately from your last email I think it will still all have
    > non-zero values.
    > If that ends up being the case then there must be something odd
    with the
    > descriptor pointer to the fragment.
    >
    > --td
    >
    > Eloi Gaudry wrote:
    > > Terry,
    > >
    > > Please find enclosed the requested check outputs (using
    -output-filename
    > > stdout.tag.null option).
    > >
    > > For information, Nysal in his first message referred to
    > > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was
    > > wrong on the receiving side:
    > > #define MCA_PML_OB1_HDR_TYPE_MATCH     (MCA_BTL_TAG_PML + 1)
    > > #define MCA_PML_OB1_HDR_TYPE_RNDV      (MCA_BTL_TAG_PML + 2)
    > > #define MCA_PML_OB1_HDR_TYPE_RGET      (MCA_BTL_TAG_PML + 3)
    > > #define MCA_PML_OB1_HDR_TYPE_ACK       (MCA_BTL_TAG_PML + 4)
    > > #define MCA_PML_OB1_HDR_TYPE_NACK      (MCA_BTL_TAG_PML + 5)
    > > #define MCA_PML_OB1_HDR_TYPE_FRAG      (MCA_BTL_TAG_PML + 6)
    > > #define MCA_PML_OB1_HDR_TYPE_GET       (MCA_BTL_TAG_PML + 7)
    > > #define MCA_PML_OB1_HDR_TYPE_PUT       (MCA_BTL_TAG_PML + 8)
    > > #define MCA_PML_OB1_HDR_TYPE_FIN       (MCA_BTL_TAG_PML + 9)
    > > and in ompi/mca/btl/btl.h
    > > #define MCA_BTL_TAG_PML                0x40
    > >
    > > Eloi
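
For reference, the defines above mean that any legitimate PML-level tag
reaching the openib btl falls in the range MCA_BTL_TAG_PML+1 .. MCA_BTL_TAG_PML+9
(0x41..0x49), so a received hdr->tag of 0 is never valid. Below is a minimal,
self-contained sketch of that range check, based only on the constants quoted
above; is_valid_ob1_tag() is a made-up helper, not an Open MPI function.

/* tag_check.c -- illustrative only, based on the defines quoted above
 * (ompi/mca/pml/ob1/pml_ob1_hdr.h and ompi/mca/btl/btl.h in 1.4.x).
 * A received hdr->tag of 0 is outside this range, which is why the
 * dispatch through the active-message table goes wrong. */
#include <stdio.h>
#include <stdint.h>

#define MCA_BTL_TAG_PML             0x40
#define MCA_PML_OB1_HDR_TYPE_MATCH  (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_FIN    (MCA_BTL_TAG_PML + 9)

/* Return 1 if 'tag' is one of the PML OB1 header types listed above. */
static int is_valid_ob1_tag(uint8_t tag)
{
    return tag >= MCA_PML_OB1_HDR_TYPE_MATCH &&
           tag <= MCA_PML_OB1_HDR_TYPE_FIN;
}

int main(void)
{
    uint8_t tags[] = { 0x00, 0x41, 0x49, 0x4a };
    for (unsigned i = 0; i < sizeof(tags) / sizeof(tags[0]); ++i)
        printf("tag 0x%02x -> %s\n", tags[i],
               is_valid_ob1_tag(tags[i]) ? "valid PML tag" : "INVALID");
    return 0;
}
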
    > >
    > > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
    > >> I am thinking of checking the value of *frag->hdr right before the
    > >> return in the post_send function in
    > >> ompi/mca/btl/openib/btl_openib_endpoint.h.  It is line 548 in the trunk:
    > >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
    > >>
    > >> --td
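
For readers following along in the archive, here is a minimal, self-contained
illustration of the kind of send-side instrumentation discussed here: trap a
zero tag just before the fragment would be posted and dump a stack trace.
The check_tag_before_send() and fake_post_send() helpers are made up for the
sketch; in the real BTL the test would sit right before the return in the
post_send() mentioned above, using the frag->hdr->tag in scope there. The
backtrace calls are glibc-specific; compile with -g -rdynamic.

/* trap_zero_tag.c -- illustrative only (glibc-specific). */
#include <execinfo.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook: call this with the fragment's hdr->tag just
 * before the descriptor would be handed to the HCA. */
static void check_tag_before_send(uint8_t tag)
{
    if (tag != 0)
        return;                       /* normal case: nothing to do */

    void *frames[64];
    int n = backtrace(frames, 64);    /* capture the current call stack */

    fprintf(stderr, "about to send a fragment with hdr->tag == 0\n");
    backtrace_symbols_fd(frames, n, fileno(stderr));
    abort();                          /* stop here so a core file is left */
}

static void fake_post_send(uint8_t tag)
{
    check_tag_before_send(tag);
    /* ... the real code would now post the fragment with ibv_post_send() ... */
}

int main(void)
{
    fake_post_send(0x41);   /* fine */
    fake_post_send(0);      /* triggers the trace and aborts */
    return 0;
}
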
    > >>
    > >> Eloi Gaudry wrote:
    > >>> Hi Terry,
    > >>>
    > >>> Do you have any patch that I could apply to be able to do so? I'm
    > >>> remotely working on a cluster (with a terminal) and I cannot use any
    > >>> parallel or sequential debugger (with a call to xterm...). I can
    > >>> track the frag->hdr->tag value in
    > >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
    > >>> SEND/RDMA_WRITE case, but this is all I can think of on my own.
    > >>>
    > >>> You'll find a stacktrace (receive side) in this thread (10th
    or 11th
    > >>> message) but it might be pointless.
    > >>>
    > >>> Regards,
    > >>> Eloi
    > >>>
    > >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
    > >>>> So it sounds like coalescing is not your issue and that the
    problem
    > >>>> has something to do with the queue sizes.  It would be
    helpful if we
    > >>>> could detect the hdr->tag == 0 issue on the sending side
    and get at
    > >>>> least a stack trace.  There is something really odd going
    on here.
    > >>>>
    > >>>> --td
    > >>>>
    > >>>> Eloi Gaudry wrote:
    > >>>>> Hi Terry,
    > >>>>>
    > >>>>> I'm sorry to say that I might have missed a point here.
    > >>>>>
    > >>>>> I've lately been relaunching all previously failing
    computations with
    > >>>>> the message coalescing feature being switched off, and I
    saw the same
    > >>>>> hdr->tag=0 error several times, always during a collective
    call
    > >>>>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so
    far). And as
    > >>>>> soon as I switched to the peer queue option I was
    previously using
    > >>>>> (--mca btl_openib_receive_queues P,65536,256,192,128
    instead of using
    > >>>>> --mca btl_openib_use_message_coalescing 0), all
    computations ran
    > >>>>> flawlessly.
    > >>>>>
    > >>>>> As for the reproducer, I've already tried to write
    something but I
    > >>>>> haven't succeeded so far at reproducing the hdr->tag=0
    issue with it.
    > >>>>>
    > >>>>> Eloi
    > >>>>>
    > >>>>> On 24/09/2010 18:37, Terry Dontje wrote:
    > >>>>>> Eloi Gaudry wrote:
    > >>>>>>> Terry,
    > >>>>>>>
    > >>>>>>> You were right, the error indeed seems to come from the
    message
    > >>>>>>> coalescing feature. If I turn it off using the "--mca
    > >>>>>>> btl_openib_use_message_coalescing 0", I'm not able to
    observe the
    > >>>>>>> "hdr->tag=0" error.
    > >>>>>>>
    > >>>>>>> There are some trac requests associated with very similar errors
    > >>>>>>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing), but they
    > >>>>>>> are all closed (except
    > >>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2352, which might be
    > >>>>>>> related), aren't they? What would you suggest, Terry?
    > >>>>>>
    > >>>>>> Interesting, though it looks to me like the segv in
    ticket 2352
    > >>>>>> would have happened on the send side instead of the
    receive side
    > >>>>>> like you have.  As to what to do next it would be really
    nice to
    > >>>>>> have some sort of reproducer that we can try and debug
    what is
    > >>>>>> really going on.  The only other thing to do without a
    reproducer
    > >>>>>> is to inspect the code on the send side to figure out
    what might
    > >>>>>> make it generate a 0 hdr->tag.  Or maybe instrument the
    send side
    > >>>>>> to stop when it is about ready to send a 0 hdr->tag and
    see if we
    > >>>>>> can see how the code got there.
    > >>>>>>
    > >>>>>> I might have some cycles to look at this Monday.
    > >>>>>>
    > >>>>>> --td
    > >>>>>>
    > >>>>>>> Eloi
    > >>>>>>>
    > >>>>>>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
    > >>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>> Terry,
    > >>>>>>>>>
    > >>>>>>>>> No, I haven't tried any other values than
    P,65536,256,192,128
    > >>>>>>>>> yet.
    > >>>>>>>>>
    > >>>>>>>>> The reason why is quite simple. I've been reading and reading
    > >>>>>>>>> again this thread to understand the btl_openib_receive_queues
    > >>>>>>>>> meaning and I can't figure out why the default values seem to
    > >>>>>>>>> induce the hdr->tag=0 issue
    > >>>>>>>>> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
    > >>>>>>>>
    > >>>>>>>> Yeah, the size of the fragments and number of them
    really should
    > >>>>>>>> not cause this issue.  So I too am a little perplexed
    about it.
    > >>>>>>>>
    > >>>>>>>>> Do you think that the default shared receive queue parameters
    > >>>>>>>>> are erroneous for this specific Mellanox card? Any
    help on
    > >>>>>>>>> finding the proper parameters would actually be much
    > >>>>>>>>> appreciated.
    > >>>>>>>>
    > >>>>>>>> I don't necessarily think it is the queue size for a
    specific card
    > >>>>>>>> but more so the handling of the queues by the BTL when
    using
    > >>>>>>>> certain sizes. At least that is one gut feel I have.
    > >>>>>>>>
    > >>>>>>>> In my mind the tag being 0 is either something below
    OMPI is
    > >>>>>>>> polluting the data fragment or OMPI's internal protocol is
    > >>>>>>>> somehow getting messed up.  I can imagine (no empirical
    data here)
    > >>>>>>>> the queue sizes could change how the OMPI protocol sets
    things
    > >>>>>>>> up. Another thing may be the coalescing feature in the
    openib BTL
    > >>>>>>>> which tries to gang multiple messages into one packet when
    > >>>>>>>> resources are running low.   I can see where changing
    the queue
    > >>>>>>>> sizes might affect the coalescing. So, it might be
    interesting to
    > >>>>>>>> turn off the coalescing.  You can do that by setting "--mca
    > >>>>>>>> btl_openib_use_message_coalescing 0" in your mpirun line.
    > >>>>>>>>
    > >>>>>>>> If that doesn't solve the issue then obviously there
    must be
    > >>>>>>>> something else going on :-).
    > >>>>>>>>
    > >>>>>>>> Note, the reason I am interested in this is I am seeing
    a similar
    > >>>>>>>> error condition (hdr->tag == 0) on a development
    system.  Though
    > >>>>>>>> my failing case fails with np=8 using the connectivity test
    > >>>>>>>> program which is mainly point to point and there are not a
    > >>>>>>>> significant amount of data transfers going on either.
    > >>>>>>>>
    > >>>>>>>> --td
    > >>>>>>>>
    > >>>>>>>>> Eloi
    > >>>>>>>>>
    > >>>>>>>>> On Friday 24 September 2010 14:27:07 you wrote:
    > >>>>>>>>>> That is interesting.  So does the number of processes affect
    > >>>>>>>>>> your runs at all?  The times I've seen hdr->tag be 0 have
    > >>>>>>>>>> usually been due to protocol issues.  The tag should never be
    > >>>>>>>>>> 0.  Have you tried receive_queue settings other than the
    > >>>>>>>>>> default and the one you mention?
    > >>>>>>>>>>
    > >>>>>>>>>> I wonder whether a combination of the two receive queues
    > >>>>>>>>>> causes a failure or not.  Something like
    > >>>>>>>>>>
    > >>>>>>>>>> P,128,256,192,128:P,65536,256,192,128
    > >>>>>>>>>>
    > >>>>>>>>>> I am wondering if it is the first queuing definition
    causing the
    > >>>>>>>>>> issue or possibly the SRQ defined in the default.
    > >>>>>>>>>>
    > >>>>>>>>>> --td
    > >>>>>>>>>>
    > >>>>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>>>> Hi Terry,
    > >>>>>>>>>>>
    > >>>>>>>>>>> The messages being sent/received can be of any size, but the
    > >>>>>>>>>>> error seems to happen more often with small messages (such as
    > >>>>>>>>>>> an int being broadcast or allreduced). The failing
    > >>>>>>>>>>> communication differs from one run to another, but some spots
    > >>>>>>>>>>> are more likely to fail than others. And as far as I know,
    > >>>>>>>>>>> they are always located next to a small-message communication
    > >>>>>>>>>>> (an int being broadcast, for instance). Other typical message
    > >>>>>>>>>>> sizes are > 10k but can be very much larger.
    > >>>>>>>>>>>
    > >>>>>>>>>>> I've been checking the HCA being used; it's from Mellanox
    > >>>>>>>>>>> (with vendor_part_id=26428). There are no receive_queues
    > >>>>>>>>>>> parameters associated with it.
    > >>>>>>>>>>>
    > >>>>>>>>>>> $ cat share/openmpi/mca-btl-openib-device-params.ini as well:
    > >>>>>>>>>>> [...]
    > >>>>>>>>>>>   # A.k.a. ConnectX
    > >>>>>>>>>>>   [Mellanox Hermon]
    > >>>>>>>>>>>   vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3
    > >>>>>>>>>>>   vendor_part_id = 25408,25418,25428,26418,26428,25448,26438,26448,26468,26478,26488
    > >>>>>>>>>>>   use_eager_rdma = 1
    > >>>>>>>>>>>   mtu = 2048
    > >>>>>>>>>>>   max_inline_data = 128
    > >>>>>>>>>>> [...]
    > >>>>>>>>>>>
    > >>>>>>>>>>> $ ompi_info --param btl openib --parsable | grep receive_queues
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:status:writable
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
    > >>>>>>>>>>>   mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
    > >>>>>>>>>>>
    > >>>>>>>>>>> I was wondering if these parameters (automatically computed
    > >>>>>>>>>>> at openib btl init, from what I understood) were not incorrect
    > >>>>>>>>>>> in some way, so I plugged in some other values:
    > >>>>>>>>>>> "P,65536,256,192,128" (someone on the list used those values
    > >>>>>>>>>>> when encountering a different issue). Since then, I haven't
    > >>>>>>>>>>> been able to observe the segfault (occurring as hdr->tag = 0
    > >>>>>>>>>>> in btl_openib_component.c:2881) yet.
    > >>>>>>>>>>>
    > >>>>>>>>>>> Eloi
    > >>>>>>>>>>>
    > >>>>>>>>>>>
    > >>>>>>>>>>> /home/pp_fr/st03230/EG/Softs/openmpi-custom-1.4.2/bin/
    > >>>>>>>>>>>
    > >>>>>>>>>>> On Thursday 23 September 2010 23:33:48 Terry Dontje
    wrote:
    > >>>>>>>>>>>> Eloi, I am curious about your problem.  Can you
    tell me what
    > >>>>>>>>>>>> size of job it is?  Does it always fail on the same
    bcast,  or
    > >>>>>>>>>>>> same process?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Eloi Gaudry wrote:
    > >>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> Thanks for your suggestions.
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> I'm now able to get the checksum computed and
    redirected to
    > >>>>>>>>>>>>> stdout, thanks (I forgot the  "-mca
    pml_base_verbose 5"
    > >>>>>>>>>>>>> option, you were right). I haven't been able to observe the
    > >>>>>>>>>>>>> segmentation fault (with hdr->tag=0) so far (when using pml
    > >>>>>>>>>>>>> csum) but I'll let you know if it happens.
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> I've got two others question, which may be related
    to the
    > >>>>>>>>>>>>> error observed:
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> 1/ does the maximum number of MPI_Comm that can be handled
    > >>>>>>>>>>>>> by OpenMPI somehow depend on the btl being used (i.e. if I'm
    > >>>>>>>>>>>>> using openib, may I use the same number of MPI_Comm objects
    > >>>>>>>>>>>>> as with tcp)? Is there something like MPI_COMM_MAX in OpenMPI?
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> 2/ the segfaults only appear during an MPI collective call,
    > >>>>>>>>>>>>> with very small messages (one int being broadcast, for
    > >>>>>>>>>>>>> instance); I followed the guidelines given at
    > >>>>>>>>>>>>> http://icl.cs.utk.edu/open-mpi/faq/?category=openfabrics#ib-small-message-rdma
    > >>>>>>>>>>>>> but the debug build of OpenMPI asserts if I use a min-size
    > >>>>>>>>>>>>> different from 255. Anyway, if I deactivate eager_rdma, the
    > >>>>>>>>>>>>> segfaults remain. Does the openib btl handle very small
    > >>>>>>>>>>>>> messages differently (even with eager_rdma deactivated)
    > >>>>>>>>>>>>> than tcp?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Others on the list: does coalescing happen with non-eager_rdma?
    > >>>>>>>>>>>> If so, then that would possibly be one difference between the
    > >>>>>>>>>>>> openib btl and tcp, aside from the actual protocol used.
    > >>>>>>>>>>>>
    > >>>>>>>>>>>>>  is there a way to make sure that large messages
    and small
    > >>>>>>>>>>>>>  messages are handled the same way ?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> Do you mean so they all look like eager messages?  How large
    > >>>>>>>>>>>> are the messages we are talking about here: 1K, 1M or 10M?
    > >>>>>>>>>>>>
    > >>>>>>>>>>>> --td
    > >>>>>>>>>>>>
    > >>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>
    > >>>>>>>>>>>>> On Friday 17 September 2010 17:57:17 Nysal Jan wrote:
    > >>>>>>>>>>>>>> Hi Eloi,
    > >>>>>>>>>>>>>> Create a debug build of OpenMPI (--enable-debug)
    and while
    > >>>>>>>>>>>>>> running with the csum PML add "-mca
    pml_base_verbose 5" to
    > >>>>>>>>>>>>>> the command line. This will print the checksum
    details for
    > >>>>>>>>>>>>>> each fragment sent over the wire. I'm guessing it didn't
    > >>>>>>>>>>>>>> catch anything because the BTL failed. The checksum
    > >>>>>>>>>>>>>> verification is done in the PML, which the BTL
    calls via a
    > >>>>>>>>>>>>>> callback function. In your case the PML callback
    is never
    > >>>>>>>>>>>>>> called because the hdr->tag is invalid. So enabling
    > >>>>>>>>>>>>>> checksum tracing also might not be of much use.
    Is it the
    > >>>>>>>>>>>>>> first Bcast that fails or the nth Bcast and what
    is the
    > >>>>>>>>>>>>>> message size? I'm not sure what could be the
    problem at
    > >>>>>>>>>>>>>> this moment. I'm afraid you will have to debug
    the BTL to
    > >>>>>>>>>>>>>> find out more.
    > >>>>>>>>>>>>>>
    > >>>>>>>>>>>>>> --Nysal
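
To make the checksum idea above concrete: the csum PML computes a checksum on
the sending side and verifies it in the receive callback, so corrupted payloads
get flagged -- but an invalid hdr->tag means that callback never runs at all.
The following self-contained sketch only illustrates the general
checksum-and-verify pattern with a trivial additive checksum; it is not the
algorithm or API the csum PML actually uses.

/* csum_concept.c -- conceptual illustration only; not the csum PML. */
#include <stdint.h>
#include <stdio.h>

/* Trivial additive checksum over a buffer. */
static uint32_t simple_csum(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum += p[i];
    return sum;
}

int main(void)
{
    char payload[] = "one int being broadcast";
    uint32_t sent_csum = simple_csum(payload, sizeof(payload));

    /* Simulate corruption somewhere between sender and receiver. */
    payload[3] ^= 0x01;

    uint32_t recv_csum = simple_csum(payload, sizeof(payload));
    if (recv_csum != sent_csum)
        fprintf(stderr, "checksum mismatch: data corrupted in transit\n");
    else
        printf("checksum ok\n");
    return 0;
}
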
    > >>>>>>>>>>>>>>
    > >>>>>>>>>>>>>> On Fri, Sep 17, 2010 at 4:39 PM, Eloi Gaudry
    <e...@fft.be> wrote:
    > >>>>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> thanks for your response.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I've been unable so far to write a test case
    that could
    > >>>>>>>>>>>>>>> illustrate the hdr->tag=0 error.
    > >>>>>>>>>>>>>>> Actually, I'm only observing this issue when
    running an
    > >>>>>>>>>>>>>>> internode computation involving infiniband
    hardware from
    > >>>>>>>>>>>>>>> Mellanox (MT25418, ConnectX IB DDR, PCIe 2.0
    > >>>>>>>>>>>>>>> 2.5GT/s, rev a0) with our time-domain software.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I checked, double-checked, and rechecked again
    every MPI
    > >>>>>>>>>>>>>>> use performed during a parallel computation and
    I couldn't
    > >>>>>>>>>>>>>>> find any error so far. The fact that the very same
    > >>>>>>>>>>>>>>> parallel computation runs flawlessly when using tcp
    > >>>>>>>>>>>>>>> (and disabling openib support) might seem to indicate
    > >>>>>>>>>>>>>>> that the issue is located somewhere inside the openib
    > >>>>>>>>>>>>>>> btl or at the hardware/driver level.
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> I've just used the "-mca pml csum" option and I haven't
    > >>>>>>>>>>>>>>> seen any related messages (when hdr->tag=0 and the
    > >>>>>>>>>>>>>>> segfault occurs). Any suggestions?
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> On Friday 17 September 2010 16:03:34 Nysal Jan
    wrote:
    > >>>>>>>>>>>>>>>> Hi Eloi,
    > >>>>>>>>>>>>>>>> Sorry for the delay in response. I haven't read
    the entire
    > >>>>>>>>>>>>>>>> email thread, but do you have a test case which can
    > >>>>>>>>>>>>>>>> reproduce this error? Without that it will be
    difficult to
    > >>>>>>>>>>>>>>>> nail down the cause. Just to clarify, I do not
    work for an
    > >>>>>>>>>>>>>>>> iwarp vendor. I can certainly try to reproduce
    it on an IB
    > >>>>>>>>>>>>>>>> system. There is also a PML called csum, you
    can use it
    > >>>>>>>>>>>>>>>> via "-mca pml csum", which will checksum the
    MPI messages
    > >>>>>>>>>>>>>>>> and verify it at the receiver side for any data
    > >>>>>>>>>>>>>>>> corruption. You can try using it to see if it
    is able
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> catch anything.
    > >>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> Regards
    > >>>>>>>>>>>>>>>> --Nysal
    > >>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>> On Thu, Sep 16, 2010 at 3:48 PM, Eloi Gaudry
    <e...@fft.be> wrote:
    > >>>>>>>>>>>>>>>>> Hi Nysal,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> I'm sorry to interrupt, but I was wondering if you had a
    you had a
    > >>>>>>>>>>>>>>>>> chance to look
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> at
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> this error.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> --
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Eloi Gaudry
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Free Field Technologies
    > >>>>>>>>>>>>>>>>> Company Website: http://www.fft.be
    > >>>>>>>>>>>>>>>>> Company Phone:   +32 10 487 959
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
    > >>>>>>>>>>>>>>>>> From: Eloi Gaudry <e...@fft.be>
    > >>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
    > >>>>>>>>>>>>>>>>> Date: Wed, 15 Sep 2010 16:27:43 +0200
    > >>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] [openib] segfault when using
    > >>>>>>>>>>>>>>>>> openib btl
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> I was wondering if anybody got a chance to
    have a look at
    > >>>>>>>>>>>>>>>>> this issue.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> On Wednesday 18 August 2010 09:16:26 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>> Hi Jeff,
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> Please find enclosed the output (valgrind.out.gz) from:
    > >>>>>>>>>>>>>>>>>> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host
    > >>>>>>>>>>>>>>>>>> pbn11,pbn10 --mca btl openib,self --display-map --verbose
    > >>>>>>>>>>>>>>>>>> --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0
    > >>>>>>>>>>>>>>>>>> -tag-output /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
    > >>>>>>>>>>>>>>>>>> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
    > >>>>>>>>>>>>>>>>>> --suppressions=./suppressions.python.supp
    > >>>>>>>>>>>>>>>>>> /opt/actran/bin/actranpy_mp ...
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>> On Tuesday 17 August 2010 09:32:53 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>> On Monday 16 August 2010 19:14:47 Jeff
    Squyres wrote:
    > >>>>>>>>>>>>>>>>>>>> On Aug 16, 2010, at 10:05 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>> I did run our application through valgrind but it
    > >>>>>>>>>>>>>>>>>>>>> couldn't find any "Invalid write": there is a bunch
    > >>>>>>>>>>>>>>>>>>>>> of "Invalid read" (I'm using 1.4.2 with the
    > >>>>>>>>>>>>>>>>>>>>> suppression file), "Use of uninitialized bytes" and
    > >>>>>>>>>>>>>>>>>>>>> "Conditional jump depending on uninitialized bytes"
    > >>>>>>>>>>>>>>>>>>>>> in different ompi routines. Some of them are located
    > >>>>>>>>>>>>>>>>>>>>> in btl_openib_component.c. I'll send you an output
    > >>>>>>>>>>>>>>>>>>>>> of valgrind shortly.
    > >>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>> A lot of them in btl_openib_* are to be
    expected --
    > >>>>>>>>>>>>>>>>>>>> OpenFabrics uses OS-bypass methods for some
    of its
    > >>>>>>>>>>>>>>>>>>>> memory, and therefore valgrind is unaware
    of them (and
    > >>>>>>>>>>>>>>>>>>>> therefore incorrectly marks them as
    > >>>>>>>>>>>>>>>>>>>> uninitialized).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> would it help if I used the upcoming 1.5 version of
    > >>>>>>>>>>>>>>>>>>> openmpi? I read that a huge effort has been done to
    > >>>>>>>>>>>>>>>>>>> clean up the valgrind output, but maybe this doesn't
    > >>>>>>>>>>>>>>>>>>> concern this btl (for the reasons you mentioned).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> Another question: you said that the callback
    > >>>>>>>>>>>>>>>>>>>>> function pointer should never be 0. But can the
    > >>>>>>>>>>>>>>>>>>>>> tag be null (hdr->tag)?
    > >>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>> The tag is not a pointer -- it's just an
    integer.
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> I was worrying that its value could not be null.
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> I'll send a valgrind output soon (i need to
    build
    > >>>>>>>>>>>>>>>>>>> libpython without pymalloc first).
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> Thanks for your help,
    > >>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>> On 16/08/2010 18:22, Jeff Squyres wrote:
    > >>>>>>>>>>>>>>>>>>>>>> Sorry for the delay in replying.
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> Odd; the values of the callback function
    pointer
    > >>>>>>>>>>>>>>>>>>>>>> should never
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> be
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0.
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> This seems to suggest some kind of memory
    corruption
    > >>>>>>>>>>>>>>>>>>>>>> is occurring.
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> I don't know if it's possible, because
    the stack
    > >>>>>>>>>>>>>>>>>>>>>> trace looks like you're calling through
    python, but
    > >>>>>>>>>>>>>>>>>>>>>> can you run this application through
    valgrind, or
    > >>>>>>>>>>>>>>>>>>>>>> some other memory-checking debugger?
    > >>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> sorry, i just forgot to add the values
    of the
    > >>>>>>>>>>>>>>>>>>>>>>> function
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> parameters:
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbdata
    > >>>>>>>>>>>>>>>>>>>>>>> $1 = (void *) 0x0
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print openib_btl->super
    > >>>>>>>>>>>>>>>>>>>>>>> $2 = {btl_component = 0x2b341edd7380, btl_eager_limit = 12288,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rndv_eager_limit = 12288, btl_max_send_size = 65536,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rdma_pipeline_send_length = 1048576,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_rdma_pipeline_frag_size = 1048576,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_min_rdma_pipeline_size = 1060864, btl_exclusivity = 1024,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_latency = 10, btl_bandwidth = 800, btl_flags = 310,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_add_procs = 0x2b341eb8ee47<mca_btl_openib_add_procs>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_del_procs = 0x2b341eb90156<mca_btl_openib_del_procs>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_register = 0, btl_finalize = 0x2b341eb93186<mca_btl_openib_finalize>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_alloc = 0x2b341eb90a3e<mca_btl_openib_alloc>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_free = 0x2b341eb91400<mca_btl_openib_free>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_prepare_src = 0x2b341eb91813<mca_btl_openib_prepare_src>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_prepare_dst = 0x2b341eb91f2e<mca_btl_openib_prepare_dst>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_send = 0x2b341eb94517<mca_btl_openib_send>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_sendi = 0x2b341eb9340d<mca_btl_openib_sendi>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_put = 0x2b341eb94660<mca_btl_openib_put>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_get = 0x2b341eb94c4e<mca_btl_openib_get>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_dump = 0x2b341acd45cb<mca_btl_base_dump>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_mpool = 0xf3f4110,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_register_error = 0x2b341eb90565<mca_btl_openib_register_error_cb>,
    > >>>>>>>>>>>>>>>>>>>>>>>   btl_ft_event = 0x2b341eb952e7<mca_btl_openib_ft_event>}
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print hdr->tag
    > >>>>>>>>>>>>>>>>>>>>>>> $3 = 0 '\0'
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print des
    > >>>>>>>>>>>>>>>>>>>>>>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
    > >>>>>>>>>>>>>>>>>>>>>>> (gdb) print reg->cbfunc
    > >>>>>>>>>>>>>>>>>>>>>>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>> On Tuesday 10 August 2010 16:04:08 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Here is the output of a core file
    generated during
    > >>>>>>>>>>>>>>>>>>>>>>>> a
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> segmentation
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> fault observed during a collective call
    (using
    > >>>>>>>>>>>>>>>>>>>>>>>> openib):
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) where
    > >>>>>>>>>>>>>>>>>>>>>>>> #0  0x0000000000000000 in ?? ()
    > >>>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
    > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 #2
    0x00002aedbc4e25e2
    > >>>>>>>>>>>>>>>>>>>>>>>> in handle_wc (device=0x19024ac0, cq=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> wc=0x7ffff279ce90) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3178 #3
     0x00002aedbc4e2e9d
    > >>>>>>>>>>>>>>>>>>>>>>>> in
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> poll_device
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (device=0x19024ac0, count=2) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3318
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> #4
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e34b8 in progress_one_device
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (device=0x19024ac0)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> at btl_openib_component.c:3426 #5
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedbc4e3561 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component_progress () at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:3451
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> #6
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb8b22ab8 in opal_progress () at
    > >>>>>>>>>>>>>>>>>>>>>>>> runtime/opal_progress.c:207 #7
    0x00002aedb859f497
    > >>>>>>>>>>>>>>>>>>>>>>>> in opal_condition_wait (c=0x2aedb888ccc0,
    > >>>>>>>>>>>>>>>>>>>>>>>> m=0x2aedb888cd20) at
    > >>>>>>>>>>>>>>>>>>>>>>>> ../opal/threads/condition.h:99 #8
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00002aedb859fa31 in
    > >>>>>>>>>>>>>>>>>>>>>>>> ompi_request_default_wait_all
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (count=2,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> requests=0x7ffff279d0e0, statuses=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> request/req_wait.c:262 #9
    0x00002aedbd7559ad in
    > >>>>>>>>>>>>>>>>>>>>>>>>
    ompi_coll_tuned_allreduce_intra_recursivedoubling
    > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
    > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> coll_tuned_allreduce.c:223
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> #10 0x00002aedbd7514f7 in
    > >>>>>>>>>>>>>>>>>>>>>>>> ompi_coll_tuned_allreduce_intra_dec_fixed
    > >>>>>>>>>>>>>>>>>>>>>>>> (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440,
    > >>>>>>>>>>>>>>>>>>>>>>>> count=1, dtype=0x6788220, op=0x6787a20,
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0, module=0x19d82b20) at
    > >>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_decision_fixed.c:63
    > >>>>>>>>>>>>>>>>>>>>>>>> #11 0x00002aedb85c7792 in PMPI_Allreduce
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> (sendbuf=0x7ffff279d444,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> op=0x6787a20,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at pallreduce.c:102 #12
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004387dbf
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> in
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::MPI::Allreduce
    (sendbuf=0x7ffff279d444,
    > >>>>>>>>>>>>>>>>>>>>>>>> recvbuf=0x7ffff279d440, count=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> datatype=0x6788220,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> op=0x6787a20,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> comm=0x19d81ff0) at stubs.cpp:626 #13
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004058be8 in
    FEMTown::Domain::align (itf=
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>>
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> = {_vptr.shared_base_ptr =
    0x7ffff279d620, ptr_ =
    > >>>>>>>>>>>>>>>>>>>>>>>> {px = 0x199942a4, pn = {pi_ =
    0x6}}},<No data
    > >>>>>>>>>>>>>>>>>>>>>>>> fields>}) at interface.cpp:371 #14
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000040cb858 in
    > >>>>>>>>>>>>>>>>>>>>>>>>
    FEMTown::Field::detail::align_itfs_and_neighbhors
    > >>>>>>>>>>>>>>>>>>>>>>>> (dim=2,
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> set={px
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> = 0x7ffff279d780, pn = {pi_ =
    0x2f279d640}},
    > >>>>>>>>>>>>>>>>>>>>>>>> check_info=@0x7ffff279d7f0) at
    check.cpp:63 #15
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0x00000000040cbfa8
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> in FEMTown::Field::align_elements
    (set={px =
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x7ffff279d950, pn
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> =
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> {pi_ = 0x66e08d0}},
    check_info=@0x7ffff279d7f0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:159 #16 0x00000000039acdd4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyField_align_elements (self=0x0,
    > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:31 #17
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000001fbf76d in
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Main::ExErrCatch<_object*
    (*)(_object*,
    > >>>>>>>>>>>>>>>>>>>>>>>> _object*, _object*)>::exec<_object>
    > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279dc20, s=0x0,
    po1=0x2aaab0765050,
    > >>>>>>>>>>>>>>>>>>>>>>>> po2=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> #18
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x00000000039acc82 in
    PyField_align_elements_ewrap
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (self=0x0,
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> args=0x2aaab0765050, kwds=0x19d2e950) at
    > >>>>>>>>>>>>>>>>>>>>>>>> check.cpp:39 #19 0x00000000044093a0 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx (f=0x19b52e90,
    throwflag=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>) at Python/ceval.c:3921 #20
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x000000000440aae9 in PyEval_EvalCodeEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x2aaab754ad50, globals=<value
    optimized out>,
    > >>>>>>>>>>>>>>>>>>>>>>>> locals=<value optimized out>, args=0x3,
    > >>>>>>>>>>>>>>>>>>>>>>>> argcount=1, kws=0x19ace4a0, kwcount=2,
    > >>>>>>>>>>>>>>>>>>>>>>>> defs=0x2aaab75e4800, defcount=2,
    closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #21 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19ace2d0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #22
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab7550120,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x7, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19acc418, kwcount=3,
    defs=0x2aaab759e958,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=6, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #23 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19acc1c0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #24
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b5e738,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x6, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19abd328, kwcount=5,
    defs=0x2aaab891b7e8,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=3, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #25 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19abcea0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #26
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4198,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0xb, argcount=1,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89df0, kwcount=10, defs=0x0,
    defcount=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968 #27
    0x0000000004408f58 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a89c40, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #28
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab3eb4288,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x1, argcount=0,
    > >>>>>>>>>>>>>>>>>>>>>>>> kws=0x19a89330, kwcount=0,
    defs=0x2aaab8b66668,
    > >>>>>>>>>>>>>>>>>>>>>>>> defcount=1, closure=0x0) at
    Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #29 0x0000000004408f58 in
    PyEval_EvalFrameEx
    > >>>>>>>>>>>>>>>>>>>>>>>> (f=0x19a891b0, throwflag=<value
    optimized out>) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:3802 #30
    0x000000000440aae9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyEval_EvalCodeEx (co=0x2aaab8b6a738,
    > >>>>>>>>>>>>>>>>>>>>>>>> globals=<value optimized out>,
    locals=<value
    > >>>>>>>>>>>>>>>>>>>>>>>> optimized out>, args=0x0, argcount=0,
    kws=0x0,
    > >>>>>>>>>>>>>>>>>>>>>>>> kwcount=0, defs=0x0, defcount=0,
    closure=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:2968
    > >>>>>>>>>>>>>>>>>>>>>>>> #31 0x000000000440ac02 in PyEval_EvalCode
    > >>>>>>>>>>>>>>>>>>>>>>>> (co=0x1902f9b0, globals=0x0,
    locals=0x190d9700) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/ceval.c:522 #32
    0x000000000442853c in
    > >>>>>>>>>>>>>>>>>>>>>>>> PyRun_StringFlags (str=0x192fd3d8
    > >>>>>>>>>>>>>>>>>>>>>>>> "DIRECT.Actran.main()", start=<value
    optimized
    > >>>>>>>>>>>>>>>>>>>>>>>> out>, globals=0x192213d0,
    locals=0x192213d0,
    > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at Python/pythonrun.c:1335 #33
    > >>>>>>>>>>>>>>>>>>>>>>>> 0x0000000004429690 in
    PyRun_SimpleStringFlags
    > >>>>>>>>>>>>>>>>>>>>>>>> (command=0x192fd3d8 "DIRECT.Actran.main()",
    > >>>>>>>>>>>>>>>>>>>>>>>> flags=0x0) at
    > >>>>>>>>>>>>>>>>>>>>>>>> Python/pythonrun.c:957 #34
    0x0000000001fa1cf9 in
    > >>>>>>>>>>>>>>>>>>>>>>>> FEMTown::Python::FEMPy::run_application
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (this=0x7ffff279f650)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> at fempy.cpp:873 #35 0x000000000434ce99 in
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> FEMTown::Main::Batch::run
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (this=0x7ffff279f650) at batch.cpp:374 #36
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> 0x0000000001f9aa25
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> in main (argc=8, argv=0x7ffff279fa48) at
    > >>>>>>>>>>>>>>>>>>>>>>>> main.cpp:10 (gdb) f 1 #1
     0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    (openib_btl=0x1902f9b0,
    > >>>>>>>>>>>>>>>>>>>>>>>> ep=0x1908a1c0, frag=0x190d9700,
    byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
    reg->cbfunc(
    > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
    reg->cbdata
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> );
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Current language: auto; currently c
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb)
    > >>>>>>>>>>>>>>>>>>>>>>>> #1  0x00002aedbc4e05f4 in
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_handle_incoming
    > >>>>>>>>>>>>>>>>>>>>>>>> (openib_btl=0x1902f9b0, ep=0x1908a1c0,
    > >>>>>>>>>>>>>>>>>>>>>>>> frag=0x190d9700, byte_len=18) at
    > >>>>>>>>>>>>>>>>>>>>>>>> btl_openib_component.c:2881 2881
    reg->cbfunc(
    > >>>>>>>>>>>>>>>>>>>>>>>> &openib_btl->super, hdr->tag, des,
    reg->cbdata
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> );
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> (gdb) l 2876
    > >>>>>>>>>>>>>>>>>>>>>>>> 2877     if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
    > >>>>>>>>>>>>>>>>>>>>>>>> 2878         /* call registered callback */
    > >>>>>>>>>>>>>>>>>>>>>>>> 2879         mca_btl_active_message_callback_t* reg;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2880         reg = mca_btl_base_active_message_trigger + hdr->tag;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2881         reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
    > >>>>>>>>>>>>>>>>>>>>>>>> 2882         if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
    > >>>>>>>>>>>>>>>>>>>>>>>> 2883             cqp = (hdr->credits >> 11) & 0x0f;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2884             hdr->credits &= 0x87ff;
    > >>>>>>>>>>>>>>>>>>>>>>>> 2885         } else {
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>> Eloi
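
The listing above also shows why the crash lands at address 0x0:
reg = mca_btl_base_active_message_trigger + hdr->tag indexes a table of
registered callbacks, and with hdr->tag == 0 it selects slot 0, for which
nothing was ever registered (PML tags start at 0x41, per the defines quoted
earlier in the thread), so reg->cbfunc is NULL -- exactly what the gdb prints
above show. Below is a self-contained mock of that dispatch with the kind of
guard one could add while debugging; the types and table are simplified
stand-ins, not Open MPI code, and the guard is not present in stock 1.4.x.

/* dispatch_mock.c -- self-contained mock of the active-message dispatch
 * shown in the gdb listing above (btl_openib_component.c:2880-2881). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef void (*recv_cb_fn_t)(uint8_t tag, void *cbdata);

typedef struct {
    recv_cb_fn_t cbfunc;   /* NULL until somebody registers this tag */
    void        *cbdata;
} active_message_callback_t;

#define TAG_MAX      256
#define TAG_PML_BASE 0x40            /* MCA_BTL_TAG_PML in 1.4.x */

static active_message_callback_t trigger_table[TAG_MAX];   /* all zero */

static void pml_match_cb(uint8_t tag, void *cbdata)
{
    (void)cbdata;
    printf("PML callback invoked for tag 0x%02x\n", tag);
}

/* Mirrors the failing dispatch: table base + hdr->tag, then cbfunc(...). */
static void handle_incoming(uint8_t tag)
{
    active_message_callback_t *reg = trigger_table + tag;

    /* Debugging guard discussed in this thread (not in the stock code):
     * a zero/unregistered tag would otherwise call a NULL pointer,
     * i.e. jump to address 0x0 exactly as in the backtrace above. */
    if (NULL == reg->cbfunc) {
        fprintf(stderr, "invalid hdr->tag 0x%02x: no callback registered\n", tag);
        abort();
    }
    reg->cbfunc(tag, reg->cbdata);
}

int main(void)
{
    /* Only the PML MATCH tag (0x41) gets a callback, as in a real run. */
    trigger_table[TAG_PML_BASE + 1].cbfunc = pml_match_cb;

    handle_incoming(TAG_PML_BASE + 1);   /* works */
    handle_incoming(0);                  /* reproduces the failure mode */
    return 0;
}
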
    > >>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>> On Friday 16 July 2010 16:01:02 Eloi
    Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>> Hi Edgar,
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> The only difference I could observe was that the
    > >>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault appeared sometimes
    later
    > >>>>>>>>>>>>>>>>>>>>>>>>> during the parallel computation.
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> I'm running out of ideas here. I wish I could
    > >>>>>>>>>>>>>>>>>>>>>>>>> use "--mca coll tuned" with "--mca btl self,sm,tcp"
    > >>>>>>>>>>>>>>>>>>>>>>>>> so that I could check that the issue is not
    > >>>>>>>>>>>>>>>>>>>>>>>>> somehow limited to the tuned collective routines.
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 17:24:24
    Edgar Gabriel wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> hi edgar,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for the tips, I'm gonna try this option
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> as well. The segmentation fault I'm observing
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> always happened during a collective
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> communication indeed... it basically switches
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> all collective communications to basic mode,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> right?
    mode, right ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for my ignorance, but what's a
    NCA ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>> sorry, I meant to type HCA (InfiniBand
    > >>>>>>>>>>>>>>>>>>>>>>>>>> networking card)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
    > >>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
    > >>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday 15 July 2010 16:20:54
    Edgar Gabriel wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> you could try first to use the algorithms in
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> the basic module, e.g.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun -np x --mca coll basic ./mytest
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> and see whether this makes a difference. I
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> used to observe sometimes a (similar ?) problem
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> in the openib btl triggered from the tuned
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> collective component, in cases where the ofed
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> libraries were installed but no NCA was found
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> on a node. It used to work however with the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> basic component.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> Edgar
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> hi Rolf,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately, I couldn't get rid of that
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> annoying segmentation fault when selecting
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> another bcast algorithm. I'm now going to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> replace MPI_Bcast with a naive
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> implementation (using MPI_Send and MPI_Recv)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> and see if that helps.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> éloi
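
For completeness, a minimal version of the naive MPI_Bcast replacement
mentioned above: the root sends to every other rank with plain
MPI_Send/MPI_Recv. It is illustrative only (linear, no error handling),
just enough to take the tuned/openib collective path out of the picture.

/* naive_bcast.c -- linear broadcast built from MPI_Send/MPI_Recv,
 * as a stand-in for MPI_Bcast while debugging.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>

static int naive_bcast(void *buf, int count, MPI_Datatype dtype,
                       int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        /* Root sends the buffer to every other rank, one by one. */
        for (int dst = 0; dst < size; ++dst)
            if (dst != root)
                MPI_Send(buf, count, dtype, dst, /*tag=*/1234, comm);
    } else {
        /* Everyone else receives it straight from the root. */
        MPI_Recv(buf, count, dtype, root, 1234, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, value = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 42;                 /* the "one int being broadcast" */

    naive_bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", rank, value);

    MPI_Finalize();
    return 0;
}
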
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday 14 July 2010 10:59:53
    Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Rolf,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks for your input. You're right, I missed
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the coll_tuned_use_dynamic_rules option.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'll check if the segmentation fault
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> disappears when using the basic linear
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> bcast algorithm with the proper command
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> line you provided.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> vandeVaart
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Eloi:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> To select the different bcast
    algorithms,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need to add an extra mca
    parameter
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that tells the library to use
    dynamic
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> selection. --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_use_dynamic_rules 1
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> One way to make sure you are
    typing this in
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> correctly is
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> use it with ompi_info.  Do the
    following:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ompi_info -mca
    coll_tuned_use_dynamic_rules
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 --param
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> coll
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You should see lots of output
    with all the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> different algorithms that can be
    selected
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> for the various collectives.
    Therefore,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> you need this:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --mca
    coll_tuned_use_dynamic_rules 1 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Rolf
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry
    wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've found that "--mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> coll_tuned_bcast_algorithm 1"
    allowed to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to the basic linear
    algorithm.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Anyway whatever the algorithm
    used, the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> segmentation fault remains.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Does anyone could give some
    advice on ways
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> diagnose
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> the
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> issue I'm facing ?
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Monday 12 July 2010 10:53:58
    Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that seems to randomly segfault when
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using the openib btl. I'd like to know
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> if there is any way to make OpenMPI
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> switch to a different algorithm than
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the default one being selected for
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Bcast.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for your help,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Friday 02 July 2010
    11:06:52 Eloi Gaudry wrote:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm observing a random segmentation
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> fault during an internode parallel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation involving the openib btl
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> be observed with OpenMPI-1.3.3).
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    mpirun (Open MPI) 1.4.2
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    Report bugs to
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    http://www.open-mpi.org/community/hel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    p/ [pbn08:02624] ***
    Process received
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    signal *** [pbn08:02624]
    Signal:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    Segmentation fault (11)
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] Signal code:
    Address
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    not mapped
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>> (1)
    > >>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] Failing at
    address:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    (nil) [pbn08:02624] [ 0]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    /lib64/libpthread.so.0
    [0x349540e4c0]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    [pbn08:02624] *** End of error
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> message
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    ***
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    sh: line 1:  2624 Segmentation fault
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If I choose not to use the
    openib btl
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (by using --mca btl
    self,sm,tcp on the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> command line, for instance),
    I don't
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> encounter any problem and the
    parallel
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation runs flawlessly.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to get some help
    to be
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> able: - to diagnose the issue I'm
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> facing with the openib btl -
    understand
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> why this issue is observed
    only when
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> using
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the openib btl and not when using
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> self,sm,tcp
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any help would be very much
    appreciated.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The outputs of ompi_info and the
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> configure scripts of OpenMPI are
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> enclosed to this email, and some
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> information
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on the infiniband drivers as
    well.
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Here is the command line used
    when
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> launching a
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> parallel
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> computation
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> using infiniband:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    $NPROCESS --hostfile
    host.list --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl openib,sm,self,tcp
     --display-map
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --verbose --version --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
    0 [...]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> and the command line used if
    not using infiniband:
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    $NPROCESS --hostfile
    host.list --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl self,sm,tcp
     --display-map --verbose
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --version
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> --mca
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpi_warn_on_fork 0 --mca
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> btl_openib_want_fork_support
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>> 0
    > >>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> [...]
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Eloi
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

    --


    Eloi Gaudry

    Free Field Technologies
    Company Website: http://www.fft.be
    Company Phone:   +32 10 487 959







--


*Eloi Gaudry*
Senior Product and Development Engineer -- HPC & IT Manager
Company phone:  +32 10 45 12 26         Direct line:    +32 10 49 51 47
Company fax:    +32 10 45 46 26         Email:  eloi.gau...@fft.be
Website:        www.fft.be

