Hi,

I was wondering if anybody got a chance to have a look at this issue.
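
To restate the failure for anyone joining the thread: the crash in the quoted backtrace below is a call through a null function pointer that was looked up in a table indexed by hdr->tag. Here is a minimal sketch of that pattern with a defensive check added. All type and function names in it (recv_cb_fn_t, active_message_callback_t, trigger_table, dispatch, count_cb) are hypothetical simplifications of mine and not the actual Open MPI definitions; it only illustrates how an unregistered or corrupted table entry leads to a jump to address 0.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the receive callback type seen in the gdb output. */
typedef void (*recv_cb_fn_t)(unsigned char tag, void *descriptor, void *cbdata);

/* Hypothetical stand-in for one entry of the active-message trigger table. */
typedef struct {
    recv_cb_fn_t cbfunc;
    void *cbdata;
} active_message_callback_t;

/* One slot per possible 8-bit tag; static storage is zero-initialized,
 * so every slot starts with cbfunc == NULL. */
static active_message_callback_t trigger_table[256];

/* Look up the callback registered for `tag` and invoke it.
 * Returns 0 on success, -1 if no callback is registered for that tag
 * (the unchecked version would jump to address 0 here). */
static int dispatch(unsigned char tag, void *descriptor)
{
    active_message_callback_t *reg = trigger_table + tag;

    if (reg->cbfunc == NULL) {
        fprintf(stderr, "no callback registered for tag %u\n", (unsigned)tag);
        return -1;
    }
    reg->cbfunc(tag, descriptor, reg->cbdata);
    return 0;
}

/* Example callback that only counts how often it was invoked. */
static int calls = 0;

static void count_cb(unsigned char tag, void *descriptor, void *cbdata)
{
    (void)tag;
    (void)descriptor;
    (void)cbdata;
    calls++;
}
```

After setting trigger_table[65].cbfunc = count_cb, dispatch(65, NULL) invokes the callback and returns 0, whereas dispatch(0, NULL) returns -1 instead of crashing, because slot 0 was never registered. This mirrors the hdr->tag = 0 and reg->cbfunc = 0 values printed by gdb, but it says nothing about why the real table entry is null.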

Regards,
Eloi


On Wednesday 18 August 2010 09:16:26 Eloi Gaudry wrote:
> Hi Jeff,
> 
> Please find enclosed the output (valgrind.out.gz) from
> /opt/openmpi-debug-1.4.2/bin/orterun -np 2 --host pbn11,pbn10 --mca btl
> openib,self --display-map --verbose --mca mpi_warn_on_fork 0 --mca
> btl_openib_want_fork_support 0 -tag-output
> /opt/valgrind-3.5.0/bin/valgrind --tool=memcheck
> --suppressions=/opt/openmpi-debug-1.4.2/share/openmpi/openmpi-valgrind.supp
> --suppressions=./suppressions.python.supp
> /opt/actran/bin/actranpy_mp ...
> 
> Thanks,
> Eloi
> 
> On Tuesday 17 August 2010 09:32:53 Eloi Gaudry wrote:
> > On Monday 16 August 2010 19:14:47 Jeff Squyres wrote:
> > > On Aug 16, 2010, at 10:05 AM, Eloi Gaudry wrote:
> > > > I did run our application through valgrind but it couldn't find any
> > > > "Invalid write": there is a bunch of "Invalid read" (I'm using 1.4.2
> > > > with the suppression file), "Use of uninitialized bytes" and
> > > > "Conditional jump depending on uninitialized bytes" in different ompi
> > > > routines. Some of them are located in btl_openib_component.c. I'll
> > > > send you an output of valgrind shortly.
> > > 
> > > A lot of them in btl_openib_* are to be expected -- OpenFabrics uses
> > > OS-bypass methods for some of its memory, and therefore valgrind is
> > > unaware of them (and therefore incorrectly marks them as
> > > uninitialized).
> > 
> > would it help if i used the upcoming 1.5 version of openmpi ? i read that
> > a huge effort has been made to clean up the valgrind output, but maybe
> > this doesn't concern this btl (for the reasons you mentioned).
> > 
> > > > Another question, you said that the callback function pointer should
> > > > never be 0. But can the tag be null (hdr->tag) ?
> > > 
> > > The tag is not a pointer -- it's just an integer.
> > 
> > I was worried that its value might not be allowed to be 0.
> > 
> > I'll send a valgrind output soon (i need to build libpython without
> > pymalloc first).
> > 
> > Thanks,
> > Eloi
> > 
> > > > Thanks for your help,
> > > > Eloi
> > > > 
> > > > On 16/08/2010 18:22, Jeff Squyres wrote:
> > > >> Sorry for the delay in replying.
> > > >> 
> > > >> Odd; the values of the callback function pointer should never be 0.
> > > >> This seems to suggest some kind of memory corruption is occurring.
> > > >> 
> > > >> I don't know if it's possible, because the stack trace looks like
> > > >> you're calling through python, but can you run this application
> > > >> through valgrind, or some other memory-checking debugger?
> > > >> 
> > > >> On Aug 10, 2010, at 7:15 AM, Eloi Gaudry wrote:
> > > >>> Hi,
> > > >>> 
> > > >>> sorry, i just forgot to add the values of the function parameters:
> > > >>> (gdb) print reg->cbdata
> > > >>> $1 = (void *) 0x0
> > > >>> (gdb) print openib_btl->super
> > > > >>> $2 = {btl_component = 0x2b341edd7380,
> > > > >>>   btl_eager_limit = 12288,
> > > > >>>   btl_rndv_eager_limit = 12288,
> > > > >>>   btl_max_send_size = 65536,
> > > > >>>   btl_rdma_pipeline_send_length = 1048576,
> > > > >>>   btl_rdma_pipeline_frag_size = 1048576,
> > > > >>>   btl_min_rdma_pipeline_size = 1060864,
> > > > >>>   btl_exclusivity = 1024,
> > > > >>>   btl_latency = 10,
> > > > >>>   btl_bandwidth = 800,
> > > > >>>   btl_flags = 310,
> > > > >>>   btl_add_procs = 0x2b341eb8ee47 <mca_btl_openib_add_procs>,
> > > > >>>   btl_del_procs = 0x2b341eb90156 <mca_btl_openib_del_procs>,
> > > > >>>   btl_register = 0,
> > > > >>>   btl_finalize = 0x2b341eb93186 <mca_btl_openib_finalize>,
> > > > >>>   btl_alloc = 0x2b341eb90a3e <mca_btl_openib_alloc>,
> > > > >>>   btl_free = 0x2b341eb91400 <mca_btl_openib_free>,
> > > > >>>   btl_prepare_src = 0x2b341eb91813 <mca_btl_openib_prepare_src>,
> > > > >>>   btl_prepare_dst = 0x2b341eb91f2e <mca_btl_openib_prepare_dst>,
> > > > >>>   btl_send = 0x2b341eb94517 <mca_btl_openib_send>,
> > > > >>>   btl_sendi = 0x2b341eb9340d <mca_btl_openib_sendi>,
> > > > >>>   btl_put = 0x2b341eb94660 <mca_btl_openib_put>,
> > > > >>>   btl_get = 0x2b341eb94c4e <mca_btl_openib_get>,
> > > > >>>   btl_dump = 0x2b341acd45cb <mca_btl_base_dump>,
> > > > >>>   btl_mpool = 0xf3f4110,
> > > > >>>   btl_register_error = 0x2b341eb90565 <mca_btl_openib_register_error_cb>,
> > > > >>>   btl_ft_event = 0x2b341eb952e7 <mca_btl_openib_ft_event>}
> > > >>> 
> > > >>> (gdb) print hdr->tag
> > > >>> $3 = 0 '\0'
> > > >>> (gdb) print des
> > > >>> $4 = (mca_btl_base_descriptor_t *) 0xf4a6700
> > > >>> (gdb) print reg->cbfunc
> > > >>> $5 = (mca_btl_base_module_recv_cb_fn_t) 0
> > > >>> 
> > > >>> Eloi
> > > >>> 
> > > >>> On Tuesday 10 August 2010 16:04:08 Eloi Gaudry wrote:
> > > >>>> Hi,
> > > >>>> 
> > > >>>> Here is the output of a core file generated during a segmentation
> > > >>>> fault observed during a collective call (using openib):
> > > >>>> 
> > > > >>>> #0  0x0000000000000000 in ?? ()
> > > > >>>> (gdb) where
> > > > >>>> #0  0x0000000000000000 in ?? ()
> > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming
> > > > >>>>     (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > > > >>>>     byte_len=18) at btl_openib_component.c:2881
> > > > >>>> #2  0x00002aedbc4e25e2 in handle_wc (device=0x19024ac0, cq=0,
> > > > >>>>     wc=0x7ffff279ce90) at btl_openib_component.c:3178
> > > > >>>> #3  0x00002aedbc4e2e9d in poll_device (device=0x19024ac0, count=2)
> > > > >>>>     at btl_openib_component.c:3318
> > > > >>>> #4  0x00002aedbc4e34b8 in progress_one_device (device=0x19024ac0)
> > > > >>>>     at btl_openib_component.c:3426
> > > > >>>> #5  0x00002aedbc4e3561 in btl_openib_component_progress ()
> > > > >>>>     at btl_openib_component.c:3451
> > > > >>>> #6  0x00002aedb8b22ab8 in opal_progress ()
> > > > >>>>     at runtime/opal_progress.c:207
> > > > >>>> #7  0x00002aedb859f497 in opal_condition_wait (c=0x2aedb888ccc0,
> > > > >>>>     m=0x2aedb888cd20) at ../opal/threads/condition.h:99
> > > > >>>> #8  0x00002aedb859fa31 in ompi_request_default_wait_all (count=2,
> > > > >>>>     requests=0x7ffff279d0e0, statuses=0x0) at request/req_wait.c:262
> > > > >>>> #9  0x00002aedbd7559ad in
> > > > >>>>     ompi_coll_tuned_allreduce_intra_recursivedoubling
> > > > >>>>     (sbuf=0x7ffff279d444, rbuf=0x7ffff279d440, count=1,
> > > > >>>>     dtype=0x6788220, op=0x6787a20, comm=0x19d81ff0,
> > > > >>>>     module=0x19d82b20) at coll_tuned_allreduce.c:223
> > > > >>>> #10 0x00002aedbd7514f7 in
> > > > >>>>     ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x7ffff279d444,
> > > > >>>>     rbuf=0x7ffff279d440, count=1, dtype=0x6788220, op=0x6787a20,
> > > > >>>>     comm=0x19d81ff0, module=0x19d82b20)
> > > > >>>>     at coll_tuned_decision_fixed.c:63
> > > > >>>> #11 0x00002aedb85c7792 in PMPI_Allreduce (sendbuf=0x7ffff279d444,
> > > > >>>>     recvbuf=0x7ffff279d440, count=1, datatype=0x6788220,
> > > > >>>>     op=0x6787a20, comm=0x19d81ff0) at pallreduce.c:102
> > > > >>>> #12 0x0000000004387dbf in FEMTown::MPI::Allreduce
> > > > >>>>     (sendbuf=0x7ffff279d444, recvbuf=0x7ffff279d440, count=1,
> > > > >>>>     datatype=0x6788220, op=0x6787a20, comm=0x19d81ff0)
> > > > >>>>     at stubs.cpp:626
> > > > >>>> #13 0x0000000004058be8 in FEMTown::Domain::align (itf=
> > > > >>>>     {<FEMTown::Boost::shared_base_ptr<FEMTown::Domain::Interface>>
> > > > >>>>     = {_vptr.shared_base_ptr = 0x7ffff279d620, ptr_ = {px =
> > > > >>>>     0x199942a4, pn = {pi_ = 0x6}}},<No data fields>})
> > > > >>>>     at interface.cpp:371
> > > > >>>> #14 0x00000000040cb858 in
> > > > >>>>     FEMTown::Field::detail::align_itfs_and_neighbhors (dim=2,
> > > > >>>>     set={px = 0x7ffff279d780, pn = {pi_ = 0x2f279d640}},
> > > > >>>>     check_info=@0x7ffff279d7f0) at check.cpp:63
> > > > >>>> #15 0x00000000040cbfa8 in FEMTown::Field::align_elements
> > > > >>>>     (set={px = 0x7ffff279d950, pn = {pi_ = 0x66e08d0}},
> > > > >>>>     check_info=@0x7ffff279d7f0) at check.cpp:159
> > > > >>>> #16 0x00000000039acdd4 in PyField_align_elements (self=0x0,
> > > > >>>>     args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:31
> > > > >>>> #17 0x0000000001fbf76d in FEMTown::Main::ExErrCatch<_object*
> > > > >>>>     (*)(_object*, _object*, _object*)>::exec<_object>
> > > > >>>>     (this=0x7ffff279dc20, s=0x0, po1=0x2aaab0765050,
> > > > >>>>     po2=0x19d2e950)
> > > > >>>>     at /home/qa/svntop/femtown/modules/main/py/exception.hpp:463
> > > > >>>> #18 0x00000000039acc82 in PyField_align_elements_ewrap (self=0x0,
> > > > >>>>     args=0x2aaab0765050, kwds=0x19d2e950) at check.cpp:39
> > > > >>>> #19 0x00000000044093a0 in PyEval_EvalFrameEx (f=0x19b52e90,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3921
> > > > >>>> #20 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab754ad50,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0x3, argcount=1, kws=0x19ace4a0, kwcount=2,
> > > > >>>>     defs=0x2aaab75e4800, defcount=2, closure=0x0)
> > > > >>>>     at Python/ceval.c:2968
> > > > >>>> #21 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19ace2d0,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > >>>> #22 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab7550120,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0x7, argcount=1, kws=0x19acc418, kwcount=3,
> > > > >>>>     defs=0x2aaab759e958, defcount=6, closure=0x0)
> > > > >>>>     at Python/ceval.c:2968
> > > > >>>> #23 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19acc1c0,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > >>>> #24 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b5e738,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0x6, argcount=1, kws=0x19abd328, kwcount=5,
> > > > >>>>     defs=0x2aaab891b7e8, defcount=3, closure=0x0)
> > > > >>>>     at Python/ceval.c:2968
> > > > >>>> #25 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19abcea0,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > >>>> #26 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4198,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0xb, argcount=1, kws=0x19a89df0, kwcount=10, defs=0x0,
> > > > >>>>     defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > >>>> #27 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a89c40,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > >>>> #28 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab3eb4288,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0x1, argcount=0, kws=0x19a89330, kwcount=0,
> > > > >>>>     defs=0x2aaab8b66668, defcount=1, closure=0x0)
> > > > >>>>     at Python/ceval.c:2968
> > > > >>>> #29 0x0000000004408f58 in PyEval_EvalFrameEx (f=0x19a891b0,
> > > > >>>>     throwflag=<value optimized out>) at Python/ceval.c:3802
> > > > >>>> #30 0x000000000440aae9 in PyEval_EvalCodeEx (co=0x2aaab8b6a738,
> > > > >>>>     globals=<value optimized out>, locals=<value optimized out>,
> > > > >>>>     args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0,
> > > > >>>>     defcount=0, closure=0x0) at Python/ceval.c:2968
> > > > >>>> #31 0x000000000440ac02 in PyEval_EvalCode (co=0x1902f9b0,
> > > > >>>>     globals=0x0, locals=0x190d9700) at Python/ceval.c:522
> > > > >>>> #32 0x000000000442853c in PyRun_StringFlags (str=0x192fd3d8
> > > > >>>>     "DIRECT.Actran.main()", start=<value optimized out>,
> > > > >>>>     globals=0x192213d0, locals=0x192213d0, flags=0x0)
> > > > >>>>     at Python/pythonrun.c:1335
> > > > >>>> #33 0x0000000004429690 in PyRun_SimpleStringFlags
> > > > >>>>     (command=0x192fd3d8 "DIRECT.Actran.main()", flags=0x0)
> > > > >>>>     at Python/pythonrun.c:957
> > > > >>>> #34 0x0000000001fa1cf9 in FEMTown::Python::FEMPy::run_application
> > > > >>>>     (this=0x7ffff279f650) at fempy.cpp:873
> > > > >>>> #35 0x000000000434ce99 in FEMTown::Main::Batch::run
> > > > >>>>     (this=0x7ffff279f650) at batch.cpp:374
> > > > >>>> #36 0x0000000001f9aa25 in main (argc=8, argv=0x7ffff279fa48)
> > > > >>>>     at main.cpp:10
> > > > >>>> (gdb) f 1
> > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming
> > > > >>>>     (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > > > >>>>     byte_len=18) at btl_openib_component.c:2881
> > > > >>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > >>>> Current language: auto; currently c
> > > > >>>> (gdb)
> > > > >>>> #1  0x00002aedbc4e05f4 in btl_openib_handle_incoming
> > > > >>>>     (openib_btl=0x1902f9b0, ep=0x1908a1c0, frag=0x190d9700,
> > > > >>>>     byte_len=18) at btl_openib_component.c:2881
> > > > >>>> 2881            reg->cbfunc( &openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > >>>> (gdb) l 2876
> > > > >>>> 2877        if(OPAL_LIKELY(!(is_credit_msg = is_credit_message(frag)))) {
> > > > >>>> 2878            /* call registered callback */
> > > > >>>> 2879            mca_btl_active_message_callback_t* reg;
> > > > >>>> 2880            reg = mca_btl_base_active_message_trigger + hdr->tag;
> > > > >>>> 2881            reg->cbfunc(&openib_btl->super, hdr->tag, des, reg->cbdata );
> > > > >>>> 2882            if(MCA_BTL_OPENIB_RDMA_FRAG(frag)) {
> > > > >>>> 2883                cqp = (hdr->credits >> 11) & 0x0f;
> > > > >>>> 2884                hdr->credits &= 0x87ff;
> > > > >>>> 2885            } else {
> > > >>>> 
> > > >>>> Regards,
> > > >>>> Eloi
> > > >>>> 
> > > >>>> On Friday 16 July 2010 16:01:02 Eloi Gaudry wrote:
> > > >>>>> Hi Edgar,
> > > >>>>> 
> > > > >>>>> The only difference I could observe was that the segmentation
> > > > >>>>> fault sometimes appeared later during the parallel computation.
> > > > >>>>> 
> > > > >>>>> I'm running out of ideas here. I wish I could use "--mca coll
> > > > >>>>> tuned" with "--mca btl self,sm,tcp" so that I could check that the
> > > > >>>>> issue is not somehow limited to the tuned collective routines.
> > > >>>>> 
> > > >>>>> Thanks,
> > > >>>>> Eloi
> > > >>>>> 
> > > >>>>> On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> > > >>>>>> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > > >>>>>>> hi edgar,
> > > >>>>>>> 
> > > > >>>>>>> thanks for the tips, I'm gonna try this option as well. the
> > > > >>>>>>> segmentation fault i'm observing has always happened during a
> > > > >>>>>>> collective communication indeed... it basically switches all
> > > > >>>>>>> collective communication to basic mode, right ?
> > > >>>>>>> 
> > > >>>>>>> sorry for my ignorance, but what's a NCA ?
> > > >>>>>> 
> > > > >>>>>> sorry, I meant to type HCA (InfiniBand networking card)
> > > >>>>>> 
> > > >>>>>> Thanks
> > > >>>>>> Edgar
> > > >>>>>> 
> > > >>>>>>> thanks,
> > > >>>>>>> éloi
> > > >>>>>>> 
> > > >>>>>>> On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> > > >>>>>>>> you could try first to use the algorithms in the basic module,
> > > >>>>>>>> e.g.
> > > >>>>>>>> 
> > > >>>>>>>> mpirun -np x --mca coll basic ./mytest
> > > >>>>>>>> 
> > > >>>>>>>> and see whether this makes a difference. I used to observe
> > > >>>>>>>> sometimes a (similar ?) problem in the openib btl triggered
> > > >>>>>>>> from the tuned collective component, in cases where the ofed
> > > >>>>>>>> libraries were installed but no NCA was found on a node. It
> > > >>>>>>>> used to work however with the basic component.
> > > >>>>>>>> 
> > > >>>>>>>> Thanks
> > > >>>>>>>> Edgar
> > > >>>>>>>> 
> > > >>>>>>>> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> > > >>>>>>>>> hi Rolf,
> > > >>>>>>>>> 
> > > >>>>>>>>> unfortunately, i couldn't get rid of that annoying
> > > >>>>>>>>> segmentation fault when selecting another bcast algorithm.
> > > >>>>>>>>> i'm now going to replace MPI_Bcast with a naive
> > > >>>>>>>>> implementation (using MPI_Send and MPI_Recv) and see if that
> > > >>>>>>>>> helps.
> > > >>>>>>>>> 
> > > >>>>>>>>> regards,
> > > >>>>>>>>> éloi
> > > >>>>>>>>> 
> > > >>>>>>>>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> > > >>>>>>>>>> Hi Rolf,
> > > >>>>>>>>>> 
> > > > >>>>>>>>>> thanks for your input. You're right, I missed the
> > > > >>>>>>>>>> coll_tuned_use_dynamic_rules option.
> > > > >>>>>>>>>> 
> > > > >>>>>>>>>> I'll check whether the segmentation fault disappears when
> > > > >>>>>>>>>> using the basic bcast linear algorithm with the proper
> > > > >>>>>>>>>> command line you provided.
> > > >>>>>>>>>> 
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Eloi
> > > >>>>>>>>>> 
> > > >>>>>>>>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> > > >>>>>>>>>>> Hi Eloi:
> > > >>>>>>>>>>> To select the different bcast algorithms, you need to add
> > > >>>>>>>>>>> an extra mca parameter that tells the library to use
> > > >>>>>>>>>>> dynamic selection. --mca coll_tuned_use_dynamic_rules 1
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> One way to make sure you are typing this in correctly is to
> > > >>>>>>>>>>> use it with ompi_info.  Do the following:
> > > >>>>>>>>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> You should see lots of output with all the different
> > > >>>>>>>>>>> algorithms that can be selected for the various
> > > >>>>>>>>>>> collectives. Therefore, you need this:
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> --mca coll_tuned_use_dynamic_rules 1 --mca
> > > >>>>>>>>>>> coll_tuned_bcast_algorithm 1
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> Rolf
> > > >>>>>>>>>>> 
> > > >>>>>>>>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> > > >>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1"
> > > > >>>>>>>>>>>> makes it possible to switch to the basic linear
> > > > >>>>>>>>>>>> algorithm. Anyway, whatever the algorithm used, the
> > > > >>>>>>>>>>>> segmentation fault remains.
> > > > >>>>>>>>>>>> 
> > > > >>>>>>>>>>>> Could anyone give some advice on ways to diagnose the
> > > > >>>>>>>>>>>> issue I'm facing ?
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>> Eloi
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> > > >>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>> 
> > > >>>>>>>>>>>>> I'm focusing on the MPI_Bcast routine that seems to
> > > >>>>>>>>>>>>> randomly segfault when using the openib btl. I'd like to
> > > >>>>>>>>>>>>> know if there is any way to make OpenMPI switch to a
> > > >>>>>>>>>>>>> different algorithm than the default one being selected
> > > >>>>>>>>>>>>> for MPI_Bcast.
> > > >>>>>>>>>>>>> 
> > > >>>>>>>>>>>>> Thanks for your help,
> > > >>>>>>>>>>>>> Eloi
> > > >>>>>>>>>>>>> 
> > > >>>>>>>>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> > > >>>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>> 
> > > >>>>>>>>>>>>>> I'm observing a random segmentation fault during an
> > > >>>>>>>>>>>>>> internode parallel computation involving the openib btl
> > > >>>>>>>>>>>>>> and OpenMPI-1.4.2 (the same issue can be observed with
> > > >>>>>>>>>>>>>> OpenMPI-1.3.3).
> > > >>>>>>>>>>>>>> 
> > > >>>>>>>>>>>>>>    mpirun (Open MPI) 1.4.2
> > > >>>>>>>>>>>>>>    Report bugs to
> > > >>>>>>>>>>>>>>    http://www.open-mpi.org/community/help/
> > > >>>>>>>>>>>>>>    [pbn08:02624] *** Process received signal ***
> > > >>>>>>>>>>>>>>    [pbn08:02624] Signal: Segmentation fault (11)
> > > >>>>>>>>>>>>>>    [pbn08:02624] Signal code: Address not mapped (1)
> > > >>>>>>>>>>>>>>    [pbn08:02624] Failing at address: (nil)
> > > > >>>>>>>>>>>>>>    [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> > > > >>>>>>>>>>>>>>    [pbn08:02624] *** End of error message ***
> > > > >>>>>>>>>>>>>>    sh: line 1:  2624 Segmentation fault
> > > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>> \/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
> > > > >>>>>>>>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> > > > >>>>>>>>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> > > > >>>>>>>>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> > > > >>>>>>>>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL'
> > > >>>>>>>>>>>>>> '--t_max=0.1' '--parallel=domain'
> > > >>>>>>>>>>>>>> 
> > > >>>>>>>>>>>>>> If I choose not to use the openib btl (by using --mca
> > > >>>>>>>>>>>>>> btl self,sm,tcp on the command line, for instance), I
> > > >>>>>>>>>>>>>> don't encounter any problem and the parallel
> > > >>>>>>>>>>>>>> computation runs flawlessly.
> > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>> I would like to get some help in order to:
> > > > >>>>>>>>>>>>>> - diagnose the issue I'm facing with the openib btl
> > > > >>>>>>>>>>>>>> - understand why this issue is observed only when using
> > > > >>>>>>>>>>>>>> the openib btl and not when using self,sm,tcp
> > > >>>>>>>>>>>>>> 
> > > >>>>>>>>>>>>>> Any help would be very much appreciated.
> > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>> The outputs of ompi_info and the configure scripts of
> > > > >>>>>>>>>>>>>> OpenMPI are attached to this email, along with some
> > > > >>>>>>>>>>>>>> information on the infiniband drivers.
> > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>> Here is the command line used when launching a parallel
> > > > >>>>>>>>>>>>>> computation using infiniband:
> > > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile
> > > > >>>>>>>>>>>>>>    host.list --mca btl openib,sm,self,tcp --display-map
> > > > >>>>>>>>>>>>>>    --verbose --version --mca mpi_warn_on_fork 0 --mca
> > > > >>>>>>>>>>>>>>    btl_openib_want_fork_support 0 [...]
> > > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>> and the command line used if not using infiniband:
> > > > >>>>>>>>>>>>>> 
> > > > >>>>>>>>>>>>>>    path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile
> > > > >>>>>>>>>>>>>>    host.list --mca btl self,sm,tcp --display-map
> > > > >>>>>>>>>>>>>>    --verbose --version --mca mpi_warn_on_fork 0 --mca
> > > > >>>>>>>>>>>>>>    btl_openib_want_fork_support 0 [...]
> > > >>>>>>>>>>>>>> 
> > > >>>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>>> Eloi
> > > >>>>>>>>>>>> 
> > > >>>>>>>>>>>> _______________________________________________
> > > >>>>>>>>>>>> users mailing list
> > > >>>>>>>>>>>> us...@open-mpi.org
> > > >>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >>> 
> > > >>> --
> > > >>> 
> > > >>> 
> > > >>> Eloi Gaudry
> > > >>> 
> > > >>> Free Field Technologies
> > > >>> Company Website: http://www.fft.be
> > > >>> Company Phone:   +32 10 487 959
> > > >>> 
> > > >>> _______________________________________________
> > > >>> users mailing list
> > > >>> us...@open-mpi.org
> > > >>> http://www.open-mpi.org/mailman/listinfo.cgi/user
> > > > 
> > > > _______________________________________________
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959
