Hi Edgar,

The only difference I could observe was that the segmentation fault
sometimes appeared later during the parallel computation.

I'm running out of ideas here. I wish I could use "--mca coll tuned"
together with "--mca btl self,sm,tcp", so that I could check that the
issue is not somehow limited to the tuned collective routines.
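If the component lists work the way I think they do, something like
this might express it (just a guess on my side: I'm not sure the tuned
component can be selected without basic/self listed as fallbacks for
the collectives it doesn't implement):

path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list \
    --mca coll tuned,basic,self --mca btl self,sm,tcp [...]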
Thanks,
Eloi

On Thursday 15 July 2010 17:24:24 Edgar Gabriel wrote:
> On 7/15/2010 10:18 AM, Eloi Gaudry wrote:
> > hi Edgar,
> >
> > thanks for the tips, I'm going to try this option as well. the
> > segmentation fault I'm observing indeed always happens during a
> > collective communication... it basically switches all collective
> > communications to basic mode, right?
> >
> > sorry for my ignorance, but what's an NCA?
>
> sorry, I meant to type HCA (InfiniBand networking card)
>
> Thanks
> Edgar
>
> > thanks,
> > éloi
> >
> > On Thursday 15 July 2010 16:20:54 Edgar Gabriel wrote:
> >> you could try first to use the algorithms in the basic module, e.g.
> >>
> >> mpirun -np x --mca coll basic ./mytest
> >>
> >> and see whether this makes a difference. I used to sometimes observe
> >> a (similar?) problem in the openib btl triggered from the tuned
> >> collective component, in cases where the ofed libraries were
> >> installed but no NCA was found on a node. It used to work, however,
> >> with the basic component.
> >>
> >> Thanks
> >> Edgar
> >>
> >> On 7/15/2010 3:08 AM, Eloi Gaudry wrote:
> >>> hi Rolf,
> >>>
> >>> unfortunately, I couldn't get rid of that annoying segmentation
> >>> fault when selecting another bcast algorithm. I'm now going to
> >>> replace MPI_Bcast with a naive implementation (using MPI_Send and
> >>> MPI_Recv) and see if that helps.
> >>>
> >>> regards,
> >>> éloi
> >>>
> >>> On Wednesday 14 July 2010 10:59:53 Eloi Gaudry wrote:
> >>>> Hi Rolf,
> >>>>
> >>>> thanks for your input. You're right, I missed the
> >>>> coll_tuned_use_dynamic_rules option.
> >>>>
> >>>> I'll check whether the segmentation fault disappears when using
> >>>> the basic linear bcast algorithm, with the proper command line
> >>>> you provided.
> >>>>
> >>>> Regards,
> >>>> Eloi
> >>>>
> >>>> On Tuesday 13 July 2010 20:39:59 Rolf vandeVaart wrote:
> >>>>> Hi Eloi:
> >>>>> To select the different bcast algorithms, you need to add an
> >>>>> extra mca parameter that tells the library to use dynamic
> >>>>> selection.
> >>>>> --mca coll_tuned_use_dynamic_rules 1
> >>>>>
> >>>>> One way to make sure you are typing this in correctly is to use
> >>>>> it with ompi_info. Do the following:
> >>>>> ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll
> >>>>>
> >>>>> You should see lots of output with all the different algorithms
> >>>>> that can be selected for the various collectives.
> >>>>> Therefore, you need this:
> >>>>>
> >>>>> --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
> >>>>>
> >>>>> Rolf
> >>>>>
> >>>>> On 07/13/10 11:28, Eloi Gaudry wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me
> >>>>>> to switch to the basic linear algorithm. Anyway, whatever the
> >>>>>> algorithm used, the segmentation fault remains.
> >>>>>>
> >>>>>> Could anyone give some advice on ways to diagnose the issue I'm
> >>>>>> facing?
> >>>>>>
> >>>>>> Regards,
> >>>>>> Eloi
> >>>>>>
> >>>>>> On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I'm focusing on the MPI_Bcast routine that seems to randomly
> >>>>>>> segfault when using the openib btl. I'd like to know if there
> >>>>>>> is any way to make OpenMPI switch to a different algorithm
> >>>>>>> than the default one being selected for MPI_Bcast.
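For what it's worth, the naive replacement I mention above would look
roughly like this; a minimal, untested sketch (naive_bcast and the tag
value are just placeholders of mine, nothing from Open MPI):

#include <mpi.h>

/* naive linear broadcast: the root sends to every other rank in turn,
 * so the tuned collective component is bypassed entirely */
static int naive_bcast(void *buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
{
    int rank, size, peer, err;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (peer = 0; peer < size; peer++) {
            if (peer == root)
                continue;
            err = MPI_Send(buf, count, type, peer,
                           4321 /* arbitrary tag */, comm);
            if (err != MPI_SUCCESS)
                return err;
        }
    } else {
        err = MPI_Recv(buf, count, type, root,
                       4321 /* same tag */, comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS)
            return err;
    }
    return MPI_SUCCESS;
}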
> >>>>>>>
> >>>>>>> Thanks for your help,
> >>>>>>> Eloi
> >>>>>>>
> >>>>>>> On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I'm observing a random segmentation fault during an internode
> >>>>>>>> parallel computation involving the openib btl and
> >>>>>>>> OpenMPI-1.4.2 (the same issue can be observed with
> >>>>>>>> OpenMPI-1.3.3).
> >>>>>>>>
> >>>>>>>> mpirun (Open MPI) 1.4.2
> >>>>>>>> Report bugs to http://www.open-mpi.org/community/help/
> >>>>>>>> [pbn08:02624] *** Process received signal ***
> >>>>>>>> [pbn08:02624] Signal: Segmentation fault (11)
> >>>>>>>> [pbn08:02624] Signal code: Address not mapped (1)
> >>>>>>>> [pbn08:02624] Failing at address: (nil)
> >>>>>>>> [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
> >>>>>>>> [pbn08:02624] *** End of error message ***
> >>>>>>>> sh: line 1:  2624 Segmentation fault
> >>>>>>>> /share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/bin/actranpy_mp
> >>>>>>>> '--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
> >>>>>>>> '--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
> >>>>>>>> '--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch'
> >>>>>>>> '--mem=3200' '--threads=1' '--errorlevel=FATAL' '--t_max=0.1'
> >>>>>>>> '--parallel=domain'
> >>>>>>>>
> >>>>>>>> If I choose not to use the openib btl (by using --mca btl
> >>>>>>>> self,sm,tcp on the command line, for instance), I don't
> >>>>>>>> encounter any problem and the parallel computation runs
> >>>>>>>> flawlessly.
> >>>>>>>>
> >>>>>>>> I would like to get some help to be able:
> >>>>>>>> - to diagnose the issue I'm facing with the openib btl
> >>>>>>>> - to understand why this issue is observed only when using
> >>>>>>>>   the openib btl and not when using self,sm,tcp
> >>>>>>>>
> >>>>>>>> Any help would be very much appreciated.
> >>>>>>>>
> >>>>>>>> The outputs of ompi_info and the configure scripts of OpenMPI
> >>>>>>>> are attached to this email, along with some information on
> >>>>>>>> the InfiniBand drivers.
> >>>>>>>>
> >>>>>>>> Here is the command line used when launching a parallel
> >>>>>>>> computation using infiniband:
> >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>>>> --mca btl openib,sm,self,tcp --display-map --verbose
> >>>>>>>> --version --mca mpi_warn_on_fork 0
> >>>>>>>> --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>
> >>>>>>>> and the command line used if not using infiniband:
> >>>>>>>> path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list
> >>>>>>>> --mca btl self,sm,tcp --display-map --verbose --version
> >>>>>>>> --mca mpi_warn_on_fork 0
> >>>>>>>> --mca btl_openib_want_fork_support 0 [...]
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Eloi
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> users mailing list
> >>>>>> us...@open-mpi.org
> >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959