Hi Eloi:
To select among the different bcast algorithms, you need to add an extra MCA parameter that tells the library to use dynamic selection.
--mca coll_tuned_use_dynamic_rules 1

One way to make sure you are typing this in correctly is to use it with ompi_info. Do the following:
ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll

You should see lots of output with all the different algorithms that can be selected for the various collectives.
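If the output is too long to read through, you can filter it down to the bcast entries; a quick sketch, assuming a Bourne-style shell and that the parameter names match your build:

ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll all | grep bcast

The line for coll_tuned_bcast_algorithm lists the valid values (1 being the basic linear algorithm you mentioned).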
Therefore, you need both parameters:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1
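If it is easier to wire into your job scripts, the same parameters can also be set as environment variables instead of command-line flags (any MCA parameter can be given through an OMPI_MCA_-prefixed variable); a sketch, assuming a Bourne-style shell and reusing the placeholders from your command line below:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_bcast_algorithm=1
path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp [...]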

Rolf

On 07/13/10 11:28, Eloi Gaudry wrote:
Hi,

I've found that "--mca coll_tuned_bcast_algorithm 1" allowed me to switch to the basic linear algorithm.
Anyway, whatever algorithm is used, the segmentation fault remains.

Could anyone give some advice on ways to diagnose the issue I'm facing?

Regards,
Eloi


On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
Hi,

I'm focusing on the MPI_Bcast routine, which seems to segfault randomly when using the openib btl. I'd like to know if there is any way to make Open MPI switch to a different algorithm than the default one selected for MPI_Bcast.

Thanks for your help,
Eloi

On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:
Hi,

I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3).

   mpirun (Open MPI) 1.4.2
   Report bugs to http://www.open-mpi.org/community/help/
   [pbn08:02624] *** Process received signal ***
   [pbn08:02624] Signal: Segmentation fault (11)
   [pbn08:02624] Signal code: Address not mapped (1)
   [pbn08:02624] Failing at address: (nil)
   [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
   [pbn08:02624] *** End of error message ***
   sh: line 1:  2624 Segmentation fault

/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/bin/actranpy_mp
'--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
'--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
'--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
'--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'

If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
the command line, for instance), I don't encounter any problem and the
parallel computation runs flawlessly.

I would like to get some help to be able:
- to diagnose the issue I'm facing with the openib btl (see the sketch below)
- to understand why this issue is observed only when using the openib btl
and not when using self,sm,tcp
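(One generic thing I can try to get more information out of the crash, assuming core dumps can be enabled on the compute nodes and gdb is available there, would be something along these lines:

   ulimit -c unlimited    # allow core files; may need to be set on each node, e.g. in the shell startup files
   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp [...]
   gdb <path_to_actranpy_mp> <corefile>    # then 'bt full' to see where the segfault occurs

but I'd welcome any more targeted suggestions.)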

Any help would be very much appreciated.

The outputs of ompi_info and of the OpenMPI configure script are attached to this email, along with some information on the InfiniBand drivers.

Here is the command line used when launching a parallel computation using infiniband:

   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl openib,sm,self,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

and the command line used if not using infiniband:

   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca btl self,sm,tcp --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

Thanks,
Eloi
