Hi Pasha,

I appreciate the feedback. I'm assuming that this upgrade to the OpenFabrics
driver is something that the system administrator of the cluster should
handle, and not me?

Thanks,

Peter

Peter Diamessis wrote:


--- On Thu, 6/19/08, Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il> wrote:

    From: Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il>
    Subject: Re: [OMPI users] Open MPI timeout problems.
    To: pj...@cornell.edu, "Open MPI Users" <us...@open-mpi.org>
    Date: Thursday, June 19, 2008, 5:20 AM

    Usually the "retry exceeded" error points to some network issue on your
    cluster. I see from the logs that you still use MVAPI. If I remember
    correctly, MVAPI includes the IBADM application, which should be able to
    check and debug the network.
    BTW, I recommend that you update your MVAPI driver to the latest
    OpenFabrics driver.

    Peter Diamessis wrote:
    > Dear folks,
    >
    > I would appreciate your help on the following:
    >
    > I'm running a parallel CFD code on the Army Research Lab's MJM Linux
    > cluster, which uses Open-MPI. I've run the same code on other Linux
    > clusters that use MPICH2 and had never run into this problem.
    >
    > I'm quite convinced that the bottleneck for my code is this data
    > transposition routine, although I have not done any rigorous profiling
    > to check on it. This is where 90% of the parallel communication takes
    > place. I'm running a CFD code that uses a 3-D rectangular domain which
    > is partitioned across processors in such a way that each processor
    > stores vertical slabs that are contiguous in the x-direction but shared
    > across processors in the y-dir. When a 2-D Fast Fourier Transform
    > (FFT) needs to be done, data is transposed such that the vertical slabs
    > are now contiguous in the y-dir. in each processor.
    >
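
The slab transposition described above can be sketched in plain Python. This is only a toy model under stated assumptions: the z-direction is dropped, Nx is assumed to be divisible by P, and `alltoall` is a hypothetical stand-in that simulates the exchange inside one process rather than calling MPI. It is not the CFD code's actual routine.

```python
def alltoall(send):
    """Simulate an all-to-all exchange: send[src][dst] arrives as recv[dst][src]."""
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]


def transpose_slabs(slabs):
    """slabs[r] holds the y-rows owned by rank r; each row spans all of x.

    Returns new slabs where rank r owns a block of x-columns,
    each column spanning all of y.
    """
    p = len(slabs)
    nx = len(slabs[0][0])
    bx = nx // p  # x-columns per rank (assumes nx is divisible by p)

    # Each rank cuts its rows into p blocks along x, one block per destination.
    send = [
        [[row[dst * bx:(dst + 1) * bx] for row in slabs[src]] for dst in range(p)]
        for src in range(p)
    ]
    recv = alltoall(send)

    out = []
    for dst in range(p):
        # Stack the received blocks in source order to recover the full y extent,
        rows = [row for src in range(p) for row in recv[dst][src]]
        # then transpose locally so each x-column is contiguous in y.
        out.append([list(col) for col in zip(*rows)])
    return out


# Toy example: a 4 x 4 grid with value 10*y + x, split over 2 ranks by y.
grid = [[10 * y + x for x in range(4)] for y in range(4)]
slabs = [grid[0:2], grid[2:4]]
print(transpose_slabs(slabs))
```

Each rank sends one block to every other rank and receives one block from each, which is the communication pattern the message counts in this thread refer to.
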
    > The code would normally be run for about 10,000 timesteps. In the
    > specific case which blocks, the job crashes after ~200 timesteps and at
    > each timestep a large number of 2-D FFTs are performed. For a domain
    > with resolution of Nx * Ny * Nz points and P processors, during one FFT,
    > each processor performs P Sends and P Receives of a message of size
    > (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives.
    >
    > I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175. You
    > can see that each FFT involves 2048 communications. I totally rewrote my
    > data transposition routine to no longer use specific blocking/non-
    > blocking Sends/Receives but to use MPI_ALLTOALL which I would hope is
    > optimized for the specific MPI Implementation to do data transpositions.
    > Unfortunately, my code still crashes with time-out problems like before.
    >
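
The message count quoted above (2*P^2, so 2048 for P=32) and the growth expected for the larger run can be checked with a few lines of Python. This is a rough sketch using only the numbers given in this message; the helper names are made up for illustration.

```python
def messages_per_fft(p):
    """P sends plus P receives on each of P processors."""
    return 2 * p * p


def grid_points(nx, ny, nz):
    """Total number of grid points in the domain."""
    return nx * ny * nz


def points_per_process(nx, ny, nz, p):
    """Grid points held by each processor when the domain is split evenly."""
    return grid_points(nx, ny, nz) // p


current = (256, 128, 175)
future = (512, 256, 533)

print("messages per FFT, P=32:", messages_per_fft(32))
print("points per process, P=32:", points_per_process(*current, 32))
print("grid growth factor:", grid_points(*future) / grid_points(*current))
```

The growth factor in total grid points comes out at roughly 12, which matches the "order of magnitude" estimate for the planned resolution.
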
    > This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL code
    > worked fine on a smaller cluster here. Note that in the future I would
    > like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and P=128 or
    > 256 procs. which will involve an order of magnitude more communication.
    >
    > Note that I ran the job by submitting it to an LSF queue system. I've
    > attached the script file used for that. I basically enter bsub -x <
    > script_openmpi at the command line.
    >
    > When I communicated with a consultant at ARL, he recommended I use
    > 3 specific script files which I've attached. I believe these enable
    > control over some of the MCA parameters. I've experimented with values
    > of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
    > problem. I am still in contact with this consultant but thought it would
    > be good to contact you folks directly.
    >
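
For a sense of scale, the timeout values tried above can be converted to wall-clock time. The sketch below assumes that btl_mvapi_ib_timeout uses the InfiniBand Local ACK Timeout encoding, 4.096 microseconds * 2^value, which is how the Open MPI FAQ describes the corresponding openib parameter. That the mvapi BTL follows the same encoding is an assumption here, not something confirmed in this thread.

```python
def ack_timeout_seconds(value):
    """InfiniBand Local ACK Timeout encoding: 4.096 us * 2**value.

    Assumed (not confirmed here) to apply to btl_mvapi_ib_timeout as well.
    """
    return 4.096e-6 * 2 ** value


for value in (14, 18, 20, 24, 30):
    print(f"btl_mvapi_ib_timeout={value}: {ack_timeout_seconds(value):.3f} s")
```

If that encoding holds, each step of 1 in the parameter doubles the timeout, so the range 14 to 30 spans from a fraction of a second to over an hour per retry.
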
    > Note:
    > a) echo $PATH returns:
    >
    > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
    > /opt/compiler/pgi/linux86-64/6.2/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
    > ia32e/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
    > ia32e/etc:/usr/cta/modules/3.1.6/bin:
    > /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
    > .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
    >
    > b) echo $LD_LIBRARY_PATH returns:
    > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
    > /opt/compiler/pgi/linux86-64/6.2/lib:
    > /opt/compiler/pgi/linux86-64/6.2/libso:/usr/lsf/6.2/linux2.6-glibc2.3-
    > ia32e/lib
    >
    > I've attached the following files:
    > 1) Gzipped versions of the .out & .err files of the failed job.
    > 2) ompi_info.log: The output of ompi_info -all
    > 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files provided
    > to me by the ARL consultant. I store these in my home directory and
    > experimented with the MCA parameter btl_mvapi_ib_timeout in mpirun.
    > 4) The script file script_openmpi that I use to submit the job.
    >
    > I am unable to provide you with the config.log file as I cannot find it
    > in the top level Open MPI directory.
    >
    > I am also unable to provide you with details on the specific cluster
    > that I'm running in terms of the network. I know they use InfiniBand and
    > some more detail may be found on:
    >
    > http://www.arl.hpc.mil/Systems/mjm.html
    >
    > Some other info:
    > a) uname -a returns:
    > Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
    > x86_64 x86_64 x86_64 GNU/Linux
    >
    > b) ulimit -l returns: unlimited
    >
    > I cannot see a pattern as to which nodes are bad and which are good ...
    >
    >
    > Note that I found in the mail archives that someone had a similar
    > problem in transposing a matrix with 16 million elements. The only
    > answer I found in the thread was to increase the value of
    > btl_mvapi_ib_timeout to 14 or 16, something I've done already.
    >
    > I'm hoping that there must be a way out of this problem. I need to
    > get my code running as I'm under pressure to produce results for a
    > grant that's paying me.
    >
    > If you have any feedback I would be hugely grateful.
    >
    > Sincerely,
    >
    > Peter Diamessis
    > Cornell University
    >
    >
    > ------------------------------------------------------------------------
    >
    > _______________________________________________
    > users mailing list
    > us...@open-mpi.org
    > http://www.open-mpi.org/mailman/listinfo.cgi/users



--

-------------------------------------------------------------
Peter Diamessis
Assistant Professor
Environmental Fluid Mechanics & Hydrology
School of Civil and Environmental Engineering
Cornell University
Ithaca, NY 14853
Phone: (607)-255-1719 --- Fax: (607)-255-9004
pj...@cornell.edu <mailto:pj...@cornell.edu>
http://www.cee.cornell.edu/faculty/pjd38
