Hi Jeff,

I really appreciate the insight. I will pass your thoughts on to our system admins. Hopefully, they can begin exploring the installation of a more sophisticated stack.

Sincerely,

Peter


Jeff Squyres wrote:
To clarify what Pasha said: AFAIK, all IB vendors have deprecated the use of their mVAPI-based driver stacks in HPC environments (I know that Cisco and Mellanox have; I'm not 100% sure about others). We all encourage upgrading to the OFED stack (currently at v1.3.1) if possible; it's much newer, more modern, and is where all development work is occurring these days. Indeed, OMPI is dropping support for the older mVAPI-based driver stacks in our upcoming v1.3 release.

Upgrading to a whole new driver stack is not something that can be undertaken lightly, though -- it will likely take time for the sysadmins to evaluate, learn, etc.


On Jun 19, 2008, at 5:38 PM, Pavel Shamis (Pasha) wrote:


> I appreciate the feedback. I'm assuming that this upgrade to the OpenFabrics driver is something that the system admin of the cluster should be concerned with, and not I?
Driver upgrade will require root permissions.
Thanks,
Pasha


> Thanks,
>
> Peter

Peter Diamessis wrote:


--- On Thu, 6/19/08, Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il> wrote:

   From: Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il>
   Subject: Re: [OMPI users] Open MPI timeout problems.
   To: pj...@cornell.edu, "Open MPI Users" <us...@open-mpi.org>
   Date: Thursday, June 19, 2008, 5:20 AM

Usually the retry-exceeded error points to some network issue on your cluster. I see from the logs that you are still using MVAPI. If I remember correctly, MVAPI includes the IBADM application, which should be able to check and debug the network. BTW, I recommend you update your MVAPI driver to the latest OpenFabrics driver.

   Peter Diamessis wrote:
   > Dear folks,
   >
   > I would appreciate your help on the following:
   >
    > I'm running a parallel CFD code on the Army Research Lab's MJM Linux
    > cluster, which uses Open MPI. I've run the same code on other Linux
   > clusters that use MPICH2 and had never run into this problem.
   >
    > I'm quite convinced that the bottleneck for my code is this data
    > transposition routine, although I have not done any rigorous profiling
    > to check on it. This is where 90% of the parallel communication takes
    > place. I'm running a CFD code that uses a 3-D rectangular domain which
    > is partitioned across processors in such a way that each processor
    > stores vertical slabs that are contiguous in the x-direction but shared
    > across processors in the y-dir. When a 2-D Fast Fourier Transform
    > (FFT) needs to be done, data is transposed such that the vertical slabs
    > are now contiguous in the y-dir. in each processor.
    >
    > The code would normally be run for about 10,000 timesteps. In the
    > specific case which blocks, the job crashes after ~200 timesteps and at
    > each timestep a large number of 2-D FFTs are performed. For a domain
    > with resolution of Nx * Ny * Nz points and P processors, during one FFT,
    > each processor performs P Sends and P Receives of a message of size
    > (Nx*Ny*Nz)/P, i.e. there are a total of 2*P^2 such Sends/Receives.
    >
    > I've focused on a case using P=32 procs with Nx=256, Ny=128, Nz=175. You
    > can see that each FFT involves 2048 communications. I totally rewrote my
    > data transposition routine to no longer use specific blocking/non-blocking
    > Sends/Receives but to use MPI_ALLTOALL which I would hope is
    > optimized for the specific MPI Implementation to do data transpositions.
    > Unfortunately, my code still crashes with time-out problems like before.
   >
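Below is a minimal, hypothetical C sketch of the slab transpose described above (it is not the author's actual routine, which is not shown in this thread): each rank exchanges one contiguous block with every other rank through a single MPI_Alltoall. It assumes the number of ranks P divides both Nx and Ny, uses the grid sizes quoted above, and leaves the pack/unpack steps as comments; the buffer names are illustrative only.

    /* Hypothetical slab-transpose sketch, not the code from this thread. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int Nx = 256, Ny = 128, Nz = 175;   /* sizes quoted above */
        int P, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &P);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Block exchanged between any pair of ranks when going from
         * x-contiguous slabs (Ny/P planes each) to y-contiguous slabs
         * (Nx/P planes each).  Assumes P divides Nx and Ny.           */
        int block = (Nx / P) * (Ny / P) * Nz;

        double *sendbuf = malloc((size_t)P * block * sizeof(double));
        double *recvbuf = malloc((size_t)P * block * sizeof(double));

        /* ... pack sendbuf: elements [r*block, (r+1)*block) go to rank r ... */

        MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                     recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

        /* ... unpack recvbuf into the y-contiguous layout, then run the FFTs ... */

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }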
    > This happens for P=4, 8, 16 & 32 processors. The same MPI_ALLTOALL code
    > worked fine on a smaller cluster here. Note that in the future I would
    > like to work with resolutions of (Nx,Ny,Nz)=(512,256,533) and P=128 or
    > 256 procs. which will involve an order of magnitude more communication.
   >
    > Note that I ran the job by submitting it to an LSF queue system. I've
    > attached the script file used for that. I basically enter bsub -x <
    > script_openmpi at the command line.
    >
    > When I communicated with a consultant at ARL, he recommended I use
    > 3 specific script files which I've attached. I believe these enable
    > control over some of the MCA parameters. I've experimented with values
    > of btl_mvapi_ib_timeout = 14, 18, 20, 24 and 30 and I still have this
    > problem. I am still in contact with this consultant but thought it would
   > be good to contact you folks directly.
   >
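For reference, a sketch of how an MCA parameter such as btl_mvapi_ib_timeout is typically passed to Open MPI, either on the mpirun command line or through the environment. The value 20 is one of those tried above; the executable name and process count are placeholders, since the actual ARL wrapper scripts are only attached to the original mail, not shown here.

    mpirun --mca btl_mvapi_ib_timeout 20 -np 32 ./cfd_solver

    # or, equivalently, set in the job environment before launch
    # (provided the environment is propagated to the MPI processes):
    export OMPI_MCA_btl_mvapi_ib_timeout=20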
   > Note:
    > a) echo $PATH returns:
    >
    > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/bin:
    > /opt/compiler/pgi/linux86-64/6.2/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
    > ia32e/bin:/usr/lsf/6.2/linux2.6-glibc2.3-
    > ia32e/etc:/usr/cta/modules/3.1.6/bin:
    > /usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:
    > .:/usr/lib/java/bin:/opt/gm/bin:/opt/mx/bin:/opt/PST/bin
   >
   > b) echo $LD_LIBRARY_PATH returns:
   > /opt/mpi/x86_64/pgi/6.2/openmpi-1.2/lib:
   > /opt/compiler/pgi/linux86-64/6.2/lib:
    > /opt/compiler/pgi/linux86-64/6.2/libso:/usr/lsf/6.2/linux2.6-glibc2.3-
   > ia32e/lib
   >
   > I've attached the following files:
   > 1) Gzipped versions of the .out & .err files of the failed job.
   > 2) ompi_info.log: The output of ompi_info -all
    > 3) mpirun, mpirun.lsf, openmpi_wrapper: the three script files provided
    > to me by the ARL consultant. I store these in my home directory and
    > experimented with the MCA parameter btl_mvapi_ib_timeout in mpirun.
   > 4) The script file script_openmpi that I use to submit the job.
   >
    > I am unable to provide you with the config.log file as I cannot find it
   > in the top level Open MPI directory.
   >
    > I am also unable to provide you with details on the specific cluster that
    > I'm running on, in terms of the network. I know they use Infiniband and
    > some more detail may be found on:
   >
   > http://www.arl.hpc.mil/Systems/mjm.html
   >
   > Some other info:
    > a) uname -a returns:
    > Linux l1 2.6.5-7.308-smp.arl-msrc #2 SMP Thu Jan 10 09:18:41 EST 2008
   > x86_64 x86_64 x86_64 GNU/Linux
   >
   > b) ulimit -l returns: unlimited
   >
    > I cannot see a pattern as to which nodes are bad and which are good ...
   >
   >
   > Note that I found in the mail archives that someone had a similar
    > problem in transposing a matrix with 16 million elements. The only
   > answer I found in the thread was to increase the value of
   > btl_mvapi_ib_timeout to 14 or 16, something I've done already.
   >
    > I'm hoping that there must be a way out of this problem. I need to
    > get my code running as I'm under pressure to produce results for a
   > grant that's paying me.
   >
   > If you have any feedback I would be hugely grateful.
   >
   > Sincerely,
   >
   > Peter Diamessis
   > Cornell University
   >
   >




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

-------------------------------------------------------------
Peter Diamessis
Assistant Professor
Environmental Fluid Mechanics & Hydrology
School of Civil and Environmental Engineering
Cornell University
Ithaca, NY 14853
Phone: (607)-255-1719 --- Fax: (607)-255-9004
pj...@cornell.edu
http://www.cee.cornell.edu/faculty/pjd38
