Arif --
Sorry for the delay in replying.
Believe it or not, almost this exact issue just came up with the IBM
Benchmark Center; they were using Open MPI with MPIRandomAccess and
running out of memory. We didn't get a full set of data or run a full
set of experiments, but oddly, the problem seemed to happen most often
with the Intel compilers (preliminary tests showed that we couldn't
replicate the problem with the gcc compiler at the same problem size).
However, the IBM Benchmark Center engineers were able to get
successful runs by setting the btl_openib_free_list_max MCA
parameter. This parameter essentially limits how much space the
lowest-level IB driver in OMPI uses for fragment lists (exactly what
it does and how it helps in this situation is fairly complex --
insert "waving hands" image here...). The parameter defaults to
"infinite". Setting it to a finite value can allow MPIRandomAccess to
complete; I believe that the IBC engineers used values of 2000 and
4000 for their systems.
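For illustration, here is a minimal sketch of how such a limit could be set from the shell. The value 2000 is just one of the figures mentioned above, not a tuned recommendation; the process count and the ./hpcc binary name in the commented mpirun line are placeholders and do not come from this thread.

```shell
# Open MPI reads MCA parameters from environment variables named
# OMPI_MCA_<parameter>. 2000 is an assumed starting value.
export OMPI_MCA_btl_openib_free_list_max=2000

# Equivalent setting on the mpirun command line (placeholder process
# count and binary name):
#   mpirun --mca btl_openib_free_list_max 2000 -np 16 ./hpcc
```

The same setting can also go into an MCA parameter file as a line of the form btl_openib_free_list_max=2000, alongside the other parameters quoted below.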
On Apr 22, 2008, at 12:10 PM, Arif Ali wrote:
Hi list,
I had a similar problem last year with IMB, when the job would just
hang on a PowerPC cluster; Jeff Squyres gave me many pointers on
parameters to change to fix the problem. Now, on another cluster that
I am building, the IMB job hangs in the same place, and the machines
in the cluster start swapping at the time of the hang. Following what
Jeff suggested, I have tried the following MCA parameters:
btl_openib_flags=1
btl_openib_ib_timeout=20
mpool_base_verbose=1
mpool_base_use_mem_hooks=1
btl_openib_eager_limit=3072
#btl_openib_eager_limit=4096
btl_openib_max_send_size=12288
After setting these parameters, the machines still swapped, but a lot
less than before; the job got a lot further and ran to completion. Are
there any further suggestions on parameters that can be tweaked to
stop these machines from swapping?
I am also seeing the same swapping issue when running the HPCC
benchmark: when it reaches MPIRandomAccess, all machines swap until we
can no longer access them, and we have to reboot them.
OS: SLES 10
Kernel: 2.6.16.46-0.12-smp
OFED release: 1.3
openmpi: 1.2.5 and 1.2.6 using btl openib
Switch: TopSpin
SM: on TopSpin switch
Ulimit has been set to unlimited as suggested in the FAQ
One thing to note: both jobs run with no problems using TCP.
regards,
--
Arif Ali
Software Engineer
OCF plc
--
Jeff Squyres
Cisco Systems