Arif --

Sorry for the delay in replying.

Believe it or not, almost this exact issue just came up with the IBM Benchmark Center; they were using Open MPI with MPIRandomAccess and running out of memory. We didn't get to run a full set of experiments, but oddly, the problem seemed to happen most often with the Intel compilers (preliminary tests showed that we couldn't replicate it with gcc at the same problem size).

However, the IBM Benchmark Center engineers were able to get successful runs by setting the btl_openib_free_list_max MCA parameter. This parameter essentially limits how much space the lowest-level IB driver in OMPI uses for fragment lists (exactly what it does and how it helps in this situation is fairly complex -- insert "waving hands" image here...). It defaults to "infinite". Setting it to a finite value can allow MPIRandomAccess to complete; I believe that the IBC engineers used values of 2000 and 4000 on their systems.
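
As a rough sketch (assuming you set MCA parameters through a parameter file, as the list in your mail suggests), capping the free list would look something like the following. The value is just one of the two the IBC engineers reported, not something tuned for your cluster:

```
# Hypothetical addition to the existing MCA parameter settings.
# 2000 is one of the values reported by the IBC engineers (the other was 4000);
# treat it as a starting point, not a tuned recommendation.
btl_openib_free_list_max=2000
```

The same parameter can also be passed per run on the mpirun command line with "--mca btl_openib_free_list_max 2000", which is handy for trying a few values.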




On Apr 22, 2008, at 12:10 PM, Arif Ali wrote:

Hi list,

I had a similar problem last year with IMB, when the job would just
hang on a PowerPC cluster; Jeff Squyres gave me many pointers on
parameters to change to fix the problem. Now, on another cluster that I
am building, the IMB job hangs in the same place, and the machines in
the cluster start swapping at the time of the hang. Following
Jeff's suggestions, I have tried the following MCA parameters:

btl_openib_flags=1
btl_openib_ib_timeout=20
mpool_base_verbose=1
mpool_base_use_mem_hooks=1
btl_openib_eager_limit=3072
#btl_openib_eager_limit=4096
btl_openib_max_send_size=12288

After setting these parameters, the machines still swapped, but a lot
less than before, and the job got a lot further and ran to completion.
Are there any further suggestions for parameters that can be tweaked to
stop these machines from swapping?

I am also having the same swapping issue when running the HPCC benchmark: when it reaches MPIRandomAccess, all machines swap until we can no longer access them, and we have to reboot them.

OS: SLES 10
Kernel: 2.6.16.46-0.12-smp
OFED release: 1.3
openmpi: 1.2.5 and 1.2.6 using btl openib
Switch: TopSpin
SM: on TopSpin switch
Ulimit has been set to unlimited as suggested in the FAQ

One thing to note: both jobs run with no problems using TCP.


regards,
--

Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
DDI:    +44 (0)114 257 2240
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  a...@ocf.co.uk
Web:    http://www.ocf.co.uk

Support Phone:   +44 (0)845 702 3829
Support E-mail:  supp...@ocf.co.uk

Skype:  arif_ali80
MSN:    a...@ocf.co.uk


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
