Hi list,

I had a similar problem last year with IMB, where the job would just hang on a PowerPC cluster, and Jeff Squyres gave me many pointers on which parameters to change to fix the problem. Now, on another cluster that I am building, the IMB job hangs in the same place, and the machines in the cluster also start swapping at the time of the hang. Following Jeff's earlier suggestions, I have tried the following MCA parameters:

btl_openib_flags=1
btl_openib_ib_timeout=20
mpool_base_verbose=1
mpool_base_use_mem_hooks=1
btl_openib_eager_limit=3072
#btl_openib_eager_limit=4096
btl_openib_max_send_size=12288
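For reference, the same settings can also be passed on the mpirun command line instead of through a parameter file. This is a minimal sketch, assuming the settings are kept in Open MPI's mca-params.conf style (`name=value`, `#` starts a comment). It turns them into the equivalent `mpirun --mca name value` arguments. The process count and the `IMB-MPI1` binary in the example launch line are placeholders, not my actual job.

```python
import shlex

# The settings from the list above, in mca-params.conf style ("name=value", "#" starts a comment).
PARAMS = """\
btl_openib_flags=1
btl_openib_ib_timeout=20
mpool_base_verbose=1
mpool_base_use_mem_hooks=1
btl_openib_eager_limit=3072
#btl_openib_eager_limit=4096
btl_openib_max_send_size=12288
"""


def mca_args(text):
    """Convert name=value lines into a flat list of mpirun '--mca name value' arguments."""
    args = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue  # skip blank and comment-only lines (e.g. the disabled 4096 setting)
        name, _, value = line.partition("=")
        args += ["--mca", name.strip(), value.strip()]
    return args


if __name__ == "__main__":
    # Hypothetical launch line: 16 processes of IMB-MPI1; adjust to the real job.
    cmd = ["mpirun", "-np", "16", *mca_args(PARAMS), "IMB-MPI1"]
    print(shlex.join(cmd))
```

The commented-out `btl_openib_eager_limit=4096` line is skipped, so only the active value of 3072 is passed.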

After setting these parameters, the machines still swapped, but a lot less than before, and the job got much further and ran to completion. Are there any further parameters that could be tweaked to stop these machines from swapping?

I am also seeing the same swapping issue with the HPCC benchmark. When it reaches MPIRandomAccess, all of the machines start swapping, we can no longer access them, and we have to reboot them.

OS: SLES 10
Kernel: 2.6.16.46-0.12-smp
OFED release: 1.3
openmpi: 1.2.5 and 1.2.6 using btl openib
Switch: TopSpin
SM: on TopSpin switch
Ulimit: set to unlimited, as suggested in the FAQ

One thing to note: both jobs run with no problems using TCP.


regards,
--

Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122 DDI: +44 (0)114 257 2240 Office: +44 (0)114 257 2200 Fax: +44 (0)114 257 0022 Email: a...@ocf.co.uk Web: http://www.ocf.co.uk

Support Phone:   +44 (0)845 702 3829
Support E-mail:  supp...@ocf.co.uk

Skype: arif_ali80 MSN: a...@ocf.co.uk

