Hi List,

1. We have
HW
* 2xBladecenter H
* 2xCisco Infiniband Switch Modules
* 1xCisco Infiniband Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA

SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of the validation process for the customer.

I first tried the OpenMPI that comes with OFED 1.1, which gave spurious "Not Enough Memory" error messages. After looking through the FAQs (with the help of Cisco) I was able to find the problems and fixes: I set unlimited soft and hard limits for memlock, and turned RDMA off by using "--mca btl_openib_flags 1". This still did not work, and I still got the memory problems.
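For reference, this is roughly what was set (the limits syntax is from the FAQ; "-np 64" is just an example for 16 blades x 4 cores):

# /etc/security/limits.conf, on every node
* soft memlock unlimited
* hard memlock unlimited

# run with RDMA disabled (send/receive only)
mpirun --mca btl_openib_flags 1 -np 64 ./IMB-MPI1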

I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed miserably.

I then tried the released OpenMPI-1.2b3, which got me further than before. Now the benchmark goes through all the tests until Allgatherv finishes, and then seems to hang waiting to start Alltoall; I waited about 12 hours to see if it would continue. I have since managed to run Alltoall, and the rest of the benchmark, separately.
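(In case it helps anyone reproduce this: IMB accepts individual benchmark names on the command line, which is how I ran the remaining tests separately, along the lines of

mpirun -np 64 ./IMB-MPI1 Alltoall Allreduce Bcast Barrier

where the benchmark list and the -np value here are just examples.)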

I have tried a few tunable parameters that were suggested by Cisco; they improved the results, but the benchmark still hung. The parameters I used to try to diagnose the problem are below (an example of how they are loaded follows the list). I used the debug/verbose variables to see if I could get error messages while the benchmark was running.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1

2. On another side note, I am having similar problems on another customer's cluster, where the benchmark hangs but at a different place each time.

HW
* 12x IBM 3455 machines, each with 2 dual-core CPUs and an InfiniPath/PathScale HCA
* 1x Voltaire switch

SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2

a) In this case, I also got warnings, which I was able to turn off by setting btl_openib_warn_no_hca_params_found to 0, but I wasn't sure if this was the right thing to do:
--------------------------------------------------------------------------
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:

   Hostname:           node004
   HCA vendor ID:      0x1fc1
   HCA vendor part ID: 13

Default HCA parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.

NOTE: You can turn off this warning by setting the MCA parameter
     btl_openib_warn_no_hca_params_found to 0.
--------------------------------------------------------------------------
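(If silencing the warning is the wrong approach, I assume the alternative is to add a section for this HCA to the file named by the btl_openib_hca_param_files parameter; something along these lines, where the section name and the tuning values are placeholders I have not tested:

[QLogic InfiniPath]
vendor_id = 0x1fc1
vendor_part_id = 13
use_eager_rdma = 1
mtu = 2048

Corrections welcome if the format is different.)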
b) The runs on this machine would also hang, so I removed all the unnecessary daemons to see if that would improve things. After that, about 75% of the time it would run longer, until Alltoall, where it would hang; otherwise it would hang at various other places. At times I also got errors regarding the retry count and timeout, which I increased to 14 and 20 respectively (see the note after the list). I tried steps similar to those on the PPC cluster to fix the freezing, but had no luck. Below are all the parameters that I have used in this case:


#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_warn_no_hca_params_found=0
btl_openib_flags=1
#mpi_preconnect_all=1
mpi_leave_pinned=1
btl_openib_ib_retry_count=14
btl_openib_ib_timeout=20
mpool_base_use_mem_hooks=1
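A note on those last two values: as I understand it, btl_openib_ib_timeout is an exponent, not a time in seconds; the InfiniBand local ACK timeout works out as

4.096 us * 2^timeout  =>  4.096 us * 2^20 ~= 4.3 seconds

so 20 should already be quite generous. I am also not certain that 14 is a legal retry count, since the corresponding IB field is only 3 bits wide (maximum 7); please correct me if I have this wrong.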

I hope I have included all the info; if anything else is required, I should be able to provide it without a problem.

Thanks a lot in advance for your help.

--
regards,
Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  a...@ocf.co.uk
Web:    http://www.ocf.co.uk

Skype:  arif_ali80
MSN:    a...@ocf.co.uk
