Hi List,

1. We have

HW
 * 2x BladeCenter H
 * 2x Cisco InfiniBand Switch Modules
 * 1x Cisco InfiniBand Switch
 * 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA
SW
 * SLES 10
 * OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of the validation process for the customer.
I have tried the OpenMPI that comes with OFED 1.1, which gave spurious "Not enough memory" error messages. After looking through the FAQs (with the help of Cisco) I was able to find the problems and fixes: I added unlimited soft and hard limits for memlock and turned RDMA off by using "--mca btl_openib_flags 1". This still did not work, and I still got the memory problems.
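For reference, the memlock limits were raised along the lines the FAQ suggests, roughly as below on each blade (treat this as a sketch rather than the exact deployed config):

  # /etc/security/limits.conf
  *    soft    memlock    unlimited
  *    hard    memlock    unlimited

I also checked with "ulimit -l" in a shell started over ssh (not just an interactive login) that the limit is actually in effect for the processes mpirun starts.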
I tried the nightly snapshot OpenMPI-1.2b4r13137, which failed miserably. I then tried the released version, OpenMPI-1.2b3, which got me further than before. Now the benchmark goes through all the tests until Allgatherv finishes, and then it appears to sit waiting to start Alltoall; I waited about 12 hours to see whether it would continue. I have since managed to run Alltoall, and the rest of the benchmark, separately.
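To run the remaining tests on their own I simply passed the benchmark names to IMB on the mpirun command line, roughly as below (the process count and hostfile name here are placeholders):

  mpirun -np 64 --hostfile hosts ./IMB-MPI1 Alltoall
  mpirun -np 64 --hostfile hosts ./IMB-MPI1 Bcast Barrier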
I have tried a few tunable parameters that were suggested by Cisco, which improved the results, but the run still hung. The parameters that I have used to try and diagnose this are below; the debug/verbose variables were there to see whether I could get any error messages while the benchmark was running.
#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1

2. On another side note, I am having similar problems on another customer's cluster, where the benchmark hangs, but at a different place each time.
HW specs
 * 12x IBM 3455 machines, each 2x dual-core, with InfiniPath/PathScale HCAs
 * 1x Voltaire switch

SW
 * master: RHEL 4 AS U3
 * compute: RHEL 4 WS U3
 * OFED 1.1.1 w. OpenMPI-1.1.2

A) In this case I have also had errors, which I was able to turn off by setting btl_openib_warn_no_hca_params_found to 0, but I wasn't sure if this was the right thing to do.
--------------------------------------------------------------------------
WARNING: No HCA parameters were found for the HCA that Open MPI detected:

    Hostname:           node004
    HCA vendor ID:      0x1fc1
    HCA vendor part ID: 13

Default HCA parameters will be used, which may result in lower performance.

You can edit any of the files specified by the btl_openib_hca_param_files MCA parameter to set values for your HCA.

NOTE: You can turn off this warning by setting the MCA parameter btl_openib_warn_no_hca_params_found to 0.
--------------------------------------------------------------------------

b) The runs on this machine would also hang, so I tried to remove all the unnecessary daemons to see if that would improve things. With those gone, 75% of the time it would run longer and then hang at Alltoall, and otherwise it would hang at various other places. At times I also saw errors regarding the retry count and timeout, both of which I increased, to 14 and 20 respectively. I tried steps similar to those on the PPC cluster to fix the freezing, but had no luck.
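Coming back to the warning under A): I assume the alternative it points at is to add an entry for this HCA to the file named by btl_openib_hca_param_files, something along the lines below, but I have not verified the section name or sensible values for the InfiniPath card, so this is only a sketch of the idea:

  [QLogic InfiniPath]
  vendor_id = 0x1fc1
  vendor_part_id = 13
  use_eager_rdma = 1
  mtu = 2048

Since I wasn't sure of the right values, I went with silencing the warning instead.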
Below are all the parameters that I have used in this case:

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_warn_no_hca_params_found=0
btl_openib_flags=1
#mpi_preconnect_all=1
mpi_leave_pinned=1
btl_openib_ib_retry_count=14
btl_openib_ib_timeout=20
mpool_base_use_mem_hooks=1

I hope I have included all the info; if there is anything else required, I should be able to provide that without a problem.
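For completeness, the parameter lists above are in MCA parameter file format (the commented-out lines are ones I toggled while testing); they can live in the standard per-user file as sketched below, and individual values can equally be passed per run with "--mca <param> <value>" on the mpirun command line:

  # $HOME/.openmpi/mca-params.conf
  btl_openib_warn_no_hca_params_found = 0
  btl_openib_flags = 1
  btl_openib_ib_retry_count = 14
  btl_openib_ib_timeout = 20
  mpi_leave_pinned = 1
  mpool_base_use_mem_hooks = 1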
thanks a lot for your help in advance

--
regards,

Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  a...@ocf.co.uk
Web:    http://www.ocf.co.uk
Skype:  arif_ali80
MSN:    a...@ocf.co.uk