Hi List,

1. We have
HW
* 2xBladecenter H
* 2xCisco Infiniband Switch Modules
* 1xCisco Infiniband Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA

SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of the validation process for the customer.

I first tried the OpenMPI that comes with OFED 1.1, which gave spurious "Not Enough Memory" error messages. After looking through the FAQs (with the help of Cisco) I was able to find the problems and fixes: I set unlimited soft and hard limits for memlock, and turned RDMA off by using "--mca btl_openib_flags 1". This still did not work, and I still got the memory problems.
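For reference, this is roughly what was set (the limits syntax is from the FAQ; "-np 64" is just an example for 16 blades x 4 cores):

# /etc/security/limits.conf, on every node
* soft memlock unlimited
* hard memlock unlimited

# run with RDMA disabled (send/receive only)
mpirun --mca btl_openib_flags 1 -np 64 ./IMB-MPI1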

I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed miserably.

I then tried the released OpenMPI-1.2b3, which got me further than before. Now the benchmark goes through all the tests until Allgatherv finishes, and then seems to hang waiting to start Alltoall; I waited about 12 hours to see if it would continue. I have since managed to run Alltoall, and the rest of the benchmark, separately.
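(In case it helps anyone reproduce this: IMB accepts individual benchmark names on the command line, which is how I ran the remaining tests separately, along the lines of

mpirun -np 64 ./IMB-MPI1 Alltoall Allreduce Bcast Barrier

where the benchmark list and the -np value here are just examples.)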

I have tried a few tunable parameters that were suggested by Cisco; they improved the results, but the benchmark still hung. The parameters I used to try to diagnose the problem are below (an example of how they are loaded follows the list). I used the debug/verbose variables to see if I could get error messages while the benchmark was running.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1

2. On another side note, I am having similar problems on another customer's cluster, where the benchmark hangs but at a different place each time.

HW
* 12x IBM 3455 machines, each with 2 dual-core CPUs and an InfiniPath/PathScale HCA
* 1x Voltaire switch

SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2

a) In this case, I also got warnings, which I was able to turn off by setting btl_openib_warn_no_hca_params_found to 0, but I wasn't sure if this was the right thing to do:
--------------------------------------------------------------------------
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:

   Hostname:           node004
   HCA vendor ID:      0x1fc1
   HCA vendor part ID: 13

Default HCA parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.

NOTE: You can turn off this warning by setting the MCA parameter
     btl_openib_warn_no_hca_params_found to 0.
--------------------------------------------------------------------------
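(If silencing the warning is the wrong approach, I assume the alternative is to add a section for this HCA to the file named by the btl_openib_hca_param_files parameter; something along these lines, where the section name and the tuning values are placeholders I have not tested:

[QLogic InfiniPath]
vendor_id = 0x1fc1
vendor_part_id = 13
use_eager_rdma = 1
mtu = 2048

Corrections welcome if the format is different.)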
b) The runs on this machine would also hang, so I removed all the unnecessary daemons to see if that would improve things. After that, about 75% of the time it would run longer, until Alltoall, where it would hang; otherwise it would hang at various other places. At times I also got errors regarding the retry count and timeout, which I increased to 14 and 20 respectively (see the note after the list). I tried steps similar to those on the PPC cluster to fix the freezing, but had no luck. Below are all the parameters that I have used in this case:


#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_warn_no_hca_params_found=0
btl_openib_flags=1
#mpi_preconnect_all=1
mpi_leave_pinned=1
btl_openib_ib_retry_count=14
btl_openib_ib_timeout=20
mpool_base_use_mem_hooks=1
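A note on those last two values: as I understand it, btl_openib_ib_timeout is an exponent, not a time in seconds; the InfiniBand local ACK timeout works out as

4.096 us * 2^timeout  =>  4.096 us * 2^20 ~= 4.3 seconds

so 20 should already be quite generous. I am also not certain that 14 is a legal retry count, since the corresponding IB field is only 3 bits wide (maximum 7); please correct me if I have this wrong.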

I hope I have included all the info; if anything else is required, I should be able to provide it without a problem.

Thanks a lot in advance for your help.

--
regards,
Arif Ali
Software Engineer
OCF plc

Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  a...@ocf.co.uk
Web:    http://www.ocf.co.uk

Skype:  arif_ali80
MSN:    a...@ocf.co.uk
