Hello everybody,

I had the same problem described in the thread http://www.open-mpi.org/community/lists/users/2008/05/5601.php, which I solved by setting the btl_openib_free_list_max MCA parameter to 2048. For reference, this is how I set it (the mpirun arguments are only an example, and I don't know yet whether 2048 is a sensible value in general):
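    # per-run, on the mpirun command line (process count and binary are just examples):
    mpirun --mca btl_openib_free_list_max 2048 -np 64 ./hpcc

    # or as a site-wide default, in /etc/openmpi/openmpi-mca-params.conf:
    btl_openib_free_list_max = 2048

    # ompi_info should show the current/default value of the parameter:
    ompi_info --param btl openib | grep free_list_max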
However, I still have some doubts and derived problems that I would like to comment on:

1) Is this a problem that only affects the hpcc MPIRandomAccess test, or can it happen with any other code?

2) Should I set this parameter to some value by default? Would performance be affected? What should I take into account when tuning this parameter (if needed) for our home-made applications?

3) I am using the jfs file system on our cluster nodes, and occasionally it has ended up corrupted or remounted read-only when running into memory problems such as the hpcc MPIRandomAccess test or other problems with our home-made code.
   a) How can memory problems caused by user codes corrupt the / and/or /home file systems?
   b) Is this related to libibverbs bypassing the kernel TCP stack (I had to make /dev/infiniband/uverbs0 read/write for everybody)?
   c) Should I change to the ext3 file system?
   d) Should I change other parameters according to http://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage ?

I have a newly started InfiniBand cluster on standby, so please, any comment or advice will be welcome.

***********************************************************************
My environment info is:

1) OpenFabrics included in the distribution
2) Linux distribution: Ubuntu 7.04
   uname -a -> Linux jff202 2.6.20-16-server #2 SMP Tue Feb 12 02:16:56 UTC 2008 x86_64 GNU/Linux
3) Subnet manager: OpenSM 3.1.11 from OFED 1.3, installed on the cluster server with Ubuntu 8.04
4) ulimit -l -> unlimited (see the P.S. at the end for how I believe this is configured on the nodes)
5) The MCA parameters that I have modified in /etc/openmpi/openmpi-mca-params.conf are:
   mpi_paffinity_alone = 1
   pls_rsh_agent = rsh

Thanks in advance,
regards
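P.S. Regarding item 4 above: the locked-memory limit shows as unlimited on the compute nodes. I believe the usual way to set this (and roughly what we have) is an entry in /etc/security/limits.conf like the sketch below, but I am quoting it from memory, so treat it as an assumption rather than our exact configuration:

    # /etc/security/limits.conf on the compute nodes (sketch, not the exact file):
    # allow all users to lock unlimited memory, as needed for registered/RDMA memory
    *   soft   memlock   unlimited
    *   hard   memlock   unlimited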