> We have very limited openib resources for testing at the moment. Can
> you provide details on how to reproduce?
My bad; I must've been in a bigger hurry to go home for the weekend
than I thought.
I'm going to start with the assumption you're interested in the steps
to reproduce it in OpenMPI, and are less interested in the specifics
of the OpenIB setup.
Hardware Data:
Dual Opteron
4 GB RAM
PCI-X Mellanox IB HCAs
Software:
SuSE Linux Enterprise Server 9es, SP2
Linux Kernel 2.6.14 (Kernel IB drivers)
OpenIB.org svn build of the userspace libraries and utilities. (I
mentioned the revision number in an earlier post)
Setup:
Recompiled Presta, Intel MPI Benchmark, HPL, and HPCC against Open MPI
1.0rc5
HPL.dat and HPCC.dat are identical to the versions I posted
previously. (Not included here, to avoid redundant traffic.)
Execution was started by uncommenting the desired binary in the
following (truncated) script:
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/hello_world
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/IMB-MPI1
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/com -o100
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/allred 1000 100 1000
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/globalop --help
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca ptl openib -np 16 -machinefile $work_dir/node $work_dir/laten -o 100
#mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/hpcc
mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl openib -np 16 -machinefile $work_dir/node $work_dir/xhpl
As to which tests produce the error: the Presta 'com' test almost
always produces it, although at a different place in the test on each
run. (There are two files, presta.com-16.rc5 and presta.gen2-16rc5.
Both are from runs of the 'com' test; note that they fail at
different points.)
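
Since the failure point moves from run to run, it may help to run the
same test several times and keep one log per run for comparison. The
following is only a rough sketch, not the script actually used: the
function name run_repeated, the MPIRUN and log_dir variables, the run
count, and the log file names are all made up for illustration. The
mpirun options are copied from the script above.

```shell
#!/bin/sh
# Sketch only: run one benchmark several times and record each exit status.
# run_repeated, MPIRUN, log_dir and the run-N.log names are hypothetical.

prefix=/usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/
work_dir=${work_dir:-.}
log_dir=${log_dir:-.}
MPIRUN=${MPIRUN:-mpirun}

# Usage: run_repeated RUNS BINARY [ARGS...]
run_repeated() {
    runs=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Same options as the script above; output goes to one log per run.
        "$MPIRUN" --prefix "$prefix" --mca btl openib -np 16 \
            -machinefile "$work_dir/node" "$@" > "$log_dir/run-$i.log" 2>&1
        status=$?
        echo "run $i: exit status $status"
        if [ "$status" -ne 0 ]; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures of $runs runs failed"
}
```

For example, `run_repeated 5 "$work_dir/com" -o100` would run the 'com'
test five times; comparing the tails of the run-N.log files should show
where each run stopped.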
In addition, IMB (Intel MPI Benchmark) exhibits the same behavior,
halting execution in different places. The 'allred' and 'globalop'
tests behave the same way, producing the same error. (However, I did
manage to get 'allred' to actually complete once... somehow.)
HPL and HPCC would also exit, producing the same errors.
If there's anything else I may have left out, I'll see what I can do.