If I start an MPI job with:
mpirun -np 2
the following gets shown:
--
[0,1,0]: OpenIB on host tux was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--
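For what it's worth, one way to confirm whether the node actually
exposes a usable HCA is to query the verbs layer directly (this assumes
the OFED userland tools are installed on tux):

  ibv_devinfo

If nothing is listed there, Open MPI falls back to another transport
exactly as the message says. You can also make the fallback explicit,
and silence the warning, by restricting the BTL list, e.g.

  mpirun --mca btl tcp,self -np 2 ./your_app

where ./your_app stands in for whatever executable the job actually runs.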
I'm trying to run skampi-5.0.1-r0191 under PBS
over IB with the command line
mpirun -np 2 ./skampi -i coll.ski -o coll_ib.sko
The pt2pt and mmisc tests run to completion.
The coll and onesided tests, on the other hand,
start to produce output but then seem to hang.
Actually, the cpus appear to
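When a run wedges like this, a quick way to see where the ranks are
stuck is to attach a debugger to one of the skampi processes on the
compute node (the pid below is whatever ps reports for the rank; gdb is
assumed to be available there):

  gdb -p <pid>
  (gdb) thread apply all bt

A backtrace parked in the InfiniBand progress loop, as opposed to
inside the collective itself, would help narrow down where the hang lives.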
On Wed, Sep 19, 2007 at 01:58:35PM -0600, Edmund Sumbar wrote:
> I'm trying to run skampi-5.0.1-r0191 under PBS
> over IB with the command line
>
> mpirun -np 2 ./skampi -i coll.ski -o coll_ib.sko
Can you add
choose_barrier_synchronization()
to coll.ski and try again? It looks like an issue we have seen before.
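Concretely, the change is a single line near the top of coll.ski,
before any of the measurement blocks (assuming the stock SKaMPI 5
configuration layout):

  choose_barrier_synchronization()

If I remember the names correctly, choose_real_synchronization() and
choose_no_synchronization() are the other modes SKaMPI offers;
switching between them is the usual way to trade measurement accuracy
against robustness.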
I'm now running the same experiment under valgrind. It's probably
going to run for a few days, but interestingly what I'm seeing now is
that while running under valgrind's memcheck, the app has been
reporting many more of these "recv failed" errors, and not only on the
server node:
[host1][0,1
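For anyone who wants to reproduce this, the runs are the usual
mpirun-wraps-valgrind arrangement, something like the following (the
exact valgrind flags here are illustrative, not necessarily the ones
used in this run):

  mpirun -np 2 valgrind --error-limit=no --log-file=vg.%p.log ./skampi -i coll.ski -o coll_ib.sko

so that each rank writes its own memcheck log, %p being expanded to the
process id.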
The suggestion will probably work, but it is not a solution. Choosing
barrier synchronization is not recommended by the SKaMPI team, as it
reduces the accuracy of the benchmark.
The problem is either at the pml ob1 level or at the btl ib level, and
it has to do with many messages being sent at the same time.
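The parameters that govern how many messages and receive buffers those
components keep in flight can be listed with ompi_info (names and
defaults vary between Open MPI releases):

  ompi_info --param pml ob1
  ompi_info --param btl openib

Tightening the receive-queue and flow-control settings shown there
would be one way to test the "too many messages at once" theory without
touching the benchmark itself.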