What iWARP hardware are you using?

I only tested with Chelsio T3 iWARP hardware before v1.3 was launched; I tested with Intel (NetEffect) 020's after v1.3 was launched and found that their driver in OFED v1.4.0 does not handle RDMA CM REJECT messages correctly. I have not yet tested with any other iWARP hardware.

You probably don't care about the details; the high-level description of the problem is this: there are ordering issues in our implementation during OpenFabrics wireup such that Open MPI insists on connections being made in "one direction." If a connection is made in "the Wrong direction," OMPI will REJECT the connection and initiate a new connection in "the Right direction." It's this REJECT that the Intel (NetEffect) driver doesn't handle properly, which also explains why reversing the order of your hosts works: in that case the connections are made in the Right direction, so OMPI never issues a REJECT.
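If you're curious what that looks like at the RDMA CM level, here's a minimal sketch of the passive-side pattern. To be clear, this is NOT Open MPI's actual openib/rdmacm code, and the "direction" test below is just a made-up stand-in for OMPI's real tie-breaking rule; it only illustrates accept-vs-reject on a CONNECT_REQUEST:

    /* Sketch only: accept requests that arrive in the Right direction,
     * rdma_reject() the rest.  A well-behaved driver then delivers
     * RDMA_CM_EVENT_REJECTED to the peer; the buggy OFED v1.4.0 Intel
     * driver delivers RDMA_CM_EVENT_CONNECT_ERROR instead. */
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    /* Stand-in for OMPI's real rule: pretend the side with the lower IPv4
     * address must be the one that initiates. */
    static int right_direction(struct rdma_cm_id *id)
    {
        struct sockaddr_in *local = (struct sockaddr_in *) rdma_get_local_addr(id);
        struct sockaddr_in *peer  = (struct sockaddr_in *) rdma_get_peer_addr(id);
        return ntohl(peer->sin_addr.s_addr) < ntohl(local->sin_addr.s_addr);
    }

    static int handle_cm_event(struct rdma_event_channel *channel)
    {
        struct rdma_cm_event *event;

        if (rdma_get_cm_event(channel, &event) != 0)
            return -1;

        if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) {
            if (right_direction(event->id)) {
                struct rdma_conn_param param = { .responder_resources = 1,
                                                 .initiator_depth = 1 };
                rdma_accept(event->id, &param);
            } else {
                /* Refuse; the peer should see RDMA_CM_EVENT_REJECTED and
                 * then re-initiate the connection the other way around. */
                rdma_reject(event->id, NULL, 0);
            }
        }
        return rdma_ack_cm_event(event);
    }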

This gets complicated because two different, as-yet-unreleased pieces of software are compensating:

- Open MPI v1.3.1 (coming soon): contains a workaround for the broken REJECT behavior in the OFED v1.4.0 Intel driver on Intel/NetEffect RNICs. It relies on detecting that it is running on one of the misbehaving Intel RNICs to know when to apply the workaround; you can also enable the workaround manually via an MCA parameter.

- OFED v1.4.1 (coming soon): As of yesterday, Intel was on schedule to deliver driver fixes for the REJECT behavior for OFED v1.4.1. Hopefully they'll stay on schedule, OFED v1.4.1 will contain the fixes, and OMPI v1.3 will work out of the box.

The situation gets further complicated because the Intel RNICs do not report their vendor/part IDs properly in OFED v1.4.0. Hence, Open MPI v1.3.1 cannot automatically know to apply the workaround (because it can't detect that it's running on a problematic RNIC); you unfortunately have to set an MCA parameter to activate the workaround.
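If you want to see what your RNIC is actually reporting, ibv_devinfo will print the vendor and part IDs, e.g.:

    ibv_devinfo | grep -E 'vendor_id|vendor_part_id'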

That being said, the fixes for the Intel RNICs to properly report their vendor/part IDs have already been pushed upstream and will definitely be included in OFED v1.4.1. So here are the possible outcomes:

1. OMPI v1.3.1 will definitely work with OFED v1.4.1 with Intel RNICs (either via auto-detecting to use the workaround or if the Intel REJECT driver problems get fixed).

2. OMPI v1.3.1 will work with OFED v1.4.0 if you manually set an MCA parameter to activate the workaround (perhaps it would be convenient to set that MCA param in the system-wide mca-params.conf file).

3. OMPI v1.3.0 will work with OFED v1.4.1 *if* Intel gets the REJECT fixes pushed upstream in time for OFED v1.4.1.

So if you're running with Intel/NetEffect RNICs on OFED v1.4.0, you might want to try a nightly OMPI v1.3.1 tarball. They're not yet released, but they're darn close and pretty stable. If nothing else, you can at least see if it works for you:

    http://www.open-mpi.org/nightly/v1.3/
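The nightlies build like any other Open MPI tarball; something like this (the install prefix is just an example, and the exact tarball name varies per snapshot):

    tar xzf openmpi-1.3.1<snapshot>.tar.gz
    cd openmpi-1.3.1<snapshot>
    ./configure --prefix=$HOME/ompi-v1.3-nightly
    make all install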

You may need to enable the btl_openib_connect_rdmacm_reject_causes_connect_error MCA parameter (yes, it's a long name on purpose :-) ), perhaps something like this:

mpirun --mca btl_openib_connect_rdmacm_reject_causes_connect_error 1 ....

To be absolutely clear: this MCA parameter and RDMA CM workaround do not exist in OMPI v1.3.0. Since v1.3.1 final is not yet released, the only way to try it out is via the v1.3.1 nightly tarballs (at the URL above).
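If you'd rather not pass the flag on every mpirun command line, the same setting can go in the system-wide MCA params file mentioned above (typically $prefix/etc/openmpi-mca-params.conf; on your install that would presumably be /usr/mpi/gcc/openmpi-1.3/etc/openmpi-mca-params.conf), one "name = value" per line:

    # work around the OFED v1.4.0 NetEffect REJECT handling bug
    btl_openib_connect_rdmacm_reject_causes_connect_error = 1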

Hope that helps!





On Feb 19, 2009, at 9:48 AM, viral.me...@einfochips.com wrote:

Hi all,
I successfully installed OpenMPI-1.3. I am trying to run OpenMPI over iWARP, but I am getting this error:

RDMA_CM_EVENT_CONNECT_ERROR

I tried to run with more debug messages:

mpirun --mca orte_base_help_aggregate 0 -np 2 -display-map -v \
    -host 100.168.54.49,100.168.54.50 \
    /usr/mpi/gcc/openmpi-1.3/tests/osu_benchmarks-3.0/osu_bw

And I got
[qa49:06449] *** Process received signal ***
[qa49:06449] Signal: Segmentation fault (11)
[qa49:06449] Signal code: Address not mapped (1)
[qa49:06449] Failing at address: 0x1c
[qa49:06449] [ 0] /lib64/tls/libpthread.so.0 [0x3c4d80c5b0]
[qa49:06449] [ 1] /usr/mpi/gcc/openmpi-1.3/lib64/libopen-pal.so.0 [0x2a95868604]
[qa49:06449] [ 2] /usr/mpi/gcc/openmpi-1.3/lib64/libopen-pal.so.0(opal_show_help_vstring+0xd5) [0x2a95867215]
[qa49:06449] [ 3] /usr/mpi/gcc/openmpi-1.3/lib64/libopen-rte.so.0(orte_show_help+0xaf) [0x2a9570d36f]
[qa49:06449] [ 4] /usr/mpi/gcc/openmpi-1.3/lib64/openmpi/mca_btl_openib.so [0x2a970a8e64]
[qa49:06449] [ 5] /usr/mpi/gcc/openmpi-1.3/lib64/openmpi/mca_btl_openib.so [0x2a970a2d0b]
[qa49:06449] [ 6] /usr/mpi/gcc/openmpi-1.3/lib64/libopen-pal.so.0 [0x2a958557b8]
[qa49:06449] [ 7] /usr/mpi/gcc/openmpi-1.3/lib64/libopen-pal.so.0(opal_progress+0xac) [0x2a9584a80c]
[qa49:06449] [ 8] /usr/mpi/gcc/openmpi-1.3/lib64/libmpi.so.0 [0x2a9558aa15]
[qa49:06449] [ 9] /usr/mpi/gcc/openmpi-1.3/lib64/libmpi.so.0(PMPI_Waitall+0x8a) [0x2a955b756a]
[qa49:06449] [10] /usr/mpi/gcc/openmpi-1.3/tests/osu_benchmarks-3.0/osu_bw(main+0x29d) [0x401135]
[qa49:06449] [11] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3c4cf1c3fb]
[qa49:06449] [12] /usr/mpi/gcc/openmpi-1.3/tests/osu_benchmarks-3.0/osu_bw [0x400e0a]
[qa49:06449] *** End of error message ***

Am I doing something wrong?

Surprisingly,

mpirun --mca orte_base_help_aggregate 0 -np 2 -display-map -v \
    -host 100.168.54.50,100.168.54.49 \
    /usr/mpi/gcc/openmpi-1.3/tests/osu_benchmarks-3.0/osu_bw

works fine (note that only the host arguments are swapped).


Thanks,
Viral




--
Jeff Squyres
Cisco Systems
