George Bosilca wrote:
> A fix for this problem is now available on the trunk. Please use any
> revision after 14963 and your problem will vanish [I hope!]. There are
> now some additional parameters which allow you to select which Myrinet
> network you want to use in the case there are several available (--mca
> btl_mx_if_include and --mca btl_mx_if_exclude). Even multi-rails should
> now work over MX.

I have tried the nightly snapshot openmpi-1.3a1r14981 and it (almost)
seems to work. As is, when run in combination with MX-1.2.0j and the
FMA mapper, this version currently produces the following error on
each node:
mx_get_info(MX_LINE_SPEED) failed with status 35 (Bad info length)
However, with the small patch below, multi-cluster jobs indeed seem
to be running fine (using MX locally). I'll do some more testing
later this week.
Thanks a lot for the fix!
Kees
*** ./ompi/mca/btl/mx/btl_mx_component.c.orig 2007-06-11 17:12:11.000000000 +0200
--- ./ompi/mca/btl/mx/btl_mx_component.c 2007-06-11 17:13:34.000000000 +0200
***************
*** 310,316 ****
  #if defined(MX_HAS_NET_TYPE)
      {
          int value;
!         if( (status = mx_get_info( mx_btl->mx_endpoint, MX_LINE_SPEED, NULL, 0,
                                     &value, sizeof(int))) != MX_SUCCESS ) {
              opal_output( 0, "mx_get_info(MX_LINE_SPEED) failed with status %d (%s)\n",
                           status, mx_strerror(status) );
--- 310,317 ----
  #if defined(MX_HAS_NET_TYPE)
      {
          int value;
!         if( (status = mx_get_info( mx_btl->mx_endpoint, MX_LINE_SPEED,
!                                    &nic_id, sizeof(nic_id),
                                     &value, sizeof(int))) != MX_SUCCESS ) {
              opal_output( 0, "mx_get_info(MX_LINE_SPEED) failed with status %d (%s)\n",
                           status, mx_strerror(status) );
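
To illustrate why the two extra arguments matter, here is a toy model of
the calling convention, not the real MX library. It assumes (judging only
from the "Bad info length" message and the patch above) that the line
speed query wants an input buffer identifying the board and rejects a
call that passes none. Every name in it (toy_get_line_speed, the toy_status
codes, the speed values) is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes for this toy model; NOT the real MX values. */
enum toy_status { TOY_SUCCESS = 0, TOY_BAD_INFO_LENGTH = 1 };

/* Hypothetical stand-in for a "line speed" query keyed by board number.
 * It is not mx_get_info(); it only mimics the in/out buffer convention. */
enum toy_status toy_get_line_speed(const void *in, size_t in_len,
                                   void *out, size_t out_len)
{
    uint32_t board;
    int speed;

    /* The query must know which board is meant, so an absent or
     * wrongly sized input buffer is rejected up front. */
    if (in == NULL || in_len != sizeof(board))
        return TOY_BAD_INFO_LENGTH;
    if (out == NULL || out_len != sizeof(speed))
        return TOY_BAD_INFO_LENGTH;

    memcpy(&board, in, sizeof(board));
    speed = (board == 0) ? 2 : 10;   /* made-up values, illustration only */
    memcpy(out, &speed, sizeof(speed));
    return TOY_SUCCESS;
}

/* Mirrors the call before the patch: no input buffer at all. */
enum toy_status query_without_board(int *value)
{
    assert(value != NULL);
    return toy_get_line_speed(NULL, 0, value, sizeof(*value));
}

/* Mirrors the call after the patch: the board number is passed in. */
enum toy_status query_with_board(uint32_t nic_id, int *value)
{
    assert(value != NULL);
    return toy_get_line_speed(&nic_id, sizeof(nic_id), value, sizeof(*value));
}
```

In this model query_without_board() always fails and leaves the output
untouched, while query_with_board() succeeds, which matches the behaviour
I see before and after the patch. Whether the real library performs its
check in this way is a guess on my part.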