You could confirm that it is the IPv6 loop by simply disabling IPv6 support:
configure with --disable-ipv6 and see if you still get the error messages.

Thanks for continuing to pursue this!
Ralph

On Dec 16, 2009, at 8:41 PM, kevin.buck...@ecs.vuw.ac.nz wrote:

>> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
>> so including all the recent work, and got the examples to run, albeit
>> still with the:
>> 
>> opal_sockaddr2str failed:Unknown error (return code 4)
>> 
>> non-fatal errors.
>> 
>> As promised, I'll do a bit more digging into this.
> 
> Here's the result of me "fancying a dig":
> 
> The software I was adding on top of OpenMPI, initially PETSc and,
> above that, PISM, has exhibited errors when run within an SGE/OpenMPI
> environment when FOUR or EIGHT processors are used, but not TWO.
> 
> The codes run when 2 or 4 processes are run on a single machine
> outside of SGE.
> 
> 
> I added a bit of debugging code into the
> 
> opal/util/net.c:opal_net_get_hostname()
> 
> routine.
> 
> --- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
> +++ opal-util-net.c     2009-12-17 14:24:08.000000000 +1300
> @@ -369,6 +369,10 @@
>         return NULL;
>     }
> 
> +    /* KMB */
> +    opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
> +               addr->sa_len, addr->sa_family, addrlen ) ;
> +    /* KMB */
>     error = getnameinfo(addr, addrlen,
>                         name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);
> 
> 
> Here's what I see, from stderr, when running the SkaMPI 5 test:
> 
> skampi -i ski/skampi_pt2pt.ski
> 
> across a 4-node SGE submission.
> 
> The SkaMPI test runs through by the way.
> 
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> 
> 
> You'll notice that at least one "addr" that is making its way into
> 
> opal_net_get_hostname
> 
> has an sa_len of zero, and that seems to be what is triggering
> the
> 
> opal_sockaddr2str
> 
> messages.
> 
> I was wondering whether this was coming out of the IPv6 getifaddrs
> loop, as I thought I'd set everything explicitly in the munged IPv4
> stanza.
> 
> I'd like to "tidy up" those messages, if only because failing with
> both an unknown error and a temporary failure doesn't seem right!
> 
> Any thoughts welcome,
> Kevin
> 
> -- 
> Kevin M. Buckley                                  Room:  CO327
> School of Engineering and                         Phone: +64 4 463 5971
> Computer Science
> Victoria University of Wellington
> New Zealand
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

