You could confirm that it is the IPv6 loop by simply disabling IPv6 support: configure with --disable-ipv6 and see whether you still get the error messages.
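Independently of the configure test, a guard in front of getnameinfo() might help narrow down which caller hands over the zero-length address reported in the quoted message. The following is only a minimal sketch in plain C, not the Open MPI code: the helper names expected_addrlen() and numeric_host() and the HAVE_SA_LEN macro are made up for illustration (the macro is assumed to be defined by the build on systems whose struct sockaddr has an sa_len member). It rejects addresses whose length does not fit the address family, and it derives the error text from getnameinfo()'s own return value via gai_strerror().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: size a sockaddr of this family should have,
 * or 0 if the family is not one we know how to print. */
static socklen_t expected_addrlen(int family)
{
    switch (family) {
    case AF_INET:
        return (socklen_t)sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t)sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}

/* Hypothetical helper: write the numeric host string for addr into buf.
 * Returns buf on success, NULL if the address looks malformed or
 * getnameinfo() fails. */
static const char *numeric_host(const struct sockaddr *addr, socklen_t addrlen,
                                char *buf, size_t buflen)
{
    socklen_t want;
    int rc;

    if (addr == NULL || buf == NULL || buflen == 0) {
        return NULL;
    }

    /* Refuse addresses whose length cannot match their family. */
    want = expected_addrlen(addr->sa_family);
    if (want == 0 || addrlen < want) {
        fprintf(stderr, "skipping address: sa_family %d, addrlen %u\n",
                (int)addr->sa_family, (unsigned)addrlen);
        return NULL;
    }

#ifdef HAVE_SA_LEN /* assumption: set by the build where sa_len exists */
    /* An unset sa_len is the case seen in the debug output. */
    if (addr->sa_len == 0) {
        fprintf(stderr, "skipping address: sa_len is 0 (addrlen %u)\n",
                (unsigned)addrlen);
        return NULL;
    }
#endif

    rc = getnameinfo(addr, addrlen, buf, (socklen_t)buflen,
                     NULL, 0, NI_NUMERICHOST);
    if (rc != 0) {
        /* gai_strerror() takes the EAI_* value returned by getnameinfo(). */
        fprintf(stderr, "getnameinfo failed: %s (return code %d)\n",
                gai_strerror(rc), rc);
        return NULL;
    }
    return buf;
}
```

Called with a properly filled-in IPv4 sockaddr_in and sizeof(struct sockaddr_in), numeric_host() should hand back the dotted-quad string. A too-short addrlen, or a zero sa_len where HAVE_SA_LEN is defined, is reported and skipped rather than passed on to the resolver.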
Thanks for continuing to pursue this!
Ralph

On Dec 16, 2009, at 8:41 PM, kevin.buck...@ecs.vuw.ac.nz wrote:

>> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
>> so including all the recent work, and got the examples to run, albeit
>> still with the:
>>
>> opal_sockaddr2str failed:Unknown error (return code 4)
>>
>> non-fatal errors.
>>
>> As promised, I'll do a bit more digging into this.
>
> Here's the result of me "fancying a dig":
>
> The software I was adding on top of OpenMPI, initially PETSc, and
> above that PISM, has exhibited errors when run within an SGE/OpenMPI
> environment when FOUR or EIGHT processors are used, but not TWO.
>
> The codes run when 2 or 4 processes are run on a single machine
> outside of SGE.
>
>
> I added a bit of debugging code into the
>
> opal/util/net.c:opal_net_get_hostname()
>
> routine.
>
> --- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
> +++ opal-util-net.c     2009-12-17 14:24:08.000000000 +1300
> @@ -369,6 +369,10 @@
>          return NULL;
>      }
>
> +    /* KMB */
> +    opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
> +                addr->sa_len, addr->sa_family, addrlen ) ;
> +    /* KMB */
>      error = getnameinfo(addr, addrlen,
>                          name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);
>
>
> Here's what I see, from stderr, when running the SkaMPI 5 test:
>
> skampi -i ski/skampi_pt2pt.ski
>
> across a 4-node SGE submission.
>
> The SkaMPI test runs through, by the way.
>
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
>
>
> You'll notice that at least one "addr" that is making its way into
>
> opal_net_get_hostname
>
> has an sa_len of zero, and that this is what seems to be triggering
> the
>
> opal_sockaddr2str
>
> messages.
>
> I was wondering whether this was coming out of the IPv6 getifaddr
> loop, as I thought I'd set everything explicitly in the munged IPv4
> stanza.
>
> I'd like to "tidy up" those messages, if only because failing with
> both an unknown error and a temporary failure doesn't seem right!
>
> Any thoughts welcome,
> Kevin
>
> --
> Kevin M. Buckley                  Room:  CO327
> School of Engineering and         Phone: +64 4 463 5971
>  Computer Science
> Victoria University of Wellington
> New Zealand