> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
> so including all the recent work, and got the examples to run, albeit
> still with the:
>
>   opal_sockaddr2str failed:Unknown error (return code 4)
>
> non-fatal errors.
>
> As promised, I'll do a bit more digging into this.

Here's the result of me "fancying a dig":

The software I was adding on top of OpenMPI, initially PETSc and, above
that, PISM, has exhibited errors when run within an SGE/OpenMPI
environment when FOUR or EIGHT processors are used, but not TWO.

The codes do run when 2 or 4 processes are started on a single machine
outside of SGE.

I added a bit of debugging code into the

  opal/util/net.c:opal_net_get_hostname()

routine.

--- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
+++ opal-util-net.c     2009-12-17 14:24:08.000000000 +1300
@@ -369,6 +369,10 @@
         return NULL;
     }

+    /* KMB */
+    opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
+                addr->sa_len, addr->sa_family, addrlen ) ;
+    /* KMB */
     error = getnameinfo(addr, addrlen,
                         name, NI_MAXHOST,
                         NULL, 0, NI_NUMERICHOST);

Here's what I see, from stderr, when running the SkaMPI 5 test:

  skampi -i ski/skampi_pt2pt.ski

across a 4-node SGE submission. The SkaMPI test itself runs through,
by the way.

[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16

You'll notice that at least one "addr" that is making its way into
opal_net_get_hostname() has an sa_len of zero, and that this is what
seems to be triggering the opal_sockaddr2str messages: every failure
in the log directly follows a line with sa_len 0, while the calls
with sa_len 16 produce no error.

I was wondering whether this was coming out of the IPv6 getifaddrs
loop, as I thought I'd set everything explicitly in the munged IPv4
stanza.

I'd like to "tidy up" those messages, if only because failing with
both an "Unknown error" and a "Temporary failure in name resolution"
for the same return code doesn't seem right!

Any thoughts welcome,
Kevin

-- 
Kevin M. Buckley                        Room:  CO327
School of Engineering and               Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
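
PS: For anyone wanting to poke at the sa_len observation outside of
OpenMPI, a minimal standalone sketch follows. It is not OpenMPI code:
the kmb_* names are made up for illustration, and the list of platforms
assumed to have a BSD-style sin_len/sa_len member is an assumption too.
It builds an IPv4 sockaddr, optionally leaves the length member at zero,
and makes the same kind of getnameinfo() call (NI_NUMERICHOST) as the
one in the diff above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for getnameinfo() under strict -std= modes */
#endif

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Assumption: these platforms carry a BSD-style length member
 * (sa_len / sin_len) in their socket address structures. */
#if defined(__NetBSD__) || defined(__FreeBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#define KMB_HAVE_SA_LEN 1
#else
#define KMB_HAVE_SA_LEN 0
#endif

/* Hypothetical helper: fill *out with an IPv4 address given as a
 * dotted quad.  When set_len is zero the length member (where it
 * exists) is left at 0, mimicking the sa_len 0 case in the log.
 * Returns 0 on success, -1 if the string is not a valid address. */
static int kmb_make_ipv4(const char *dotted, int set_len,
                         struct sockaddr_in *out)
{
    assert(dotted != NULL && out != NULL);

    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
#if KMB_HAVE_SA_LEN
    if (set_len) {
        out->sin_len = sizeof(*out);
    }
#else
    (void)set_len;                /* no length member on this platform */
#endif
    return inet_pton(AF_INET, dotted, &out->sin_addr) == 1 ? 0 : -1;
}

/* Hypothetical helper: numeric-only lookup with the same flag as the
 * call in the diff.  Returns getnameinfo()'s result: 0 on success,
 * otherwise an EAI_* code suitable for gai_strerror(). */
static int kmb_numeric_host(const struct sockaddr_in *sin,
                            char *buf, size_t buflen)
{
    return getnameinfo((const struct sockaddr *)sin,
                       (socklen_t)sizeof(*sin),
                       buf, (socklen_t)buflen,
                       NULL, 0, NI_NUMERICHOST);
}
```

Calling kmb_make_ipv4("192.0.2.1", 1, &sin) and then
kmb_numeric_host(&sin, buf, sizeof(buf)) should return 0 and leave
"192.0.2.1" in buf. On a system with a length member, repeating that
with set_len 0 is the interesting case: comparing the return value and
gai_strerror() text against the messages above would show whether a
zero sa_len on its own is enough to reproduce "return code 4". That
outcome is not asserted here, as it has not been checked.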