[OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times

2010-03-30 Thread openmpi
Hello List,

I hope you can help us out on this one, as we have been trying to figure it
out for weeks.

The situation: We have a program capable of splitting into several
processes that are distributed across the nodes of a cluster using Open MPI.
We were running that system on "older" cluster hardware (Intel Core2 Duo
based, 2 GB RAM) with an "older" kernel (2.6.18.6). All nodes boot diskless
over the network. Recently we upgraded the hardware (Intel i5, 8 GB RAM),
which also required an upgrade to a recent kernel version (2.6.26+).

Here is the problem: We see an overall performance loss on the new
hardware and think we can break it down to a communication issue
between the processes.

We also found out that the issue arises in the transition from kernel
2.6.23 to 2.6.24 (tested on the Core2 Duo system).

Here is an output from our program:

2.6.23.17 (64bit), MPI 1.2.7
5 iterations (Core2 Duo), 6 CPUs:
93.33 seconds per iteration.
 Node   0 communication/computation time:  6.83 /647.64 seconds.
 Node   1 communication/computation time: 10.09 /644.36 seconds.
 Node   2 communication/computation time:  7.27 /645.03 seconds.
 Node   3 communication/computation time:165.02 /485.52 seconds.
 Node   4 communication/computation time:  6.50 /643.82 seconds.
 Node   5 communication/computation time:  7.80 /627.63 seconds.
 Computation time:    897.00 seconds.

2.6.24.7 (64bit) .. re-evaluated, MPI 1.2.7
5 iterations (Core2 Duo), 6 CPUs:
   131.33 seconds per iteration.
 Node   0 communication/computation time:364.15 /645.24 seconds.
 Node   1 communication/computation time:362.83 /645.26 seconds.
 Node   2 communication/computation time:349.39 /645.07 seconds.
 Node   3 communication/computation time:508.34 /485.53 seconds.
 Node   4 communication/computation time:349.94 /643.81 seconds.
 Node   5 communication/computation time:349.07 /627.47 seconds.
 Computation time:   1251.00 seconds.

The program is 32-bit software, but it makes no difference whether the
kernel is 64 or 32 bit. We also tested Open MPI 1.4.1; it cut the
communication times roughly in half (which is still too high), but the
improvement shrank with increasing kernel version.

The communication time is meant to be the time the master process spends
distributing the data portions for calculation and collecting the results
from the slave processes. The value also includes the time a slave has to
wait to communicate with the master while the master is busy. This explains
the extended communication time of node #3, whose computation time is
reduced (due to the nature of its data).
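
For reference, here is a minimal sketch (not the actual invert code; the
buffer size, the data and the squaring loop are only placeholders) of how
such a communication/computation split is typically measured with MPI_Wtime
in a master/slave loop. Time spent blocked in MPI_Recv/MPI_Send counts as
communication, which is why a slave that finishes its chunk early sits
waiting for the busy master and accumulates a large communication figure:

#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    double buf[N], comm_t = 0.0, comp_t = 0.0, t0;
    int rank, size, i, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (i = 0; i < N; i++)
        buf[i] = i;

    for (iter = 0; iter < 5; iter++) {
        if (rank == 0) {                    /* master: distribute and collect */
            int slave;
            t0 = MPI_Wtime();
            for (slave = 1; slave < size; slave++)
                MPI_Send(buf, N, MPI_DOUBLE, slave, 0, MPI_COMM_WORLD);
            for (slave = 1; slave < size; slave++)
                MPI_Recv(buf, N, MPI_DOUBLE, slave, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            comm_t += MPI_Wtime() - t0;
        } else {                            /* slave: receive, compute, return */
            t0 = MPI_Wtime();               /* includes waiting for the master */
            MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            comm_t += MPI_Wtime() - t0;

            t0 = MPI_Wtime();
            for (i = 0; i < N; i++)         /* stand-in for the real solver */
                buf[i] = buf[i] * buf[i];
            comp_t += MPI_Wtime() - t0;

            t0 = MPI_Wtime();
            MPI_Send(buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
            comm_t += MPI_Wtime() - t0;
        }
    }
    printf("Node %3d communication/computation time: %6.2f /%6.2f seconds.\n",
           rank, comm_t, comp_t);
    MPI_Finalize();
    return 0;
}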

The command to start the calculation:
mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np
4 -host cluster-18,cluster-19

Using top (with 'f' and 'j' to show the P column) we could track which
process runs on which core. We found that processes stayed on their initial
cores with kernel 2.6.23, but started to bounce between cores with 2.6.24.
Using the --bind-to-core option of Open MPI 1.4.1 kept the processes on
their cores again, but that did not change the overall outcome and did not
fix the issue.
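
As a cross-check against top's 'P' column, each MPI rank can also report the
core it is currently running on. This is a Linux-specific sketch, assuming a
glibc that provides sched_getcpu() (glibc 2.6 or newer; lenny's 2.7 does):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char host[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));
    /* One line per rank; calling this once per iteration would show
     * migrations between cores directly. */
    printf("rank %d on %s: running on core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}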

We found top showing ~25% CPU wait time, and processes in state 'D'
(uninterruptible sleep), also on slave-only nodes. According to our
programmer, communication happens only between the master process and its
slaves, not among the slaves. On kernel 2.6.23 and lower, CPU usage is 100%
user time, with no wait or system percentage.

Example from top:

Cpu(s): 75.3%us,  0.6%sy,  0.0%ni,  0.0%id, 23.1%wa,  0.7%hi,  0.3%si,  0.0%st
Mem:   8181236k total,   131224k used,  8050012k free,        0k buffers
Swap:        0k total,        0k used,        0k free,    49868k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 3386 oli   20   0 90512  20m 3988 R   74  0.3  12:31.80 0 invert-
 3387 oli   20   0 85072  15m 3780 D   67  0.2  11:59.30 1 invert-
 3388 oli   20   0 85064  14m 3588 D   77  0.2  12:56.90 2 invert-
 3389 oli   20   0 84936  14m 3436 R   85  0.2  13:28.30 3 invert-


Some system information that might be helpful:

Nodes Hardware:
1. "older": Intel Core2 Duo, (2x1)GB RAM
2. "newer": Intel(R) Core(TM) i5 CPU, Mainboard ASUS RS100-E6, (4x2)GB RAM

Debian stable (lenny) distribution with
ii  libc6            2.7-18lenny2
ii  libopenmpi1      1.2.7~rc2-2
ii  openmpi-bin      1.2.7~rc2-2
ii  openmpi-common   1.2.7~rc2-2

Nodes boot diskless with an NFS root and a kernel that has all needed
drivers compiled in.

Information on the program using openmpi and tools used to compile it:

mpirun --version:
mpirun (Open MPI) 1.2.7rc2

libopenmpi-dev 1.2.7~rc2-2
depends on:
 libc6 (2.7-18lenny2)
 libopenmpi1 (1.2.7~rc2-2)
 openmpi-common (1.2.7~rc2-2)


Compilation command:
mpif90


FORTRAN compiler (FC):
gfortran --version:
GNU Fortran 

Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)

2006-06-28 Thread openmpi-user

Hi Terry,

unfortunately I haven't got a stack trace.

OS: Mac OS X 10.4.7 Server on the Xgrid server and Mac OS X 10.4.7
Client on every node (G4 and G5). For testing purposes I've installed
OpenMPI 1.1 on a Dual-G4 node and on a Dual-G5 node, with my Xgrid
consisting of only either the Dual-G4 or the Dual-G5 node. No matter
which configuration, I ran into the bus error.


Yours,
Frank

users-requ...@open-mpi.org wrote:



Today's Topics:

   1. Re: OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0)
      si_code:1(BUS_ADRALN) (Terry D. Dontje)
   2. Re: OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0),
      si_code:1(BUS_ADRALN) (Frank)


--

Message: 1
Date: Wed, 28 Jun 2006 07:26:46 -0400
From: "Terry D. Dontje" 
Subject: Re: [OMPI users] OpenMPI 1.1: Signal:10
info.si_errno:0(Unknown, error: 0) si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a26776.2000...@sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Frank,

Do you possibly have a stack trace of the program that failed?  Also,
what OS and platform are you running on?


--td

  

--

Message: 1
Date: Wed, 28 Jun 2006 12:53:14 +0200
From: Frank 
Subject: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown
error:  0) si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a25f9a.3070...@fraka-mp.de>
Content-Type: text/plain; charset="iso-8859-1"

Hi!

I've recently updated to OpenMPI 1.1 on a few nodes and am running into a
problem that wasn't there with OpenMPI 1.0.2.


Submitting a job to the XGrid with OpenMPI 1.1 yields a Bus error that 
isn't there when not submitting the job to the XGrid:


[g5dual:/Network/CFD/MVH-1.0] motte% mpirun -d -np 2 ./vhone
[g5dual.3-net:08794] [0,0,0] setting up session dir with
[g5dual.3-net:08794]    universe default-universe
[g5dual.3-net:08794]    user motte
[g5dual.3-net:08794]    host g5dual.3-net
[g5dual.3-net:08794]    jobid 0
[g5dual.3-net:08794]    procid 0
[g5dual.3-net:08794] procdir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/0/0
[g5dual.3-net:08794] jobdir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/0
[g5dual.3-net:08794] unidir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe

[g5dual.3-net:08794] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08794] tmp: /tmp
[g5dual.3-net:08794] [0,0,0] contact_file 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/universe-setup.txt

[g5dual.3-net:08794] [0,0,0] wrote setup file
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0x10
*** End of error message ***
Bus error
[g5dual:/Network/CFD/MVH-1.0] motte%

When the job is not submitted via Xgrid, OpenMPI 1.1 works just fine:

[g5dual:/Network/CFD/MVH-1.0] motte% mpirun -d -np 2 ./vhone
[g5dual.3-net:08957] procdir: (null)
[g5dual.3-net:08957] jobdir: (null)
[g5dual.3-net:08957] unidir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe

[g5dual.3-net:08957] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08957] tmp: /tmp
[g5dual.3-net:08957] connect_uni: contact info read
[g5dual.3-net:08957] connect_uni: connection not allowed
[g5dual.3-net:08957] [0,0,0] setting up session dir with
[g5dual.3-net:08957]    tmpdir /tmp
[g5dual.3-net:08957]    universe default-universe-8957
[g5dual.3-net:08957]    user motte
[g5dual.3-net:08957]    host g5dual.3-net
[g5dual.3-net:08957]    jobid 0
[g5dual.3-net:08957]    procid 0
[g5dual.3-net:08957] procdir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/0/0
[g5dual.3-net:08957] jobdir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/0
[g5dual.3-net:08957] unidir: 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957

[g5dual.3-net:08957] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08957] tmp: /tmp
[g5dual.3-net:08957] [0,0,0] contact_file 
/tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/universe-setup.txt

[g5dual.3-net:08957] [0,0,0] wrote setup file
[g5dual.3-net:08957] pls:rsh: local csh: 1, local bash: 0
[g5dual.3-net:08957] pls:rsh: assuming same remote shell as local shell
[g5dual.3-net:08957] pls:rsh: remote csh: 1, remote bash: 0
[g5dual.3-net:08957] pls:rsh: final template argv:
[g5dual.3-net:08

Re: [OMPI users] users Digest, Vol 317, Issue 4

2006-06-28 Thread openmpi-user

Hi Eric (and all),

I don't know if this really messes things up, but you have LAM/MPI set up
in your path variables, too:


[enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: 
/export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:*/usr/local/lam-mpi/7.1.1/lib*:/opt/sfw/lib


Yours,
Frank

users-requ...@open-mpi.org wrote:


Today's Topics:

   1. Re: Installing OpenMPI on a solaris (Jeff Squyres (jsquyres))


--

Message: 1
Date: Wed, 28 Jun 2006 08:56:36 -0400
From: "Jeff Squyres \(jsquyres\)" 
Subject: Re: [OMPI users] Installing OpenMPI on a solaris
To: "Open MPI Users" 
Message-ID:

Content-Type: text/plain; charset="iso-8859-1"

Bummer!  :-(
 
Just to be sure -- you had a clean config.cache file before you ran configure, right?  (e.g., the file didn't exist -- just to be sure it didn't get potentially erroneous values from a previous run of configure)  Also, FWIW, it's not necessary to specify --enable-ltdl-convenience; that should be automatic.
 
If you had a clean configure, we *suspect* that this might be due to alignment issues on Solaris 64 bit platforms, but thought that we might have had a pretty good handle on it in 1.1.  Obviously we didn't solve everything.  Bonk.
 
Did you get a corefile, perchance?  If you could send a stack trace, that would be most helpful.





From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
Behalf Of Eric Thibodeau
Sent: Tuesday, June 20, 2006 8:36 PM
    To: us...@open-mpi.org
Subject: Re: [OMPI users] Installing OpenMPI on a solaris



Hello Brian (and all),



Well, the joy was short lived. On a 12-CPU Enterprise machine and on a
4-CPU one, I seem to be able to start up to 4 processes. Above 4, I seem to
inevitably get BUS_ADRALN (bus collisions?). Below are some traces of the
failing runs as well as a detailed run (mpirun -d) of one of these situations
and the ompi_info output. Obviously, don't hesitate to ask if more information
is required.



Build version: openmpi-1.1b5r10421

Config parameters:

Open MPI config.status 1.1b5

configured by ./configure, generated by GNU Autoconf 2.59,

with options \"'--cache-file=config.cache' 'CFLAGS=-mcpu=v9'
'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9'
'--prefix=/export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u'
--enable-ltdl-convenience\"



The traces:

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400

Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)

Failing at addr:2f4f04

*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400

Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)

Failing at addr:2b354c

*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 6 mandelbrot-mpi 100 400 400

Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)

Failing at addr:2b1ecc

*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400

Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)

Failing at addr:2b12cc

*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 4 mandelbrot-mpi 100 400 400

maxiter = 100, width = 400, height = 400

execution time in seconds = 1.48

Taper q pour quitter le programme, autrement, on fait un refresh

q

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400

Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)

Failing at addr:2b12cc

*** End of error message ***



I also got the same behaviour on a different machine (with the exact 
same code base, $HOME is an NFS mount) and same

Re: [OMPI users] users Digest, Vol 318, Issue 1

2006-06-29 Thread openmpi-user

@Terry

I hope this is of some help (debugged with TotalView).

Enclosed you will find a graph from TotalView as well as this:
/Created process 2 (7633), named "mpirun"
Thread 2.1 has appeared
Thread 2.2 has appeared
Thread 2.1 received a signal (Segmentation Violation)/

and the stack trace:
/ _mca_pls_xgrid_set_node_name,  FP=b090
-[PlsXGridClient launchJob:],  FP=b100
_orte_pls_xgrid_launch,FP=b240
_orte_rmgr_urm_spawn,  FP=b290
orterun,   FP=b310
main,  FP=b3b0
_start,FP=b400/

and this (bold crashed):
/ 0x00257680: 0x805e0044  lwz   rtoc,68(r30)
0x00257684: 0x38000001  li    r0,1
*0x00257688: 0x90020010  stw   r0,16(rtoc)*
0x0025768c: 0x805e0044  lwz   rtoc,68(r30)
0x00257690: 0x38008000  li    r0,-32768/

from function /_mca_pls_xgrid_set_node_name/ in /mca_pls_xgrid.so/

Unfortunately I'm not yet familiar with TotalView, so let me know if you
would like more output (sorry: I haven't found dbx for Mac OS X, which is
why TotalView was used).


Yours,
Frank

users-requ...@open-mpi.org wrote:

--

Message: 2
List-Post: users@lists.open-mpi.org
Date: Wed, 28 Jun 2006 10:35:03 -0400
From: "Terry D. Dontje" 
Subject: [OMPI users] Re : OpenMPI 1.1: Signal:10,
info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a29397.2000...@sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Frank,

Can you set your limit coredumpsize to non-zero, rerun the program,
and then get the stack via dbx?

So, I have a similar case of BUS_ADRALN on SPARC systems with an older
version (June 21st) of the trunk.  I've since run using the latest trunk and
the bus error went away.  I am now going to try this out with v1.1 to see if
I get similar results.  Your stack would help me try to determine whether
this is an OpenMPI issue or possibly some type of platform problem.

There is another thread with Eric Thibodeau; I am unsure whether it is the
same issue as either of our situations.


--td



>
>Message: 3
>Date: Wed, 28 Jun 2006 14:30:12 +0200
>From: openmpi-user 
>Subject: Re: [OMPI users] OpenMPI 1.1: Signal:10
>info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D.
>Dontje)
>To: us...@open-mpi.org
>Message-ID: <44a27654.9060...@fraka-mp.de>
>Content-Type: text/plain; charset="iso-8859-1"
>
>Hi Terry,
>
>unfortunately I haven't got a stack trace.
>
>OS: Mac OS X 10.4.7 Server on the Xgrid-server and Mac OS X 10.4.7 
>Client on every node (G4 and G5). For testing-purposes I've installed 
>OpenMPI 1.1 on a Dual-G4-node and on a Dual-G5-node with my Xgrid 
>consisting of only either the Dual-G4- or the Dual-G5-node. No matter 
>which configuration, I ran into the bus error.

>
>Yours,
>Frank
>
>
>  
>
  




--



_mca_pls_xgrid_set_node_name.dot
Description: Binary data


Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)

2006-07-01 Thread openmpi-user

@ Terry (and All)!

Enclosed you'll find a (minor) bugfix for the BUS_ADRALN error I've
reported recently when submitting jobs to the Xgrid with OpenMPI 1.1.
The BUS_ADRALN error on SPARC systems might be caused by a similar code
segment. For the "bugfix" see line 55ff of the attached code file
pls_xgrid_client.m.


I haven't checked this yet, but it's very likely that the same code
segment causes the BUS_ADRALN error in the trunk tarballs when
submitting jobs to the Xgrid with those releases.


Hope this will help you, too, Eric.

Frank
/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 * of Tennessee Research Foundation.  All rights
 * reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
 * University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 * All rights reserved.
 * $COPYRIGHT$
 * 
 * Additional copyrights may follow
 * 
 * $HEADER$
 */

#import "orte_config.h"

#import 

#import "orte/mca/pls/base/base.h"
#import "orte/orte_constants.h"
#import "orte/mca/ns/ns.h"
#import "orte/mca/ras/base/ras_base_node.h"
#import "orte/mca/gpr/gpr.h"
#import "orte/mca/rml/rml.h"
#import "opal/util/path.h"

#import "pls_xgrid_client.h"

char **environ;

/**
 * Set the daemons name in the registry.
 */

static int
mca_pls_xgrid_set_node_name(orte_ras_node_t* node,
                            orte_jobid_t jobid,
                            orte_process_name_t* name)
{
    orte_gpr_value_t *values[1], *value;
    orte_gpr_keyval_t *kv;
    char* jobid_string;
    size_t i;
    int rc;

    values[0] = OBJ_NEW(orte_gpr_value_t);
    if (NULL == values[0]) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    //  BUS_ADRALN error in line value->cnt = 1, if value isn't assigned first
    value = values[0];
    value->cnt = 1;
    value->addr_mode = ORTE_GPR_OVERWRITE;
    value->segment = strdup(ORTE_NODE_SEGMENT);
    //  value = values[0];
    value->keyvals = (orte_gpr_keyval_t**)malloc(value->cnt *
                                                 sizeof(orte_gpr_keyval_t*));
    if (NULL == value->keyvals) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    value->keyvals[0] = OBJ_NEW(orte_gpr_keyval_t);
    if (NULL == value->keyvals[0]) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    kv = value->keyvals[0];

    if (ORTE_SUCCESS !=
        (rc = orte_ns.convert_jobid_to_string(&jobid_string, jobid))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        return rc;
    }

    if (ORTE_SUCCESS !=
        (rc = orte_schema.get_node_tokens(&(value->tokens),
                                          &(value->num_tokens),
                                          node->node_cellid, node->node_name))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        free(jobid_string);
        return rc;
    }

    asprintf(&(kv->key), "%s-%s", ORTE_NODE_BOOTPROXY_KEY, jobid_string);
    kv->value = OBJ_NEW(orte_data_value_t);
    if (NULL == kv->value) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    kv->value->type = ORTE_NAME;
    if (ORTE_SUCCESS != (rc = orte_dss.copy(&(kv->value->data), name,
                                            ORTE_NAME))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        return rc;
    }

    rc = orte_gpr.put(1, values);
    if (ORTE_SUCCESS != rc) {
        ORTE_ERROR_LOG(rc);
    }

    OBJ_RELEASE(value);

    return rc;
}


@implementation PlsXGridClient

/* init / finalize */
-(id) init
{
    return [self initWithControllerHostname: NULL
                      AndControllerPassword: NULL
                                   AndOrted: NULL
                                 AndCleanup: 1];
}

-(id) initWithControllerHostname: (char*) hostname
           AndControllerPassword: (char*) password
                        AndOrted: (char*) ortedname
                      AndCleanup: (int) val
{
    if (self = [super init]) {
        /* class-specific initialization goes here */
        OBJ_CONSTRUCT(&state_cond, opal_condition_t);
        OBJ_CONSTRUCT(&state_mutex, opal_mutex_t);

        if (NULL != password) {
            controller_password = [NSString stringWithCString: password];
        }
        if (NULL != hostname) {
            controller_hostname = [NSString stringWithCString: hostname];
        }
        cleanup = val;
        if

[OMPI users] OS X, OpenMPI 1.1: An error occurred in MPI_Allreduce on communicator MPI_COMM_WORLD

2006-07-02 Thread openmpi-user

Hi All,

when the nodes belong to different subnets, the following error messages
pop up:

[powerbook.2-net:20826] *** An error occurred in MPI_Allreduce
[powerbook.2-net:20826] *** on communicator MPI_COMM_WORLD
[powerbook.2-net:20826] *** MPI_ERR_INTERN: internal error
[powerbook.2-net:20826] *** MPI_ERRORS_ARE_FATAL (goodbye)

Here the hostfile sets up three nodes in two subnets (192.168.3.x and
192.168.2.x, with netmask 255.255.255.0). The 192.168.3.x nodes are
connected via Gigabit Ethernet, the 192.168.2.x nodes are connected via
WLAN.
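
A trivial collective over MPI_COMM_WORLD should help narrow down whether the
problem is in the TCP/collective path across the two subnets rather than in
vhone itself; this is a generic test sketch, not part of vhone:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Exercise MPI_Allreduce on MPI_COMM_WORLD, the call reported in the
     * error messages above. */
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);
    MPI_Finalize();
    return 0;
}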


Frank


This is the full output:

[powerbook:/Network/CFD/MVH-1.0] motte% mpirun -d -np 7 --hostfile 
./hostfile /Network/CFD/MVH-1.0/vhone

[powerbook.2-net:20821] procdir: (null)
[powerbook.2-net:20821] jobdir: (null)
[powerbook.2-net:20821] unidir: 
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe

[powerbook.2-net:20821] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:20821] tmp: /tmp
[powerbook.2-net:20821] connect_uni: contact info read
[powerbook.2-net:20821] connect_uni: connection not allowed
[powerbook.2-net:20821] [0,0,0] setting up session dir with
[powerbook.2-net:20821] tmpdir /tmp
[powerbook.2-net:20821] universe default-universe-20821
[powerbook.2-net:20821] user motte
[powerbook.2-net:20821] host powerbook.2-net
[powerbook.2-net:20821] jobid 0
[powerbook.2-net:20821] procid 0
[powerbook.2-net:20821] procdir: 
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/0/0
[powerbook.2-net:20821] jobdir: 
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/0
[powerbook.2-net:20821] unidir: 
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821

[powerbook.2-net:20821] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:20821] tmp: /tmp
[powerbook.2-net:20821] [0,0,0] contact_file 
/tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/universe-setup.txt

[powerbook.2-net:20821] [0,0,0] wrote setup file
[powerbook.2-net:20821] pls:rsh: local csh: 1, local bash: 0
[powerbook.2-net:20821] pls:rsh: assuming same remote shell as local shell
[powerbook.2-net:20821] pls:rsh: remote csh: 1, remote bash: 0
[powerbook.2-net:20821] pls:rsh: final template argv:
[powerbook.2-net:20821] pls:rsh: /usr/bin/ssh  orted 
--debug --bootproxy 1 --name  --num_procs 4 --vpid_start 0 
--nodename  --universe 
motte@powerbook.2-net:default-universe-20821 --nsreplica 
"0.0.0;tcp://192.168.2.3:54609" --gprreplica 
"0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0

[powerbook.2-net:20821] pls:rsh: launching on node Powerbook.2-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting 
mpi_yield_when_idle to 0

[powerbook.2-net:20821] pls:rsh: Powerbook.2-net is a LOCAL node
[powerbook.2-net:20821] pls:rsh: changing to directory /Users/motte
[powerbook.2-net:20821] pls:rsh: executing: orted --debug --bootproxy 1 
--name 0.0.1 --num_procs 4 --vpid_start 0 --nodename Powerbook.2-net 
--universe motte@powerbook.2-net:default-universe-20821 --nsreplica 
"0.0.0;tcp://192.168.2.3:54609" --gprreplica 
"0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0

[powerbook.2-net:20822] [0,0,1] setting up session dir with
[powerbook.2-net:20822] universe default-universe-20821
[powerbook.2-net:20822] user motte
[powerbook.2-net:20822] host Powerbook.2-net
[powerbook.2-net:20822] jobid 0
[powerbook.2-net:20822] procid 1
[powerbook.2-net:20822] procdir: 
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821/0/1
[powerbook.2-net:20822] jobdir: 
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821/0
[powerbook.2-net:20822] unidir: 
/tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821

[powerbook.2-net:20822] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:20822] tmp: /tmp
[powerbook.2-net:20821] pls:rsh: launching on node g4d003.3-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting 
mpi_yield_when_idle to 0

[powerbook.2-net:20821] pls:rsh: g4d003.3-net is a REMOTE node
[powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh g4d003.3-net 
orted --debug --bootproxy 1 --name 0.0.2 --num_procs 4 --vpid_start 0 
--nodename g4d003.3-net --universe 
motte@powerbook.2-net:default-universe-20821 --nsreplica 
"0.0.0;tcp://192.168.2.3:54609" --gprreplica 
"0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0

[powerbook.2-net:20821] pls:rsh: launching on node G5Dual.3-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting 
mpi_yield_when_idle to 0

[powerbook.2-net:20821] pls:rsh: G5Dual.3-net is a REMOTE node
[powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh G5Dual.3-net 
orted --debug --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0 
--nodename G5Dual.3-net --universe 
motte@powerbook.2-net:default-universe-20821 --nsreplica 
"0.0.0;tcp://192.168.2.3:5460

[OMPI users] Xgrid and Kerberos

2006-10-28 Thread openmpi-user
What's the proper setup for using Kerberos single sign-on with OpenMPI? For now
I've added the two environment variables XGRID_CONTROLLER_HOSTNAME and
XGRID_CONTROLLER_PASSWORD to submit jobs to the grid.

Yours,
Frank