[OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
Hello list, I hope you can help us out on this one, as we have been trying to figure it out for weeks. The situation: we have a program that can split into several processes distributed across the nodes of a cluster using Open MPI. We were running that system on "older" cluster hardware (Intel Core2 Duo based, 2 GB RAM) using an "older" kernel (2.6.18.6). All nodes are diskless and network-booted. Recently we upgraded the hardware (Intel i5, 8 GB RAM), which also required an upgrade to a recent kernel version (2.6.26+).

Here is the problem: we see an overall performance loss on the new hardware and believe we can narrow it down to a communication issue between the processes. We also found that the issue arises in the transition from kernel 2.6.23 to 2.6.24 (tested on the Core2 Duo system). Here is output from our program:

2.6.23.17 (64 bit), MPI 1.2.7, 5 iterations (Core2 Duo), 6 CPUs: 93.33 seconds per iteration.
Node 0 communication/computation time:   6.83 / 647.64 seconds.
Node 1 communication/computation time:  10.09 / 644.36 seconds.
Node 2 communication/computation time:   7.27 / 645.03 seconds.
Node 3 communication/computation time: 165.02 / 485.52 seconds.
Node 4 communication/computation time:   6.50 / 643.82 seconds.
Node 5 communication/computation time:   7.80 / 627.63 seconds.
Computation time: 897.00 seconds.

2.6.24.7 (64 bit), re-evaluated, MPI 1.2.7, 5 iterations (Core2 Duo), 6 CPUs: 131.33 seconds per iteration.
Node 0 communication/computation time: 364.15 / 645.24 seconds.
Node 1 communication/computation time: 362.83 / 645.26 seconds.
Node 2 communication/computation time: 349.39 / 645.07 seconds.
Node 3 communication/computation time: 508.34 / 485.53 seconds.
Node 4 communication/computation time: 349.94 / 643.81 seconds.
Node 5 communication/computation time: 349.07 / 627.47 seconds.
Computation time: 1251.00 seconds.

The program is 32-bit software, but it makes no difference whether the kernel is 64-bit or 32-bit. We also tested Open MPI 1.4.1, which cut communication times roughly in half (still far too high), but the improvement shrank with increasing kernel version.

The communication time is the time the master process spends distributing the data portions for calculation and collecting the results from the slave processes. The value also includes the time a slave has to wait to communicate with the master while the master is busy. This explains the longer communication time of node 3, whose computation time is reduced (due to the nature of its data). (A rough sketch of how this split is measured appears at the end of this message.)

The command to start the calculation:

mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np 4 -host cluster-18,cluster-19

Using top (with 'f' and 'j' to show the P column) we could track which process runs on which core. Processes stayed on their initial cores under kernel 2.6.23, but started to move between cores with 2.6.24. Using the --bind-to-core option of Open MPI 1.4.1 kept the processes on their cores again, but that did not change the overall outcome and did not fix the issue. top shows about 25% CPU wait time, and processes in state 'D', also on slave-only nodes. According to our programmer, communication only happens between the master process and its slaves, never among slaves. On kernel 2.6.23 and lower, CPU usage is 100% user time, with no wait or system percentage.
Example from top:

Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si, 0.0%st
Mem:  8181236k total, 131224k used, 8050012k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 49868k cached

 PID  USER  PR  NI  VIRT   RES  SHR   S  %CPU  %MEM  TIME+     P  COMMAND
 3386 oli   20   0  90512  20m  3988  R    74   0.3  12:31.80  0  invert-
 3387 oli   20   0  85072  15m  3780  D    67   0.2  11:59.30  1  invert-
 3388 oli   20   0  85064  14m  3588  D    77   0.2  12:56.90  2  invert-
 3389 oli   20   0  84936  14m  3436  R    85   0.2  13:28.30  3  invert-

Some system information that might be helpful:

Node hardware:
1. "older": Intel Core2 Duo, (2x1) GB RAM
2. "newer": Intel(R) Core(TM) i5 CPU, mainboard ASUS RS100-E6, (4x2) GB RAM

Debian stable (lenny) distribution with
ii libc6           2.7-18lenny2
ii libopenmpi1     1.2.7~rc2-2
ii openmpi-bin     1.2.7~rc2-2
ii openmpi-common  1.2.7~rc2-2

Nodes are booting diskless with nfs-root and a kernel with all drivers needed compiled in.

Information on the program using Open MPI and the tools used to compile it:
mpirun --version: mpirun (Open MPI) 1.2.7rc2
libopenmpi-dev 1.2.7~rc2-2 depends on: libc6 (2.7-18lenny2), libopenmpi1 (1.2.7~rc2-2), openmpi-common (1.2.7~rc2-2)
Compilation command: mpif90
FORTRAN compiler (FC): gfortran --version: GNU Fortran
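For clarity, the communication/computation split reported above is measured roughly along the following lines. This is only a minimal illustrative sketch in C (the real program is Fortran and more involved; the function and variable names here are invented): time spent inside MPI send/receive calls, including waiting for the master, counts as communication, everything else as computation.

#include <mpi.h>
#include <stdio.h>

/* Illustrative only: "do_local_work" stands in for the real computation
 * and is not a function from our actual code. */
static void do_local_work(double *chunk, int n) { (void)chunk; (void)n; }

void one_iteration(double *chunk, int n, int rank, int nprocs)
{
    double t_comm = 0.0, t_comp = 0.0, t0;

    if (rank == 0) {
        /* Simplified: here the master only distributes data portions and
         * collects results; all of this counts as communication time. */
        t0 = MPI_Wtime();
        for (int p = 1; p < nprocs; p++)
            MPI_Send(chunk, n, MPI_DOUBLE, p, 0, MPI_COMM_WORLD);
        for (int p = 1; p < nprocs; p++)
            MPI_Recv(chunk, n, MPI_DOUBLE, p, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        t_comm += MPI_Wtime() - t0;
    } else {
        /* Slave: time spent waiting for the (possibly busy) master also
         * counts toward communication time. */
        t0 = MPI_Wtime();
        MPI_Recv(chunk, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        t_comm += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        do_local_work(chunk, n);
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Send(chunk, n, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;
    }

    printf("rank %d: communication %.2f s, computation %.2f s\n",
           rank, t_comm, t_comp);
}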
Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)
Hi Terry,

unfortunately I haven't got a stack trace.

OS: Mac OS X 10.4.7 Server on the Xgrid server and Mac OS X 10.4.7 Client on every node (G4 and G5). For testing purposes I've installed OpenMPI 1.1 on a dual-G4 node and on a dual-G5 node, with my Xgrid consisting of only either the dual-G4 or the dual-G5 node. No matter which configuration, I ran into the bus error.

Yours, Frank

users-requ...@open-mpi.org wrote:

Today's Topics:
1. Re: OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN) (Terry D. Dontje)
2. Re: OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0), si_code:1(BUS_ADRALN) (Frank)

--
Message: 1
Date: Wed, 28 Jun 2006 07:26:46 -0400
From: "Terry D. Dontje"
Subject: Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a26776.2000...@sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Frank,

Do you possibly have a stack trace of the program that failed? Also, what OS and platform are you running on?

--td

--
Message: 1
Date: Wed, 28 Jun 2006 12:53:14 +0200
From: Frank
Subject: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a25f9a.3070...@fraka-mp.de>
Content-Type: text/plain; charset="iso-8859-1"

Hi! I've recently updated to OpenMPI 1.1 on a few nodes and am running into a problem that wasn't there with OpenMPI 1.0.2.
Submitting a job to the XGrid with OpenMPI 1.1 yields a bus error that isn't there when not submitting the job to the XGrid:

[g5dual:/Network/CFD/MVH-1.0] motte% mpirun -d -np 2 ./vhone
[g5dual.3-net:08794] [0,0,0] setting up session dir with
[g5dual.3-net:08794] universe default-universe
[g5dual.3-net:08794] user motte
[g5dual.3-net:08794] host g5dual.3-net
[g5dual.3-net:08794] jobid 0
[g5dual.3-net:08794] procid 0
[g5dual.3-net:08794] procdir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/0/0
[g5dual.3-net:08794] jobdir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/0
[g5dual.3-net:08794] unidir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe
[g5dual.3-net:08794] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08794] tmp: /tmp
[g5dual.3-net:08794] [0,0,0] contact_file /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe/universe-setup.txt
[g5dual.3-net:08794] [0,0,0] wrote setup file
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0x10
*** End of error message ***
Bus error
[g5dual:/Network/CFD/MVH-1.0] motte%

When not xgrid-submitting the job with OpenMPI 1.1, everything is just fine:

[g5dual:/Network/CFD/MVH-1.0] motte% mpirun -d -np 2 ./vhone
[g5dual.3-net:08957] procdir: (null)
[g5dual.3-net:08957] jobdir: (null)
[g5dual.3-net:08957] unidir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe
[g5dual.3-net:08957] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08957] tmp: /tmp
[g5dual.3-net:08957] connect_uni: contact info read
[g5dual.3-net:08957] connect_uni: connection not allowed
[g5dual.3-net:08957] [0,0,0] setting up session dir with
[g5dual.3-net:08957] tmpdir /tmp
[g5dual.3-net:08957] universe default-universe-8957
[g5dual.3-net:08957] user motte
[g5dual.3-net:08957] host g5dual.3-net
[g5dual.3-net:08957] jobid 0
[g5dual.3-net:08957] procid 0
[g5dual.3-net:08957] procdir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/0/0
[g5dual.3-net:08957] jobdir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/0
[g5dual.3-net:08957] unidir: /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957
[g5dual.3-net:08957] top: openmpi-sessions-motte@g5dual.3-net_0
[g5dual.3-net:08957] tmp: /tmp
[g5dual.3-net:08957] [0,0,0] contact_file /tmp/openmpi-sessions-motte@g5dual.3-net_0/default-universe-8957/universe-setup.txt
[g5dual.3-net:08957] [0,0,0] wrote setup file
[g5dual.3-net:08957] pls:rsh: local csh: 1, local bash: 0
[g5dual.3-net:08957] pls:rsh: assuming same remote shell as local shell
[g5dual.3-net:08957] pls:rsh: remote csh: 1, remote bash: 0
[g5dual.3-net:08957] pls:rsh: final template argv:
[g5dual.3-net:08
Re: [OMPI users] users Digest, Vol 317, Issue 4
Hi Eric (and all),

I don't know if this is really what messes things up, but you have LAM/MPI in your path variables, too:

[enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:*/usr/local/lam-mpi/7.1.1/lib*:/opt/sfw/lib

Yours, Frank

users-requ...@open-mpi.org wrote:

Today's Topics:
1. Re: Installing OpenMPI on a solaris (Jeff Squyres (jsquyres))

--
Message: 1
Date: Wed, 28 Jun 2006 08:56:36 -0400
From: "Jeff Squyres (jsquyres)"
Subject: Re: [OMPI users] Installing OpenMPI on a solaris
To: "Open MPI Users"
Message-ID:
Content-Type: text/plain; charset="iso-8859-1"

Bummer! :-(

Just to be sure -- you had a clean config.cache file before you ran configure, right? (i.e., the file didn't exist, just to be sure it didn't pick up potentially erroneous values from a previous run of configure.) Also, FWIW, it's not necessary to specify --enable-ltdl-convenience; that should be automatic.

If you had a clean configure, we *suspect* that this might be due to alignment issues on Solaris 64-bit platforms, but we thought we had a pretty good handle on it in 1.1. Obviously we didn't solve everything. Bonk. Did you get a corefile, perchance? If you could send a stack trace, that would be most helpful.

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Eric Thibodeau
Sent: Tuesday, June 20, 2006 8:36 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Installing OpenMPI on a solaris

Hello Brian (and all),

Well, the joy was short-lived. On a 12-CPU Enterprise machine and on a 4-CPU one, I seem to be able to start up to 4 processes. Above 4, I seem to inevitably get BUS_ADRALN (invalid address alignment). Below are some traces of the failing runs as well as a detailed (mpirun -d) log of one of these situations and the ompi_info output. Obviously, don't hesitate to ask if more information is required.
Build version: openmpi-1.1b5r10421

Config parameters: Open MPI config.status 1.1b5 configured by ./configure, generated by GNU Autoconf 2.59, with options "'--cache-file=config.cache' 'CFLAGS=-mcpu=v9' 'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9' '--prefix=/export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u' --enable-ltdl-convenience"

The traces:

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2f4f04
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b354c
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 6 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b1ecc
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 4 mandelbrot-mpi 100 400 400
maxiter = 100, width = 400, height = 400
execution time in seconds = 1.48
Taper q pour quitter le programme, autrement, on fait un refresh
q

sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ ~/openmpi_sun4u/bin/mpirun -np 5 mandelbrot-mpi 100 400 400
Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
Failing at addr:2b12cc
*** End of error message ***

I also got the same behaviour on a different machine (with the exact same code base; $HOME is an NFS mount) and same
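As background on the BUS_ADRALN code that keeps coming up in this thread: on SPARC (and some other strict-alignment platforms) a load or store through a pointer that is not properly aligned for its type raises SIGBUS with si_code BUS_ADRALN. The following is only a generic, stand-alone C illustration of that error class (my own example, not the Open MPI code path in question):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* A char buffer only guarantees byte alignment. */
    char buf[16];
    memset(buf, 0, sizeof(buf));

    /* Dereferencing an int* that points to an odd address is undefined
     * behaviour; on SPARC it typically terminates the process with
     * SIGBUS / BUS_ADRALN, while x86 usually tolerates the misaligned
     * access silently. */
    int *misaligned = (int *)(buf + 1);
    printf("%d\n", *misaligned);

    return 0;
}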
Re: [OMPI users] users Digest, Vol 318, Issue 1
@Terry

I hope this is of any help (debugged with TotalView). Enclosed you will find a graph from TotalView as well as this:

Created process 2 (7633), named "mpirun"
Thread 2.1 has appeared
Thread 2.2 has appeared
Thread 2.1 received a signal (Segmentation Violation)

and the stack trace:

_mca_pls_xgrid_set_node_name,  FP=b090
-[PlsXGridClient launchJob:],  FP=b100
_orte_pls_xgrid_launch,        FP=b240
_orte_rmgr_urm_spawn,          FP=b290
orterun,                       FP=b310
main,                          FP=b3b0
_start,                        FP=b400

and this (the starred instruction crashed):

 0x00257680: 0x805e0044  lwz  rtoc,68(r30)
 0x00257684: 0x3801      li   r0,1
*0x00257688: 0x90020010  stw  r0,16(rtoc)*
 0x0025768c: 0x805e0044  lwz  rtoc,68(r30)
 0x00257690: 0x38008000  li   r0,-32768

from function _mca_pls_xgrid_set_node_name in mca_pls_xgrid.so.

Unfortunately I'm not yet familiar with TotalView, so let me know if you would like more output. (Sorry, I haven't found dbx for Mac OS X -- that's why TotalView was used.)

Yours, Frank

users-requ...@open-mpi.org wrote:

--
Message: 2
Date: Wed, 28 Jun 2006 10:35:03 -0400
From: "Terry D. Dontje"
Subject: [OMPI users] Re: OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown error: 0), si_code:1(BUS_ADRALN)
To: us...@open-mpi.org
Message-ID: <44a29397.2000...@sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Frank,

Can you set your coredumpsize limit to non-zero, rerun the program, and then get the stack via dbx?

I have a similar case of BUS_ADRALN on SPARC systems with an older version (June 21st) of the trunk. I've since run using the latest trunk and the bus error went away. I am now going to try this out with v1.1 to see if I get similar results. Your stack would help me determine whether this is an Open MPI issue or possibly some type of platform problem. There is another thread with Eric Thibodeau; I am unsure whether it is the same issue as either of our situations.

--td

> Message: 3
> Date: Wed, 28 Jun 2006 14:30:12 +0200
> From: openmpi-user
> Subject: Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)
> To: us...@open-mpi.org
> Message-ID: <44a27654.9060...@fraka-mp.de>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi Terry,
>
> unfortunately I haven't got a stack trace.
>
> OS: Mac OS X 10.4.7 Server on the Xgrid server and Mac OS X 10.4.7 Client on every node (G4 and G5). For testing purposes I've installed OpenMPI 1.1 on a dual-G4 node and on a dual-G5 node, with my Xgrid consisting of only either the dual-G4 or the dual-G5 node. No matter which configuration, I ran into the bus error.
>
> Yours,
> Frank

--
_mca_pls_xgrid_set_node_name.dot
Description: Binary data
Re: [OMPI users] OpenMPI 1.1: Signal:10 info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN) (Terry D. Dontje)
@ Terry (and all)!

Enclosed you'll find a (minor) bugfix with respect to the BUS_ADRALN error I've reported recently when submitting jobs to the XGrid with OpenMPI 1.1. The BUS_ADRALN error on SPARC systems might be caused by a similar code segment. For the "bugfix" see line 55ff of the attached code file pls_xgrid_client.m. I haven't checked this yet, but it's very likely that the same code segment causes the BUS_ADRALN error in the trunk tarballs when submitting jobs with XGrid with those releases.

Hope this will help you, too, Eric.

Frank

/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#import "orte_config.h"

#import

#import "orte/mca/pls/base/base.h"
#import "orte/orte_constants.h"
#import "orte/mca/ns/ns.h"
#import "orte/mca/ras/base/ras_base_node.h"
#import "orte/mca/gpr/gpr.h"
#import "orte/mca/rml/rml.h"
#import "opal/util/path.h"

#import "pls_xgrid_client.h"

char **environ;

/**
 * Set the daemon's name in the registry.
 */
static int
mca_pls_xgrid_set_node_name(orte_ras_node_t* node,
                            orte_jobid_t jobid,
                            orte_process_name_t* name)
{
    orte_gpr_value_t *values[1], *value;
    orte_gpr_keyval_t *kv;
    char* jobid_string;
    size_t i;
    int rc;

    values[0] = OBJ_NEW(orte_gpr_value_t);
    if (NULL == values[0]) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }

    // BUS_ADRALN error in the line "value->cnt = 1" if value isn't assigned first
    value = values[0];

    value->cnt = 1;
    value->addr_mode = ORTE_GPR_OVERWRITE;
    value->segment = strdup(ORTE_NODE_SEGMENT);

    // value = values[0];
    value->keyvals = (orte_gpr_keyval_t**)malloc(value->cnt * sizeof(orte_gpr_keyval_t*));
    if (NULL == value->keyvals) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    value->keyvals[0] = OBJ_NEW(orte_gpr_keyval_t);
    if (NULL == value->keyvals[0]) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    kv = value->keyvals[0];

    if (ORTE_SUCCESS != (rc = orte_ns.convert_jobid_to_string(&jobid_string, jobid))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        return rc;
    }

    if (ORTE_SUCCESS != (rc = orte_schema.get_node_tokens(&(value->tokens),
                                                          &(value->num_tokens),
                                                          node->node_cellid,
                                                          node->node_name))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        free(jobid_string);
        return rc;
    }

    asprintf(&(kv->key), "%s-%s", ORTE_NODE_BOOTPROXY_KEY, jobid_string);
    kv->value = OBJ_NEW(orte_data_value_t);
    if (NULL == kv->value) {
        ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
        OBJ_RELEASE(value);
        return ORTE_ERR_OUT_OF_RESOURCE;
    }
    kv->value->type = ORTE_NAME;
    if (ORTE_SUCCESS != (rc = orte_dss.copy(&(kv->value->data), name, ORTE_NAME))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(value);
        return rc;
    }

    rc = orte_gpr.put(1, values);
    if (ORTE_SUCCESS != rc) {
        ORTE_ERROR_LOG(rc);
    }

    OBJ_RELEASE(value);

    return rc;
}


@implementation PlsXGridClient

/* init / finalize */
-(id) init
{
    return [self initWithControllerHostname: NULL
                      AndControllerPassword: NULL
                                   AndOrted: NULL
                                 AndCleanup: 1];
}

-(id) initWithControllerHostname: (char*) hostname
           AndControllerPassword: (char*) password
                        AndOrted: (char*) ortedname
                      AndCleanup: (int) val
{
    if (self = [super init]) {
        /* class-specific initialization goes here */
        OBJ_CONSTRUCT(&state_cond, opal_condition_t);
        OBJ_CONSTRUCT(&state_mutex, opal_mutex_t);

        if (NULL != password) {
            controller_password = [NSString stringWithCString: password];
        }
        if (NULL != hostname) {
            controller_hostname = [NSString stringWithCString: hostname];
        }
        cleanup = val;
        if
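For readers who don't want to dig through the attachment: the crash pattern this fix addresses is an ordinary use-before-assignment of a local pointer. A stripped-down C sketch of the before/after shapes (illustrative only, with invented names and no error handling; the real change is in mca_pls_xgrid_set_node_name above):

#include <stdlib.h>

struct gpr_value { int cnt; };

/* Buggy shape: 'value' is read before it is assigned, so the write goes
 * through an uninitialized pointer (typically a crash such as SIGBUS or
 * SIGSEGV, depending on what garbage the pointer holds). */
static void broken(struct gpr_value *values[1])
{
    struct gpr_value *value;           /* uninitialized */
    values[0] = malloc(sizeof(**values));
    value->cnt = 1;                    /* BUG: dereferences garbage */
    value = values[0];                 /* assignment happens too late */
}

/* Fixed shape, as in the patch: assign first, then use. */
static void fixed(struct gpr_value *values[1])
{
    struct gpr_value *value;
    values[0] = malloc(sizeof(**values));
    value = values[0];                 /* assign before any use */
    value->cnt = 1;
}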
[OMPI users] OS X, OpenMPI 1.1: An error occurred in MPI_Allreduce on communicator MPI_COMM_WORLD
Hi all,

when the nodes belong to different subnets, the following error messages pop up:

[powerbook.2-net:20826] *** An error occurred in MPI_Allreduce
[powerbook.2-net:20826] *** on communicator MPI_COMM_WORLD
[powerbook.2-net:20826] *** MPI_ERR_INTERN: internal error
[powerbook.2-net:20826] *** MPI_ERRORS_ARE_FATAL (goodbye)

Here the hostfile sets up three nodes in two subnets (192.168.3.x and 192.168.2.x with mask 255.255.255.0). The 192.168.3.x nodes are connected via Gigabit Ethernet, the 192.168.2.x nodes are connected via WLAN.

Frank

This is the full output:

[powerbook:/Network/CFD/MVH-1.0] motte% mpirun -d -np 7 --hostfile ./hostfile /Network/CFD/MVH-1.0/vhone
[powerbook.2-net:20821] procdir: (null)
[powerbook.2-net:20821] jobdir: (null)
[powerbook.2-net:20821] unidir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe
[powerbook.2-net:20821] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:20821] tmp: /tmp
[powerbook.2-net:20821] connect_uni: contact info read
[powerbook.2-net:20821] connect_uni: connection not allowed
[powerbook.2-net:20821] [0,0,0] setting up session dir with
[powerbook.2-net:20821] tmpdir /tmp
[powerbook.2-net:20821] universe default-universe-20821
[powerbook.2-net:20821] user motte
[powerbook.2-net:20821] host powerbook.2-net
[powerbook.2-net:20821] jobid 0
[powerbook.2-net:20821] procid 0
[powerbook.2-net:20821] procdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/0/0
[powerbook.2-net:20821] jobdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/0
[powerbook.2-net:20821] unidir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821
[powerbook.2-net:20821] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:20821] tmp: /tmp
[powerbook.2-net:20821] [0,0,0] contact_file /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe-20821/universe-setup.txt
[powerbook.2-net:20821] [0,0,0] wrote setup file
[powerbook.2-net:20821] pls:rsh: local csh: 1, local bash: 0
[powerbook.2-net:20821] pls:rsh: assuming same remote shell as local shell
[powerbook.2-net:20821] pls:rsh: remote csh: 1, remote bash: 0
[powerbook.2-net:20821] pls:rsh: final template argv:
[powerbook.2-net:20821] pls:rsh: /usr/bin/ssh  orted --debug --bootproxy 1 --name  --num_procs 4 --vpid_start 0 --nodename  --universe motte@powerbook.2-net:default-universe-20821 --nsreplica "0.0.0;tcp://192.168.2.3:54609" --gprreplica "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
[powerbook.2-net:20821] pls:rsh: launching on node Powerbook.2-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:20821] pls:rsh: Powerbook.2-net is a LOCAL node
[powerbook.2-net:20821] pls:rsh: changing to directory /Users/motte
[powerbook.2-net:20821] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 4 --vpid_start 0 --nodename Powerbook.2-net --universe motte@powerbook.2-net:default-universe-20821 --nsreplica "0.0.0;tcp://192.168.2.3:54609" --gprreplica "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
[powerbook.2-net:20822] [0,0,1] setting up session dir with
[powerbook.2-net:20822] universe default-universe-20821
[powerbook.2-net:20822] user motte
[powerbook.2-net:20822] host Powerbook.2-net
[powerbook.2-net:20822] jobid 0
[powerbook.2-net:20822] procid 1
[powerbook.2-net:20822] procdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821/0/1
[powerbook.2-net:20822] jobdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821/0
[powerbook.2-net:20822] unidir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe-20821
[powerbook.2-net:20822] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:20822] tmp: /tmp
[powerbook.2-net:20821] pls:rsh: launching on node g4d003.3-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:20821] pls:rsh: g4d003.3-net is a REMOTE node
[powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh g4d003.3-net orted --debug --bootproxy 1 --name 0.0.2 --num_procs 4 --vpid_start 0 --nodename g4d003.3-net --universe motte@powerbook.2-net:default-universe-20821 --nsreplica "0.0.0;tcp://192.168.2.3:54609" --gprreplica "0.0.0;tcp://192.168.2.3:54609" --mpi-call-yield 0
[powerbook.2-net:20821] pls:rsh: launching on node G5Dual.3-net
[powerbook.2-net:20821] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:20821] pls:rsh: G5Dual.3-net is a REMOTE node
[powerbook.2-net:20821] pls:rsh: executing: /usr/bin/ssh G5Dual.3-net orted --debug --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0 --nodename G5Dual.3-net --universe motte@powerbook.2-net:default-universe-20821 --nsreplica "0.0.0;tcp://192.168.2.3:5460
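For anyone trying to reproduce this outside of vhone: the failing call is an ordinary MPI_Allreduce over MPI_COMM_WORLD, so a minimal C test of the same collective (my own sketch, not part of vhone) run over the same mixed-subnet hostfile would show whether the plain collective already fails:

#include <mpi.h>
#include <stdio.h>

/* Minimal allreduce test: every rank contributes its rank number and all
 * ranks should receive the sum 0 + 1 + ... + (size-1). */
int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: sum of ranks = %d\n", rank, size, sum);

    MPI_Finalize();
    return 0;
}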
[OMPI users] Xgrid and Kerberos
What's the proper setup for using Kerberos single sign-on with OpenMPI? For now I've set the two environment variables XGRID_CONTROLLER_HOSTNAME and XGRID_CONTROLLER_PASSWORD to submit jobs to the grid. Yours, Frank