Re: [OMPI users] error in running openmpi on remote node
Thank you very much. This problem is solved now that I changed the shell of the remote node to bash: I had set LD_LIBRARY_PATH in the .bashrc file, but the default shell was the C shell. Although my test program test.x now works, some errors occur when I run another program. BTW, I can run this program on a single PC with 2 processes successfully. Any suggestions? Thank you.

[say@wolf45 tmp]$ mpirun -np 2 --host wolf45,wolf46 /usr/local/amber9/exe/sander.MPI -O -i /tmp/amber9mintest.in -o /tmp/amber9mintest.out -c /tmp/amber9mintest.inpcrd -p /tmp/amber9mintest.prmtop -r /tmp/amber9mintest.rst
[wolf46.chem.cuhk.edu.hk:06002] *** An error occurred in MPI_Barrier
[wolf46.chem.cuhk.edu.hk:06002] *** on communicator MPI_COMM_WORLD
[wolf46.chem.cuhk.edu.hk:06002] *** MPI_ERR_INTERN: internal error
[wolf46.chem.cuhk.edu.hk:06002] *** MPI_ERRORS_ARE_FATAL (goodbye)
1 process killed (possibly by Open MPI)

On 7/4/06, Brian Barrett wrote:

On Jul 4, 2006, at 1:53 AM, Chengwen Chen wrote:

> Dear openmpi users,
>
> I am using openmpi-1.0.2 on Red Hat Linux. I can successfully run
> mpirun on a single PC with 2 processes, but it fails on a remote node.
> Can you give me some advice? Thank you very much in advance.
>
> [say@wolf45 tmp]$ mpirun -np 2 /tmp/test.x
>
> [say@wolf45 tmp]$ mpirun -np 2 --host wolf45,wolf46 /tmp/test.x
> say@wolf46's password:
> orted: Command not found.
> [wolf45:11357] ERROR: A daemon on node wolf46 failed to start as
> expected.
> [wolf45:11357] ERROR: There may be more information available from
> [wolf45:11357] ERROR: the remote shell (see above).
> [wolf45:11357] ERROR: The daemon exited unexpectedly with status 1.

Kefeng is correct that you should set up your ssh keys so that you aren't prompted for a password, but that isn't the cause of your failure. The problem appears to be that orted (one of the Open MPI commands) is not in your path on the remote node. You should take a look at one of the other FAQ sections on the setup required for Open MPI in an rsh/ssh type environment:

http://www.open-mpi.org/faq/?category=running

Hope this helps,

Brian

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] open-mpi on MacOS X
> -----Original Message-----
> From: users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] On Behalf Of Jack Howarth
> Sent: Monday, July 03, 2006 10:35 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] open-mpi on MacOS X
>
> I have created simple fink (http://fink.sourceforge.net) packaging
> for open-mpi v1.1 on MacOS X. The packaging builds open-mpi with its
> default settings in configure and appears to pass all of its make check
> without problems.

Thanks!

> However, the lack of clear documentation for open-mpi still is a problem.

Agreed. This is something that we're actively working on. In the meantime, feel free to send your questions to this list.

> I seem able to manually run the test programs from the open-mpi
> distribution using...
>
> mdrun -np 2 ...

Just to clarify -- what is mdrun? Do you mean mpirun? Open MPI does not provide an executable named "mdrun".

> after starting the orted daemon with
>
> orted --seed --persistent --scope public

Per Brock's comments, you don't need to start the orted manually. Indeed, this model is only loosely tested -- it has known problems with not releasing all resources at the end of each mpirun (e.g., the memory footprint of that orted will keep growing over time). See below.

> I can see both cpus spike when I do the mdrun's, so I think that works.
> However, I can't figure out the proper way to monitor the status of the
> available nodes. Specifically, what is the equivalent to the lamnodes
> program in open-mpi?

Right now, Open MPI does not have a "persistent universe" model like LAM's (e.g., lamboot over a bunch of nodes). orteds are launched behind the scenes for each job on each node (e.g., in the rsh/ssh case, we rsh/ssh to each node once, launch an orted, and then the orted launches as many user processes as necessary).

However, equivalent to LAM, Open MPI can use the back-end scheduler/resource manager to know which nodes to launch on. Even with lamboot, you had to specify a hostfile or have a back-end resource manager that said "use these nodes." lamnodes was not really a monitoring tool -- it was more of a "here's the nodes that you specified to me earlier" tool.

If you really want monitoring tools for your nodes, you might want to look outside of MPI -- SLURM and Torque are fairly common open source resource managers. And there's a bunch of tools available for monitoring nodes in a cluster, too.

> Also, is there a simple test program that runs for a significant
> period of time that I can use to test the different options to
> monitor and control the open-mpi jobs that are running under
> orted? Thanks in advance for any clarifications.

Open MPI's run-time options are [essentially] read at startup and used for the duration of the job's run. Most of the options are not changeable after a given run has started.

We have not yet included any sample apps inside Open MPI (a la LAM), but we'll likely include some simple "hello world" and other well-known sample MPI apps in the future. For long-running tests, you might want to run any of the MPI benchmark suites available (e.g., NetPIPE, the Intel benchmarks, HPL, etc.).

> Jack
> ps I assume that at v1.1, open-mpi is considered to be a usable
> replacement for lam? Certainly, gromacs 3.3.1 seems to compile
> its mpi support against open-mpi.

Yes. There are still some features in LAM that are not yet in Open MPI (e.g., a persistent universe), but most of the good/important ones are being added to Open MPI over time.
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
[OMPI users] MPI_Recv: is it possible to switch aggressive mode on/off during runtime?
Dear open-mpi users,

I saw almost the same question as mine some posts ago, but it didn't give me a satisfactory answer. I have a setup like this:

- a GUI program on some machine (e.g. a laptop)
- a Head process listening on a TCP/IP socket for commands from the GUI
- Workers waiting for commands from the Head / processing the data

And now the problematic part. For receiving the commands from the Head I'm using:

    while (true) {
        MPI_Recv(...);
        /* do whatever the Head said (process a small portion of the data,
           return the result to the Head, wait for another command) */
    }

So in the idle time the workers are stuck in MPI_Recv and have 100% CPU usage, even if they are just waiting for commands from the Head. Normally I would prefer not to have this situation, as I sometimes have to share the cluster with others. I would prefer not to stop the whole MPI program, but just to go into an 'idle' mode, and thus make it run again soon. I would also like to have this aggressive MPI_Recv approach switched on when I'm alone on the cluster. So is it possible to switch this mode on/off during runtime somehow?

Thank you in advance!

greetings,
Marcin
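[Editor's sketch] As Jeff notes elsewhere in this digest, Open MPI's run-time options (such as mpi_yield_when_idle) are essentially read at startup, so a common application-level workaround is to poll politely instead of blocking in MPI_Recv: check for a pending message with MPI_Iprobe and sleep briefly while nothing is waiting. The sketch below assumes the Head is rank 0 and invents a command tag and a shutdown code; none of these names come from Marcin's program.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>   /* usleep */

    #define TAG_CMD   1   /* hypothetical command tag      */
    #define CMD_STOP -1   /* hypothetical shutdown command */

    int main(int argc, char *argv[])
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Head: here it just tells every worker to stop. */
            int size, i, cmd = CMD_STOP;
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            for (i = 1; i < size; i++)
                MPI_Send(&cmd, 1, MPI_INT, i, TAG_CMD, MPI_COMM_WORLD);
        } else {
            /* Worker: poll instead of spinning inside MPI_Recv. */
            for (;;) {
                int flag = 0, cmd;
                MPI_Status status;
                MPI_Iprobe(0, TAG_CMD, MPI_COMM_WORLD, &flag, &status);
                if (!flag) {
                    usleep(10000);   /* back off ~10 ms while idle */
                    continue;
                }
                MPI_Recv(&cmd, 1, MPI_INT, 0, TAG_CMD, MPI_COMM_WORLD, &status);
                if (cmd == CMD_STOP)
                    break;
                /* ... process a small portion of data, send the result back ... */
            }
        }

        MPI_Finalize();
        return 0;
    }

The sleep interval is the on/off knob Marcin asks about: set it to zero (or skip the MPI_Iprobe path) when alone on the cluster, and to a few milliseconds when sharing it.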
Re: [OMPI users] Datatype bug regression from Open MPI 1.0.2 to Open MPI 1.1
On Sat, 2006-07-01 at 00:25 +0200, Yvan Fournier wrote:
> Hello,
>
> I had encountered a bug in Open MPI 1.0.1 using indexed datatypes
> with MPI_Recv (which seems to be of the "off by one" sort), which
> was corrected in Open MPI 1.0.2.
>
> It seems to have resurfaced in Open MPI 1.1 (I encountered it using
> different data and did not recognize it immediately, but it seems
> it can be reproduced using the same simplified test I had sent
> the first time, which I re-attach here just in case).

Thank you for the bug report. It's going to take us a little while to track the issue down. I've filed a bug in our bug tracker (you should receive e-mails when the ticket is updated), and someone familiar with the datatype engine will take a look as soon as possible.

Brian
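[Editor's sketch] The attachment is not reproduced in this digest, so here is a minimal illustration of the kind of pattern being reported: an MPI_Type_indexed datatype used on both the send and the receive side. The block lengths and displacements below are illustrative only; they are not Yvan's actual test case. Run with at least two processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double buf[10] = {0};
        int blocklens[3] = {2, 1, 3};   /* illustrative block lengths  */
        int displs[3]    = {0, 4, 6};   /* illustrative displacements  */
        MPI_Datatype idx;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Build and commit a non-contiguous (indexed) datatype. */
        MPI_Type_indexed(3, blocklens, displs, MPI_DOUBLE, &idx);
        MPI_Type_commit(&idx);

        if (rank == 0) {
            for (i = 0; i < 10; i++) buf[i] = i;
            MPI_Send(buf, 1, idx, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 1, idx, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* An "off by one" unpacking bug would show up as values landing
               one slot away from the displacements requested above. */
            for (i = 0; i < 10; i++) printf("buf[%d] = %g\n", i, buf[i]);
        }

        MPI_Type_free(&idx);
        MPI_Finalize();
        return 0;
    }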
Re: [OMPI users] OS X, OpenMPI 1.1: An error occurred in MPI_Allreduce on communicator MPI_COMM_WORLD (Jeff Squyres (jsquyres))
users-requ...@open-mpi.org wrote:
> A few clarifying questions:
>
> What is your netmask on these hosts?
> Where is the MPI_ALLREDUCE in your app -- right away, or somewhere deep within the application?
> Can you replicate this with a simple MPI application that essentially calls MPI_INIT, MPI_ALLREDUCE, and MPI_FINALIZE?
> Can you replicate this with a simple MPI app that does an MPI_SEND / MPI_RECV between two processes on the different subnets?
>
> Thanks.

@ Jeff,

netmask 255.255.255.0

Running a simple "hello world" yields no error on each subnet, but running "hello world" on both subnets yields the error:

[g5dual.3-net:00436] *** An error occurred in MPI_Send
[g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
[g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
[g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)

Hope this helps!
Frank

Just in case you wanna check the source:

c     Fortran example hello_world
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      character*12 message

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100
      if (rank .eq. 0) then
         message = 'Hello, world'
         do i=1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif
      print*, 'node', rank, ':', message
      call MPI_FINALIZE(ierror)
      end

or the full output:

[powerbook:/Network/CFD/hello] motte% mpirun -d -np 5 --hostfile ./hostfile /Network/CFD/hello/hello_world
[powerbook.2-net:00606] [0,0,0] setting up session dir with
[powerbook.2-net:00606] universe default-universe
[powerbook.2-net:00606] user motte
[powerbook.2-net:00606] host powerbook.2-net
[powerbook.2-net:00606] jobid 0
[powerbook.2-net:00606] procid 0
[powerbook.2-net:00606] procdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0/0
[powerbook.2-net:00606] jobdir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/0
[powerbook.2-net:00606] unidir: /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe
[powerbook.2-net:00606] top: openmpi-sessions-motte@powerbook.2-net_0
[powerbook.2-net:00606] tmp: /tmp
[powerbook.2-net:00606] [0,0,0] contact_file /tmp/openmpi-sessions-motte@powerbook.2-net_0/default-universe/universe-setup.txt
[powerbook.2-net:00606] [0,0,0] wrote setup file
[powerbook.2-net:00606] pls:rsh: local csh: 1, local bash: 0
[powerbook.2-net:00606] pls:rsh: assuming same remote shell as local shell
[powerbook.2-net:00606] pls:rsh: remote csh: 1, remote bash: 0
[powerbook.2-net:00606] pls:rsh: final template argv:
[powerbook.2-net:00606] pls:rsh: /usr/bin/ssh orted --debug --bootproxy 1 --name --num_procs 6 --vpid_start 0 --nodename --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00606] pls:rsh: launching on node Powerbook.2-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: Powerbook.2-net is a LOCAL node
[powerbook.2-net:00606] pls:rsh: changing to directory /Users/motte
[powerbook.2-net:00606] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 6 --vpid_start 0 --nodename Powerbook.2-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:49443" --mpi-call-yield 0
[powerbook.2-net:00607] [0,0,1] setting up session dir with
[powerbook.2-net:00607] universe default-universe
[powerbook.2-net:00607] user motte
[powerbook.2-net:00607] host Powerbook.2-net
[powerbook.2-net:00607] jobid 0
[powerbook.2-net:00607] procid 1
[powerbook.2-net:00607] procdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0/1
[powerbook.2-net:00607] jobdir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe/0
[powerbook.2-net:00607] unidir: /tmp/openmpi-sessions-motte@Powerbook.2-net_0/default-universe
[powerbook.2-net:00607] top: openmpi-sessions-motte@Powerbook.2-net_0
[powerbook.2-net:00607] tmp: /tmp
[powerbook.2-net:00606] pls:rsh: launching on node g4d003.3-net
[powerbook.2-net:00606] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[powerbook.2-net:00606] pls:rsh: g4d003.3-net is a REMOTE node
[powerbook.2-net:00606] pls:rsh: executing: /usr/bin/ssh g4d003.3-net orted --debug --bootproxy 1 --name 0.0.2 --num_procs 6 --vpid_start 0 --nodename g4d003.3-net --universe motte@powerbook.2-net:default-universe --nsreplica "0.0.0;tcp://192.168.2.3:49443" --gprreplica "0.0.0;tcp://192.168.2.3:4944
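[Editor's sketch] For readers who want to try the minimal reproducer Jeff asks for above (MPI_INIT, one MPI_ALLREDUCE across MPI_COMM_WORLD, MPI_FINALIZE), here is one way it might look, written in C rather than Fortran for brevity; it is purely illustrative and should be launched with ranks placed on both subnets, e.g. via the same hostfile.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank contributes its rank number; all ranks get the total. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d: sum of ranks = %d\n", rank, sum);

        MPI_Finalize();
        return 0;
    }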
[OMPI users] MPI_Comm_spawn
I have a very simple program which spawns a number of slaves. I am getting erratic results from this program. It seems that all the slave processes are spawned, but not all of them complete MPI_Init() before the main program ends. In addition, I get the following error messages, for which I haven't been able to find any documentation:

[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276
[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/soh_base_get_proc_soh.c at line 80
[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/oob_base_xcast.c at line 108
[turkana:26736] [0,0,0] ORTE_ERROR_LOG: Not found in file base/rmgr_base_stage_gate.c at line 276

I am using openmpi 1.1 on FC4 on a dual AMD Athlon machine. My program is as follows:

#include
#include
#include
#include
#include

int main(int ac, char *av[])
{
    int rank, size;
    char name[MPI_MAX_PROCESSOR_NAME];
    int nameLen;
    int n = 5, i;
    int slave = 0;
    int errs[5];
    char *args[] = { av[0], "-W", NULL };
    MPI_Comm intercomm;
    int err;

    memset(name, sizeof(name), 0);
    for(i=1; i
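[Editor's sketch] The listing above arrives truncated in the archive (the header names and the rest of the program are lost), so here is a hedged, self-contained sketch of a comparable single-binary spawn test, not the poster's code. The point it illustrates is the synchronization the poster is missing: a collective on the intercommunicator (a barrier here) keeps the parent alive until the spawned children have finished MPI_Init. All names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_get_parent(&parent);

        if (parent == MPI_COMM_NULL) {
            /* Parent: spawn 2 children running this same executable. */
            int errcodes[2];
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &intercomm, errcodes);
            /* The barrier completes only after the children are up,
               so the parent cannot finish before they initialize. */
            MPI_Barrier(intercomm);
            MPI_Comm_disconnect(&intercomm);
            printf("parent %d: children spawned and synchronized\n", rank);
        } else {
            /* Child: meet the parent at the same barrier. */
            MPI_Barrier(parent);
            MPI_Comm_disconnect(&parent);
            printf("child %d: initialized\n", rank);
        }

        MPI_Finalize();
        return 0;
    }

Spawning argv[0] and branching on MPI_Comm_get_parent keeps the example to a single executable; a separate slave binary works the same way.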
[OMPI users] MPI_Comm_spawn
What configuration do I need to run child/slave MPI processes created via MPI_Comm_spawn on another machine? Thanks. Saadat.
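[Editor's sketch] As a hedged pointer, the MPI-2 standard reserves a "host" key on the MPI_Info argument of MPI_Comm_spawn for requesting that the children be placed on a named machine; the target machine generally still has to be known to the runtime (e.g. listed via --host or a hostfile given to mpirun), and whether a particular Open MPI release honors the key should be checked against its FAQ. The host name "node02" and the program "./slave" below are placeholders.

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        MPI_Info info;
        int errcodes[4];

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "node02");   /* ask for the children on node02 */
        MPI_Comm_spawn("./slave", MPI_ARGV_NULL, 4, info, 0,
                       MPI_COMM_WORLD, &intercomm, errcodes);
        MPI_Info_free(&info);

        MPI_Finalize();
        return 0;
    }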
[OMPI users] Mac OS X, ppc64, opal_if
Hi,

I'm trying to build a 64-bit version of Open MPI on a PowerPC running Mac OS X 10.4.7, using a recent snapshot of gcc/g++/gfortran from the gcc svn repository. The configure and build process goes smoothly, but the result doesn't pass all the tests. It hangs on the opal_if test, as well as on simply running mpirun -np 1 hostname. The process ramps its memory up to about 3.5 GB and the machine becomes very sluggish (it has 8 GB). Has anybody seen this before, or had experience building on similar systems?

Chris

% uname -a
Darwin kees.local 8.7.0 Darwin Kernel Version 8.7.0: Fri May 26 15:20:53 PDT 2006; root:xnu-792.6.76.obj~1/RELEASE_PPC Power Macintosh powerpc

% gcc -v
Using built-in specs.
Target: powerpc-apple-darwin8.7.0
Configured with: ./configure
Thread model: posix
gcc version 4.2.0 20060703 (experimental)

ompi_info.out (Description: Binary data)
config.log.gz (Description: GNU Zip compressed data)
Re: [OMPI users] Mac OS X, ppc64, opal_if
Chris,

I'm doing most of my work on Open MPI on a similar machine (dual G5, 64-bit, with 4 GB of RAM). I'm using several compilers (gcc as well as IBM's). Usually I compile it with the latest unstable version of gcc/gfortran (fetched directly from subversion). I have never noticed this problem; everything runs smoothly for me. However, the last time I updated my gcc/gfortran was 3 weeks ago. I will try to update to the latest version and see if I notice something weird.

Thanks,
George.

On Jul 5, 2006, at 4:04 PM, Chris Kees wrote:

> Hi,
>
> I'm trying to build a 64-bit version of Open MPI on a PowerPC running
> Mac OS X 10.4.7, using a recent snapshot of gcc/g++/gfortran from the
> gcc svn repository. The configure and build process goes smoothly, but
> the result doesn't pass all the tests. It hangs on the opal_if test,
> as well as on simply running mpirun -np 1 hostname. The process ramps
> its memory up to about 3.5 GB and the machine becomes very sluggish
> (it has 8 GB). Has anybody seen this before, or had experience building
> on similar systems?
>
> Chris
[OMPI users] Dynamic COMM_WORLD
Hello all,

Before I embark on a train that will run out of tracks, I wanted to get a WFF concerning the spawning mechanism in Open MPI. The intent is that I would program a "simple" parallel application that would demonstrate the ability of recent MPI implementations (Open MPI) to dynamically ADD or remove processes from the parallel task pool. Of course, any documentation concerning these new features would also be greatly appreciated ;)

Thanks,

--
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517
Re: [OMPI users] OpenMPI, debugging, and Portland Group's pgdbg
This took a long time for me to get to, but once I did, what I found was that the closest thing to working for the PGI compilers with Open MPI is this command:

mpirun --debugger "pgdbg @mpirun@ @mpirun_args@" --debug -np 2 ./cpi

It appears to work; that is, you can select a process with the "proc" command in pgdbg and set breakpoints and all, but pgdbg prints a lot of error messages that are all the same:

db_set_code_brk : DiBreakpointSet fails

which is sort of annoying, but didn't impede my debugging of my 100-line MPI test program. I posted this to the PGI Debugger Forum:

http://www.pgroup.com/userforum/viewtopic.php?p=1969

and got a response saying (hopefully Mat doesn't mind me quoting him):

  Hi Andy,
  Actually I'm pleasantly surprised that PGDBG works at all with OpenMPI, since PGDBG currently only supports MPICH. While we're planning on adding OpenMPI and MPICH-2 support later this year, in the immediate future there isn't a workaround for this problem, other than to use MPICH.
  Thanks, Mat

So I guess the short answer is that it might sort of work if you really need it; otherwise it's best to wait a little while.

--andy

On Fri, 16 Jun 2006, Jeff Squyres (jsquyres) wrote:

I'm afraid that I'm not familiar with the PG debugger, so I don't know how it is supposed to be launched. The intent with --debugger / --debug is that you can do a single invocation of some command and it launches both the parallel debugger and tells that debugger to launch your parallel MPI process (presumably allowing the parallel debugger to attach to your parallel MPI process). This is what fx2 and Totalview allow, for example.

As such, the "--debug" option is simply syntactic sugar for invoking another [perhaps non-obvious] command. We figured it was simpler for users to add "--debug" to the already-familiar mpirun command line than to learn a new syntax for invoking a debugger (although both would certainly work equally well).

As such, when OMPI's mpirun sees "--debug", it ends up exec'ing something else -- the parallel debugger command. In the example that I gave in http://www.open-mpi.org/community/lists/users/2005/11/0370.php, mpirun looked for two things in your path: totalview and fx2. For example, if you did this:

mpirun --debug -np 4 a.out

If it found totalview, it would end up exec'ing:

totalview @mpirun@ -a @mpirun_args@

which would get substituted to

totalview mpirun -a -np 4 a.out

(note the additional "-a"), which is the totalview command line syntax to launch their debugger and tell it to launch your parallel process. If totalview is not found in your path, it'll look for fx2. If fx2 is found, it'll invoke:

fx2 @mpirun@ -a @mpirun_args@

which would get substituted to

fx2 mpirun -a -np 4 a.out

You can see that fx2's syntax was probably influenced by totalview's.

So what you need is the command line that tells pgdbg to do the same thing -- launch your app and attach to it. You can then substitute that into the "--debugger" option (using the @mpirun@ and @mpirun_args@ tokens), or set the MCA parameter "orte_base_user_debugger", and then use --debug. For example, if the pgdbg syntax is similar to that of totalview and fx2, then you could do the following:

mpirun --debugger "pgdbg @mpirun@ -a @mpirun_args@" --debug -np 4 a.out

or (assuming tcsh):

shell% setenv OMPI_MCA_orte_base_user_debugger "pgdbg @mpirun@ -a @mpirun_args@"
shell% mpirun --debug -np 4 a.out

Make sense? If you find a fixed format for pgdbg, we'd be happy to add it to the default value of the orte_base_user_debugger MCA parameter.
Note that OMPI currently only supports the Totalview API for attaching to MPI processes -- I don't know if pgdbg requires something else.