Re: [OMPI users] More newbie question: --hostfile option

2011-01-12 Thread Ralph Castain
On Jan 12, 2011, at 7:23 PM, Tena Sakai wrote: > Hi, > > I can execute the command below: >$ mpirun -H vixen -np 1 hostname : -H compute-0-0,compute-0-1,compute-0-2 > -np 3 hostname > and I get: >vixen.egcrc.org >compute-0-0.local >compute-0-1.local >compute-0-2.local > > I

Re: [OMPI users] Argument parsing issue

2011-01-27 Thread Ralph Castain
The problem is that mpirun regenerates itself to exec a command of "totalview mpirun ", and the quotes are lost in the process. Just start your debugged job with "totalview mpirun ..." and it should work fine. On Jan 27, 2011, at 3:00 AM, Gabriele Fatigati wrote: > The problem is how mpiru

Re: [OMPI users] allow job to survive process death

2011-01-27 Thread Ralph Castain
On Jan 27, 2011, at 7:47 AM, Reuti wrote: > On 27.01.2011 at 15:23, Joshua Hursey wrote: > >> The current version of Open MPI does not support continued operation of an >> MPI application after process failure within a job. If a process dies, so >> will the MPI job. Note that this is true of

Re: [OMPI users] Default hostfile not being used by mpirun

2011-02-05 Thread Ralph Castain
The easiest solution is to take advantage of the fact that the default hostfile is an MCA parameter - so you can specify it in several ways other than on the cmd line. It can be in your environment, in the default MCA parameter file, or in an MCA param file in your home directory. See http://w
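
A minimal sketch of the non-command-line routes mentioned above. The parameter name orte_default_hostfile, the per-user file $HOME/.openmpi/mca-params.conf, and the hostfile path are assumptions (typical of the 1.4/1.5 series, not taken from this message); check ompi_info on your own install before relying on them.

```shell
# Assumption: the default hostfile is exposed as the MCA parameter
# "orte_default_hostfile"; the hostfile path below is hypothetical.
hostfile="$HOME/my_hosts"

# Route 1: environment. MCA parameters are read from OMPI_MCA_<name>.
export OMPI_MCA_orte_default_hostfile="$hostfile"

# Route 2: an MCA parameter file. This is the line that would go into
# $HOME/.openmpi/mca-params.conf (per user) or into the default
# parameter file under the install prefix (site-wide).
param_line="orte_default_hostfile = $hostfile"
echo "$param_line"
```

Either route should let a plain "mpirun -np N prog" pick up the hostfile without --default-hostfile on the command line.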

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-07 Thread Ralph Castain
The 1.4 series is regularly tested on slurm machines after every modification, and has been running at LANL (and other slurm installations) for quite some time, so I doubt that's the core issue. Likewise, nothing in the system depends upon the FQDN (or anything regarding hostname) - it's just us

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes. On Feb 8, 201

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
See below On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote: > > On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote: > >> Hi Michael, >> >> You may have tried to send some debug information to the list, but it >> appears to have been blocked. Compressed text output of the backtrace text >

Re: [OMPI users] Segmentation fault with SLURM and non-local nodes

2011-02-08 Thread Ralph Castain
(due out any time now) with the 1.5.1 slurm support. Any interested parties can follow it here: https://svn.open-mpi.org/trac/ompi/ticket/2717 Ralph On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote: > > On 09/02/2011, at 9:16 AM, Ralph Castain wrote: > >> See below >>

Re: [OMPI users] Mpirun --app option not working

2011-02-09 Thread Ralph Castain
Gus is correct - the -host option needs to be in the appfile On Feb 9, 2011, at 3:32 PM, Gus Correa wrote: > Sindhi, Waris PW wrote: >> Hi, >>I am having trouble using the --app option with OpenMPI's mpirun >> command. The MPI processes launched with the --app option get launched >> on the l
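
As a rough illustration of putting -host inside the appfile, the sketch below writes a two-line appfile and prints the command that would use it. The host names (nodeA, nodeB) and the program name (./solver) are hypothetical.

```shell
# Hypothetical host and program names; each appfile line carries its
# own -host and -np, just as it would on the mpirun command line.
appfile="$(mktemp)"
cat > "$appfile" <<'EOF'
-host nodeA -np 2 ./solver
-host nodeB -np 2 ./solver
EOF

# Print the launch command rather than running it, so the sketch is
# harmless on a machine without Open MPI installed.
echo "would run: mpirun --app $appfile"
```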

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-12 Thread Ralph Castain
Have you searched the email archive and/or web for openmpi and Amazon cloud? Others have previously worked through many of these problems for that environment - might be worth a look to see if someone already solved this, or at least a contact point for someone who is already running in that env

Re: [OMPI users] Use mca_base_param_file_path to set .ssh?

2011-02-15 Thread Ralph Castain
OMPI doesn't do anything relative to the .ssh directory, or what key is used for ssh authentication. Afraid that is one you have to solve at the system level :-/ On Feb 15, 2011, at 11:35 AM, Barnet Wagman wrote: > I need to find a way of controlling the rsa key used when open-mpi uses ssh >

Re: [OMPI users] Use mca_base_param_file_path to set .ssh?

2011-02-15 Thread Ralph Castain
Setting the mca param plm_rsh_agent to "ssh -i xxx" should do the trick, I think - haven't tried it, but it should work. On Feb 15, 2011, at 12:24 PM, Barnet Wagman wrote: > >> OMPI doesn't do anything relative to the .ssh directory, or what key is used >> for ssh authentication. >> >> Afrai
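
A sketch of that suggestion, untested here just as it was untested in the reply. The key path is hypothetical, and the OMPI_MCA_ environment form is an assumption based on the usual MCA naming convention.

```shell
# Hypothetical key path; substitute the key you actually need.
key="$HOME/.ssh/cluster_rsa"
agent="ssh -i $key"

# Either pass it per run:
#   mpirun -mca plm_rsh_agent "$agent" -np 4 ./my_app
# or set it once in the environment for later mpirun invocations:
export OMPI_MCA_plm_rsh_agent="$agent"
echo "$OMPI_MCA_plm_rsh_agent"
```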

Re: [OMPI users] Selecting different processors during function

2011-02-19 Thread Ralph Castain
Your question actually doesn't make sense in an MPI application. In MPI, you would have two independent processes running. One does the send, and the other does the receive. Both processes are running all the time, each on its own processor. So you don't "switch" to another processor - the rece

Re: [OMPI users] --without-tm [SEC=UNCLASSIFIED]

2011-02-21 Thread Ralph Castain
Simplest soln: add -bynode to your mpirun cmd line On Feb 20, 2011, at 10:50 PM, DOHERTY, Greg wrote: > In order to be able to checkpoint openmpi jobs with blcr, we have > configured openmpi as follows > > ./configure --prefix=/data1/packages/openmpi/1.5.1-blcr-without-tm > --disable-openib-co

Re: [OMPI users] MPI_Comm_spawn_multiple

2011-02-21 Thread Ralph Castain
I very much doubt that either of those mappers has ever been tested against comm_spawn. Just glancing thru them, I don't see an immediate reason why loadbalance wouldn't work, but the error indicates that the system wound up mapping one or more processes to an unknown node. We are revising the

Re: [OMPI users] SLURM environment variables at runtime

2011-02-23 Thread Ralph Castain
Resource managers generally frown on the idea of any program passing RM-managed envars from one node to another, and this is certainly true of slurm. The reason is that the RM reserves those values for its own use when managing remote nodes. For example, if you got an allocation and then used mpiru

Re: [OMPI users] SLURM environment variables at runtime

2011-02-23 Thread Ralph Castain
oes OpenMPI start the > processes on the remote nodes under the covers?  (use srun, generate a > hostfile and launch as you would outside SLURM, …)  This may be the > difference between HP-MPI and OpenMPI. Thanks, Brent  From: > users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
ALID=0 > > SLURM_LOCALID=1 > > SLURM_LOCALID=1 > > SLURM_NODEID=0 > > SLURM_NODEID=0 > > SLURM_NODEID=1 > > SLURM_NODEID=1 > > SLURM_PROCID=0 > > SLURM_PROCID=1 > > SLURM_PROCID=2 > > SLURM_PROCID=3 > > [brent@node1 mpi]$ > >

Re: [OMPI users] multicast not available

2011-02-24 Thread Ralph Castain
If you are trying to use OMPI as the base for ORCM, then you can tell ORCM to use OMPI's "tcp" multicast module - it fakes multicast using pt-2-pt tcp messaging. -mca rmcast tcp will do the trick. On Thu, Feb 24, 2011 at 6:27 AM, Jeff Squyres wrote: > I'm still not sure what you're asking --

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
MPI). > We should note that you -can- directly srun an OMPI job now. I believe that capability was released in the 1.5 series. It takes a minimum slurm release level plus a slurm configuration setting to do so. > > > On Feb 24, 2011, at 10:02 AM, Ralph Castain wrote: > > >

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
> > SLURM_TASK_PID=2590 > [brent@node2 mpi]$ > [brent@node2 mpi]$ > [brent@node2 mpi]$ grep SLURM_PROCID srun.out > SLURM_PROCID=0 > SLURM_PROCID=1 > [brent@node2 mpi]$ grep SLURM_PROCID mpirun.out > SLURM_PROCID=0 > [brent@node2 mpi]$ grep SLURM_PROCID hpmpi.out > SLURM

Re: [OMPI users] SLURM environment variables at runtime

2011-02-24 Thread Ralph Castain
I guess I wasn't clear earlier - I don't know anything about how HP-MPI works. I was only theorizing that perhaps they did something different that results in some other slurm vars showing up in Brent's tests. From Brent's comments, I guess they don't - but they launch jobs in a different manner th

Re: [OMPI users] Mac OS X Static PGI

2011-03-01 Thread Ralph Castain
Error means OMPI didn't find a network interface - do you have your networks turned off? Sometimes people travel with Airport turned off. If you have no wire connected, then no interfaces exist. Sent from my iPad On Mar 1, 2011, at 11:50 AM, David Robertson wrote: > Hi all, > > I am having tr

Re: [OMPI users] Mac OS X Static PGI

2011-03-01 Thread Ralph Castain
On Mar 1, 2011, at 1:34 PM, David Robertson wrote: > Hi, > > > Error means OMPI didn't find a network interface - do you have your > > networks turned off? Sometimes people travel with Airport turned off. > > If you have no wire connected, then no interfaces exist. > > I am logged in to the machi

Re: [OMPI users] Mac OS X Static PGI

2011-03-03 Thread Ralph Castain
Really appreciate you having looked into this! Unfortunately, I can't see a way to resolve this for the general public. It looks more to me like a PGI bug, frankly - not supporting code in a system-level include makes no sense to me. But I confess this seems to be PGI's mode of operation as I'v

Re: [OMPI users] Number of processes and spawn

2011-03-05 Thread Ralph Castain
OpenMPI version, > that the LD_LIBRARY_PATH is consistent. > So I would like to re-compile the openmpi-1.7a1r22794.tar.bz2 but where can I > find it ? > > > Thank you, > Federico > > > > > > > > > > > On 23 February 2011 03:

Re: [OMPI users] Number of processes and spawn

2011-03-05 Thread Ralph Castain
FWIW: just tried current trunk on a multi-node cluster, and the loop_spawn test worked fine there too. On Mar 5, 2011, at 11:05 AM, Ralph Castain wrote: > Hi Federico > > I tested the trunk today and it works fine for me - I let it spin for 1000 > cycles without issue. My tes

Re: [OMPI users] Number of processes and spawn

2011-03-07 Thread Ralph Castain
shprofile to export the correct > LD_LIBRARY_PATH. > - thank you for the useful trick about svn. No idea, then - all that error says is that the receiving code and the sending code are mismatched. > > > Thank you very much !!! > Federico. > > > > > > >

Re: [OMPI users] Problem running openmpi-1.4.3

2011-03-08 Thread Ralph Castain
You need to set your LD_LIBRARY_PATH to point to where you installed openmpi. On Mar 8, 2011, at 5:47 PM, Amos Leffler wrote: > Hi, >I am trying to get openmpi-1.4.3 to run but am having trouble. > It is run using SUSE-11.3 with Intel XE-2011 Composer C and Fortran > compilers. The comp
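
For illustration, a sketch of that setting for a bash-style shell. The install prefix is hypothetical; use whatever was passed to configure as --prefix (and note that some builds place the libraries under lib64 rather than lib).

```shell
# Hypothetical install prefix; use the --prefix given to configure.
prefix="$HOME/openmpi-1.4.3-install"

export PATH="$prefix/bin:$PATH"
# Prepend the lib directory, avoiding a stray ':' if the variable was empty.
export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```

Putting these lines in a shell startup file that is also read by non-interactive logins is the usual way to make the same paths visible on remote nodes.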

Re: [OMPI users] Mac OS X Static PGI

2011-03-11 Thread Ralph Castain
we can AC_DEFINE something to skip including that file > in opal/util/if.h. > > > > On Mar 3, 2011, at 4:22 PM, Ralph Castain wrote: > >> Really appreciate you having looked into this! >> >> Unfortunately, I can't see a way to resolve this for the genera

Re: [OMPI users] Number of processes and spawn

2011-03-14 Thread Ralph Castain
11/bin:/home/fandreasi/libtool-2.2.6b/bin:$PATH > > setenv LD_LIBRARY_PATH /home/fandreasi/libtool-2.2.6b/lib > > > > When I do the autogen it return me the error I've attached. > > Can you help me on this ? > > > > Thank you, > > Federico. > > > >

Re: [OMPI users] Error in Binding MPI Process to a socket

2011-03-17 Thread Ralph Castain
The error is telling you that your OS doesn't support queries telling us what cores are on which sockets, so we can't perform a "bind to socket" operation. You can probably still "bind to core", so if you know what cores are in which sockets, then you could use the rank_file mapper to assign pro
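
A sketch of the rank_file approach, under stated assumptions: the host name node01 is hypothetical, the core-to-socket layout (cores 0-3 on socket 0, cores 4-7 on socket 1) is invented for the example, and the "rank N=host slot=M" syntax with the -rf option is as I recall it from the 1.4-era mpirun man page. Check "man mpirun" on your install.

```shell
# Assumed layout: cores 0-3 sit on socket 0, cores 4-7 on socket 1.
# Ranks 0,1 go to socket 0 and ranks 2,3 to socket 1, one core each.
rankfile="$(mktemp)"
cat > "$rankfile" <<'EOF'
rank 0=node01 slot=0
rank 1=node01 slot=1
rank 2=node01 slot=4
rank 3=node01 slot=5
EOF

# Print the launch command rather than running it.
echo "would run: mpirun -np 4 -rf $rankfile ./my_app"
```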

Re: [OMPI users] Error in Binding MPI Process to a socket

2011-03-17 Thread Ralph Castain
> but ended up getting same error. Is there any patch that I can install in my > system to make it > topology aware? > > Thanks > > > On Thu, Mar 17, 2011 at 2:05 PM, Ralph Castain wrote: > The error is telling you that your OS doesn't support queries telling

Re: [OMPI users] Error in Binding MPI Process to a socket

2011-03-17 Thread Ralph Castain
7 13:14:04 CDT 2009 x86_64 x86_64 > x86_64 GNU/Linux > > > On Thu, Mar 17, 2011 at 2:55 PM, Ralph Castain wrote: > What OS version is it? > > uname -a > > will tell you, if you are on linux. > > On Mar 17, 2011, at 1:31 PM, vaibhav dutt wrote: > >> Hi

Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Ralph Castain
Just looking at this for another question. Yes, SGE integration is broken in 1.5. Looking at how to fix now. Meantime, you can get it to work by adding "-mca plm ^rshd" to your mpirun cmd line. On Mar 21, 2011, at 9:47 AM, Dave Love wrote: > Terry Dontje writes: > >> Dave what version of Grid

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
Can you run anything under TM? Try running "hostname" directly from Torque to see if anything works at all. The error message is telling you that the Torque daemon on the remote node reported a failure when trying to launch the OMPI daemon. Could be that Torque isn't setup to forward environmen

Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-21 Thread Ralph Castain
On Mar 21, 2011, at 11:12 AM, Dave Love wrote: > Ralph Castain writes: > >> Just looking at this for another question. Yes, SGE integration is broken in >> 1.5. Looking at how to fix now. >> >> Meantime, you can get it to work by adding "-mca plm ^rshd"

Re: [OMPI users] Displaying MAIN in Totalview

2011-03-21 Thread Ralph Castain
Ick - appears that got dropped a long time ago. I'll add it back in and post a CMR for 1.4 and 1.5 series. Thanks! Ralph On Mar 21, 2011, at 11:08 AM, David Turner wrote: > Hi, > > About a month ago, this topic was discussed with no real resolution: > > http://www.open-mpi.org/community/list

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
> OMPI_COMM_WORLD_LOCAL_SIZE=1 > OMPI_MCA_orte_ess_jobid=3236233217 > OMPI_MCA_orte_ess_vpid=0 > OMPI_COMM_WORLD_RANK=0 > OMPI_COMM_WORLD_LOCAL_RANK=0 > OPAL_OUTPUT_STDERR_FD=19 > > MPIExec with -mca plm rsh: > > [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdi

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
.82:33559 >> OMPI_MCA_mpi_yield_when_idle=0 >> OMPI_MCA_orte_app_num=0 >> OMPI_UNIVERSE_SIZE=1 >> OMPI_MCA_ess=env >> OMPI_MCA_orte_ess_num_procs=1 >> OMPI_COMM_WORLD_SIZE=1 >> OMPI_COMM_WORLD_LOCAL_SIZE=1 >> OMPI_MCA_orte_ess_jobid=3236233217 >> OM

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Ralph Castain
wsuhpc.edu >>>>> SHLVL=1 >>>>> HOME=/home/admins/rsvancara >>>>> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses >>>>> PBS_O_HOST=login1 >>>>> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel6

Re: [OMPI users] Is there an mca parameter equivalent to -bind-to-core?

2011-03-22 Thread Ralph Castain
On Mar 21, 2011, at 9:27 PM, Eugene Loh wrote: > Gustavo Correa wrote: > >> Dear OpenMPI Pros >> >> Is there an MCA parameter that would do the same as the mpiexec switch >> '-bind-to-core'? >> I.e., something that I could set up not in the mpiexec command line, >> but for the whole cluster, o

Re: [OMPI users] 1.5.3 and SGE integration?

2011-03-22 Thread Ralph Castain
On Mar 22, 2011, at 6:02 AM, Dave Love wrote: > Ralph Castain writes: > >>> Should rshd be mentioned in the release notes? >> >> Just starting the discussion on the best solution going forward. I'd >> rather not have to tell SGE users to add this to

Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3 (Tim Prince)

2011-03-22 Thread Ralph Castain
On a beowulf cluster? So you are using bproc? If so, you have to use the OMPI 1.2 series - we discontinued bproc support at the start of 1.3. Bproc will take care of the envars. If not bproc, then I assume you will use ssh for launching? Usually, the environment is taken care of by setting up y

Re: [OMPI users] Is there an mca parameter equivalent to -bind-to-core?

2011-03-23 Thread Ralph Castain
On Mar 23, 2011, at 2:20 PM, Gus Correa wrote: > Ralph Castain wrote: >> On Mar 21, 2011, at 9:27 PM, Eugene Loh wrote: >>> Gustavo Correa wrote: >>> >>>> Dear OpenMPI Pros >>>> >>>> Is there an MCA parameter that would do

Re: [OMPI users] keyval parser: error 1 reading file mpicc-wrapper-data.txt

2011-03-23 Thread Ralph Castain
On Mar 23, 2011, at 3:19 PM, Gus Correa wrote: > Dear OpenMPI Pros > > Why am I getting the parser error below? > It seems not to recognize comment lines (#). > > This is OpenMPI 1.4.3. > The same error happens with the other compiler wrappers too. > However, the wrappers compile and produce an

Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-03-24 Thread Ralph Castain
On Mar 24, 2011, at 12:45 PM, ya...@adina.com wrote: > Thanks for your information. For my Open MPI installation, actually > the executables such as mpirun and orted are dependent on those > dynamic intel libraries, when I use ldd on mpirun, some dynamic > libraries show up. I am trying to mak

Re: [OMPI users] keyval parser: error 1 reading file mpicc-wrapper-data.txt

2011-03-24 Thread Ralph Castain
> > The whole problem must have been some computer daemon spell. > Other than my growing feeling that flipping bits and logic gates > have fun conspiring against my sanity, all is well now. > > Thank you, > Gus Correa > > > > Gus Correa wrote: >> Ralph Cast

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
Try adding some print statements so you can see where the error occurs. On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote: > Hi , All: > > I running a Open MPI (1.3.4) program by 200 parallel processes. > > But, the program is terminated with > > ---

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
Have you tried a parallel debugger such as padb? On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote: > Hi, > > I have tried this. But, the printout from 200 parallel processes make it > very hard to locate the possible bug. > > They may not stop at the same point when the program got signal 9. >

Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain
On Mar 26, 2011, at 11:34 AM, Michele Marena wrote: > Hi, > I've a problem with shared memory. When my application runs using pure > message passing (one process for node), it terminates and returns correct > results. When 2 processes share a node and use shared memory for exchanges > messages

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
You don't need to install anything on a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes. As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck. On Mar

Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain
mmunicating processes are on the same node the > application don't terminate, otherwise the application terminate and its > results are correct. My OpenMPI version is 1.2.7. > > 2011/3/26 Ralph Castain > > On Mar 26, 2011, at 11:34 AM, Michele Marena wrote: > > >

Re: [OMPI users] Shared Memory Problem.

2011-03-26 Thread Ralph Castain
during > compilation of your source and execution. > > -- Reuti > > >> 2011/3/26 Ralph Castain >> Can you update to a more recent version? That version is several years >> out-of-date - we don't even really support it any more. >> >> >> On Mar

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
I don't know, but Ashley may be able to help - or you can see his web site for instructions. Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options. -o

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it: mpirun --output-filename /myhome/debug/run01 and in /myhome/debug, you will find files: run01.0 run01.1 ... each with the output from the indicated rank. On Mar 26, 2011, at 3
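
Once the per-rank files exist, a small helper can show where each rank stopped. This is only a sketch: last_line_per_rank is a hypothetical helper name, and it assumes the PREFIX.RANK file naming described in the message.

```shell
# Print "<rank>: <last line written>" for every per-rank output file
# that starts with the given prefix (e.g. /myhome/debug/run01).
last_line_per_rank() {
  prefix=$1
  for f in "$prefix".*; do
    [ -e "$f" ] || continue          # no matching files: skip the literal glob
    printf '%s: %s\n' "${f##*.}" "$(tail -n 1 "$f")"
  done
}

# Usage after a run such as
#   mpirun --output-filename /myhome/debug/run01 -np 200 ./my_app
# would be:
#   last_line_per_rank /myhome/debug/run01
```

With 200 ranks this gives one line per rank instead of interleaved output, which makes it easier to see which ranks stopped early.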

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-26 Thread Ralph Castain
That command line cannot possibly work. Both the -rf and --output-filename options require arguments. PLEASE read the documentation? mpirun -h, or "man mpirun" will tell you how to correctly use these options. On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote: > Hi, I used : > > mpirun -np 200

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-27 Thread Ralph Castain
On Mar 27, 2011, at 7:37 AM, Tim Prince wrote: > On 3/27/2011 2:26 AM, Michele Marena wrote: >> Hi, >> My application performs good without shared memory utilization, but with >> shared memory I get performance worst than without of it. >> Do I make a mistake? Don't I pay attention to something?

Re: [OMPI users] OMPI error terminate w/o reasons

2011-03-27 Thread Ralph Castain
It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit. Talk to your sys admin about imposed limits. Usually, there are flags you can provide to your job submission that allow you to change limits for your program.

Re: [OMPI users] Shared Memory Performance Problem.

2011-03-28 Thread Ralph Castain
ur overall application down*. >> >> How much does your application slow down in wall clock time? Seconds? >> Minutes? Hours? (anything less than 1 second is in the noise) >> >> >> >> On Mar 27, 2011, at 10:33 AM, Ralph Castain wrote: >> >> > >

Re: [OMPI users] Cannot launch slots on more than 2 remote machines

2011-03-28 Thread Ralph Castain
It is hanging because your last nodes are not receiving the launch command. The daemons receive a message from mpirun telling them what to launch. That message is sent via a tree-like routing algorithm. So mpirun sends to the first two daemons, each of which relays it on to some number of daemon
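
To make the relay idea concrete, here is a sketch of a plain binary tree in which node 0 stands for mpirun. This is an illustration of tree-style relaying in general, not Open MPI's actual routing code, and the numbering scheme is an assumption.

```shell
# Children of node "parent" in a binary relay tree over "total" nodes
# (node 0 is mpirun, nodes 1..total-1 are daemons).
children() {
  parent=$1; total=$2; kids=""
  for c in $((2 * parent + 1)) $((2 * parent + 2)); do
    if [ "$c" -lt "$total" ]; then kids="${kids:+$kids }$c"; fi
  done
  echo "$kids"
}

# Show who relays the launch message to whom in a 7-node example.
total=7
node=0
while [ "$node" -lt "$total" ]; do
  echo "node $node relays to: $(children "$node" "$total")"
  node=$((node + 1))
done
```

In such a scheme the later nodes never hear from mpirun directly. If an intermediate daemon cannot open a connection to its children (a firewall or a missing route between compute nodes, for example), those children never receive the launch command and the job hangs.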

Re: [OMPI users] keyval parser: error 1 reading file mpicc-wrapper-data.txt

2011-03-28 Thread Ralph Castain
tual mpicc-wrapper-data.txt file > that could cause this? > Anything in the parser code? > > We have Linux CentOS 5.5 x86_64 with gcc 4.1.2. > I built OpenMPI both with gfortran and Intel Ifort 12.0.0. > Same problem on both builds. > > Thank you, > Gus Correa > >

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-02 Thread Ralph Castain
I'm afraid I have no idea what you are talking about. Are you saying you are launching OMPI processes via mpirun, but with "pbsdsh" as the plm_rsh_agent??? That would be a very bad idea. If you are running under Torque, then let mpirun "do the right thing" and use its Torque-based launcher. On

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: > Let me expand on this slightly (in response to Ralph Castain's posting > -- I had digest mode set). As currently constructed a shellscript in > Wien2k (www.wien2k.at) launches a series of tasks using > > ($remote $remotemachine "cd $PWD;$t $ttt

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 9:12 AM, Reuti wrote: > On 03.04.2011 at 16:56, Ralph Castain wrote: > >> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: >> >>> Let me expand on this slightly (in response to Ralph Castain's posting >>> -- I had digest mode set).

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote: > On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain wrote: >> >> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote: >> >>> Let me expand on this slightly (in response to Ralph Castain's posting >>

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 2:00 PM, Laurence Marks wrote: >>> >>> I am not using that computer. A scenario that I have come across is >>> that when a msub job is killed because it has exceeded it's Walltime >>> mpi tasks spawned by ssh may not be terminated because (so I am told) >>> Torque does not kno

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
Works great for me...sleep is dead every time. On Apr 3, 2011, at 3:13 PM, David Singleton wrote: > >> You can prove this to yourself rather easily. Just ssh to a remote node and >> execute any command that lingers for awhile - say something simple like >> "sleep". Then kill the ssh and do a

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 3:22 PM, Reuti wrote: > On 03.04.2011 at 22:57, Ralph Castain wrote: > >> On Apr 3, 2011, at 2:00 PM, Laurence Marks wrote: >> >>>>> >>>>> I am not using that computer. A scenario that I have come across is >>>>>

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 4:08 PM, Reuti wrote: > On 03.04.2011 at 23:59, David Singleton wrote: > >> On 04/04/2011 12:56 AM, Ralph Castain wrote: >>> >>> What I still don't understand is why you are trying to do it this way. Why >>> not just run >>

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
On Apr 3, 2011, at 4:37 PM, Laurence Marks wrote: > On Sun, Apr 3, 2011 at 5:08 PM, Reuti wrote: >> On 03.04.2011 at 23:59, David Singleton wrote: >> >>> On 04/04/2011 12:56 AM, Ralph Castain wrote: >>>> >>>> What I still don't un

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-03 Thread Ralph Castain
oach. Good luck! > > On Sun, Apr 3, 2011 at 6:13 PM, Ralph Castain wrote: >> >> On Apr 3, 2011, at 4:37 PM, Laurence Marks wrote: >> >>> On Sun, Apr 3, 2011 at 5:08 PM, Reuti wrote: >>>> On 03.04.2011 at 23:59, David Singleton wrote: >>

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Ralph Castain
anned from the supercomputers I use >> I want to find a adequate patch for myself --- and then try and >> persuade the developers to adopt it. >> >> On Sun, Apr 3, 2011 at 6:13 PM, Ralph Castain wrote: >>> >>> On Apr 3, 2011, at 4:37 PM, Laurenc

Re: [OMPI users] Deadlock with mpi_init_thread + mpi_file_set_view

2011-04-04 Thread Ralph Castain
On Apr 4, 2011, at 8:18 AM, Rob Latham wrote: > On Sat, Apr 02, 2011 at 04:59:34PM -0400, fa...@email.com wrote: >> >> opal_mutex_lock(): Resource deadlock avoided >> #0 0x0012e416 in __kernel_vsyscall () >> #1 0x01035941 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 >> #2 0x

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
Guess I can/will add the node name to the error message - should have been there before now. If it is a debug build, you can add "-mca plm_base_verbose 1" to the cmd line and get output tracing the launch and showing you what nodes are having problems. On Apr 4, 2011, at 8:24 AM, Nehemiah Dac

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
e currently doesn't include the node name - not in the OMPI main code base, nor in the SCT port. So I will add it, which won't help you at the moment. Hence my suggestion about using the param :-) > > On Mon, Apr 4, 2011 at 9:34 AM, Ralph Castain wrote: > Guess I can/will

Re: [OMPI users] openmpi/pbsdsh/Torque problem

2011-04-04 Thread Ralph Castain
robust > and portable. This passes the simple test with B of "sleep 600" when > terminating the process where the mpirun is launched kills the sleep > on a remote node (unlike ssh on some but not all computers). > > On Mon, Apr 4, 2011 at 6:35 AM, Ralph Castain wrote: >

Re: [OMPI users] mpi problems,

2011-04-04 Thread Ralph Castain
Well, where is libfui located? Is that location in your ld path? Is the lib present on all nodes in your hostfile? On Apr 4, 2011, at 1:58 PM, Nehemiah Dacres wrote: > [jian@therock ~]$ echo $LD_LIBRARY_PATH > /opt/sun/sunstudio12.1/lib:/opt/vtk/lib:/opt/gridengine/lib/lx26-amd64:/opt/gridengin

Re: [OMPI users] WRF run on multiple Nodes

2011-04-05 Thread Ralph Castain
Did you request an allocation from PCM? If not, then PCM will block you from arbitrarily launching jobs on non-allocated nodes. Print out your environment and look for any envars from PCM and/or LSF (e.g., LSB_JOBID). I don't know what you mean about "no OMPI application is yet integrated with

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
If I read your error messages correctly, it looks like mpirun is crashing - the daemon is complaining that it lost the socket connection back to mpirun, and hence will abort. Are you seeing mpirun still alive? On Apr 5, 2011, at 4:46 AM, jody wrote: > Hi > > On my workstation and the cluste

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
> Warning: No xauth data; using fake authentication data for X11 forwarding. > Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh > > So perhaps the whole problem is linked to that xauth-thing. > Do you have a suggestion how this can be solved? > > Thank You > J

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
Sorry Jody - I should have read your note more carefully to see that you already tried -Y. :-( Not sure what to suggest... On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote: > Like I said, I'm not expert. However, a quick "google" of revealed this > result: > > >

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread Ralph Castain
Here's a little more info - it's for Cygwin, but I don't see anything Cygwin-specific in the answers: http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote: > Sorry Jody - I should have read your note more car

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
Look at your output from mpicc --showme. It indicates that the OMPI libs were put in the lib64 directory, not lib. On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote: > I am also trying to get netlib's hpl to run via sun cluster tools so i am > trying to compile it and am having trouble. Which

Re: [OMPI users] mpi problems,

2011-04-06 Thread Ralph Castain
at should they be in? > > > On Wed, Apr 6, 2011 at 2:44 PM, Ralph Castain wrote: > Look at your output from mpicc --showme. It indicates that the OMPI libs were > put in the lib64 directory, not lib. > > > On Apr 6, 2011, at 1:38 PM, Nehemiah Dacres wrote: > >

Re: [OMPI users] problem with configure and c++, lib and lib64

2011-04-06 Thread Ralph Castain
On Apr 6, 2011, at 1:27 PM, Jason Palmer wrote: > Hello, > > I’m trying again with the 1.4.3 version to use compile openmpi statically > with my program … but I’m running into a more basic problem, similar to one I > previously encountered and solved using LD_LIBRARY_PATH. > > The configure

Re: [OMPI users] OMPI 1.4.3 and "make distclean" error

2011-04-06 Thread Ralph Castain
On Apr 6, 2011, at 1:21 PM, David Gunter wrote: > We tend to build OMPI for several different architectures. Rather than untar > the archive file each time I'd rather do a "make distclean" in between > builds. However, this always produces the following error: > > ... > Making distclean in li

Re: [OMPI users] SGE and openmpi

2011-04-06 Thread Ralph Castain
Are you able to run non-MPI programs like "hostname"? I ask because that error message indicates that everything started just fine, but there is an error in your application. On Apr 6, 2011, at 6:01 PM, Jason Palmer wrote: > Btw, I did compile openmpi with the --with-sge flag. > > I am able t

Re: [OMPI users] SGE and openmpi

2011-04-12 Thread Ralph Castain
On Apr 11, 2011, at 11:33 PM, kevin.buck...@ecs.vuw.ac.nz wrote: > >>> #!/bin/bash >>> #$ -cwd >>> #$ -j y >>> #$ -S /bin/bash >>> #$ -q all.q >>> #$ -pe orte 18 >>> MPI_DIR=/home/jason/openmpi-1.4.3-install/bin >>> /home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS myprog > > >> If you

Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Ralph Castain
Let's simplify the issue as we have no idea what your codes are doing. Can you run two copies of hostname, for example? What about multiple copies of an MPI version of "hello" - see the examples directory in the OMPI tarball. On Apr 12, 2011, at 8:43 AM, Stergiou, Jonathan C CIV NSWCCD West Be

Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc

2011-04-12 Thread Ralph Castain
Okay, that says that mpirun is working correctly - the problem appears to be in MPI_Init. How was OMPI configured? On Apr 12, 2011, at 9:24 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote: > Ralph, > > Thanks for the reply and guidance. > > I ran the following: > > $> mpirun

Re: [OMPI users] OMPI monitor each process behavior

2011-04-13 Thread Ralph Castain
What version are you using? If you are using 1.5.x, there is an "orte-top" command that will do what you ask. It queries the daemons to get the info. On Apr 12, 2011, at 9:55 PM, Jack Bryan wrote: > Hi , All: > > I need to monitor the memory usage of each parallel process on a linux Open > M

Re: [OMPI users] Over committing?

2011-04-13 Thread Ralph Castain
On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote: > The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2). > Jobs are submitted by Torque/MOAB. When run with up to np=8 there is > good performance. Attempting to run with more processors brings > problems, specifically if any one

Re: [OMPI users] Over committing?

2011-04-13 Thread Ralph Castain
nvironment before printing this email. > -Original Message- > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph Castain > Sent: 13 April 2011 15:34 > To: Open MPI Users > Subject: Re: [OMPI users] Over committing? > > > On Apr

Re: [OMPI users] OMPI monitor each process behavior

2011-04-13 Thread Ralph Castain
On Apr 13, 2011, at 10:29 AM, Jack Bryan wrote: > Hi , > > If I cannot ssh to a worker node, it means that my program cannot work > correctly ? No, that's not true. People thought you were on a cluster using ssh as the launcher. From prior notes, you were using Torque, so not being allowed

Re: [OMPI users] OMPI monitor each process behavior

2011-04-13 Thread Ralph Castain
On Apr 13, 2011, at 10:19 AM, Jack Bryan wrote: > Hi, I am using > > mpirun (Open MPI) 1.3.4 > > But, I have these, > > orte-clean orted orte-iof orte-ps orterun > > Can they do the same thing? Unfortunately, no > > If I use them, will they use a lot of memory on each wo

Re: [OMPI users] Over committing?

2011-04-13 Thread Ralph Castain
tried test jobs on 8+7 (or 7+8) with inconclusive > results. >> Some of the live jobs run for a month or more and cut down versions do > >> not model well. >> >> Martin Rushton >> HPC System Manager, Weapons Technologies >> Tel: 01959 514777, Mobile: 07939

Re: [OMPI users] shm unlinking

2011-04-14 Thread Ralph Castain
Difficult to follow your thread here, but I think you're wondering about post-job cleanup? Torque runs an epilogue script on all nodes included in the allocation. It is advisable to always have the epilogue script clean out the tmp directories, assuming single-user use of allocated nodes. If mu
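As a rough illustration of the epilogue cleanup mentioned above, the sketch below removes leftover Open MPI session directories for one user. It is narrower than cleaning out the whole tmp directory, and it rests on assumptions: the directory naming pattern (openmpi-sessions-<user>@<host>_<n>), the helper name `clean_sessions`, and the convention that Torque passes the job owner as the epilogue's second argument should all be checked against your own installation. It is only appropriate when allocated nodes are single-user.

```shell
#!/bin/sh
# Sketch of a cleanup step for a Torque epilogue (single-user nodes assumed).
# clean_sessions DIR USER: remove session directories left behind by USER.
clean_sessions() {
    dir=$1
    user=$2
    # Assumed naming pattern: openmpi-sessions-<user>@<host>_<n>
    for d in "$dir"/openmpi-sessions-"$user"@*; do
        # If nothing matched, the pattern stays literal and this test skips it.
        [ -d "$d" ] && rm -rf -- "$d"
    done
    return 0
}

# In a real epilogue the call might look like this (assumption: $2 is the
# job owner's user name):
# clean_sessions /tmp "$2"
```

Restricting the glob to one user's directories keeps the function from touching files that belong to anyone else, which matters if the single-user assumption ever turns out to be wrong.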

Re: [OMPI users] Over committing?

2011-04-14 Thread Ralph Castain
vering customer-focused solutions >> >> Please consider the environment before printing this email. >> -Original Message- >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On >> Behalf Of Ralph Castain >> Sent: 14 April 2011 04:55 >> To

Re: [OMPI users] Condor and MPI

2011-04-14 Thread Ralph Castain
Not much we can say with that little info. :-/ Are you using Open MPI? If so, what version? When you say the job gets restarted, do you mean that Condor restarts the entire MPI job? If so, you had best talk to the Condor folks - it has nothing to do with Open MPI, but is due to a job control fl
