[OMPI users] Unable to run WRF on large core counts (1024+), queue pair error

2009-12-17 Thread Craig Tierney
I am trying to run WRF on 1024 cores with OpenMPI 1.3.3 and 1.4. I can get the code to run with 512 cores, but it crashes at startup on 1024 cores. I am getting the following error message: [n172][[43536,1],0][connect/btl_openib_connect_oob.c:463:qp_create_one] error creating qp errno says Can

Re: [OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM

2009-12-17 Thread Jeff Squyres
On Dec 17, 2009, at 5:55 PM, wrote: > I am happy to be able to inform you that the problems we were > seeing would seem to have been arising down at the OpenMPI > level. Happy for *them*, at least. ;-) > If I remove any acknowledgement of IPv6 within the OpenMPI > code, then both the PETSc exa

[OMPI users] NetBSD OpenMPI - SGE - PETSc - PISM

2009-12-17 Thread Kevin . Buckley
A whole swathe of people have been made aware of the issues that have arisen as a result of a researcher here looking to run PISM, which sits on top of PETSc, which sits on top of OpenMPI. I am happy to be able to inform you that the problems we were seeing would seem to have been arising down at

Re: [OMPI users] Notifier Framework howto

2009-12-17 Thread Jeff Squyres
That would be great! On Dec 17, 2009, at 3:52 PM, Ralph Castain wrote: > If it would help, I have time and am willing to add notifier calls to this > area of the code base. You'll still get the errors shown here as I always > bury the notifier call behind the error check that surrounds these er

Re: [OMPI users] Notifier Framework howto

2009-12-17 Thread Ralph Castain
If it would help, I have time and am willing to add notifier calls to this area of the code base. You'll still get the errors shown here as I always bury the notifier call behind the error check that surrounds these error messages to avoid impacting the critical path, but you would be able to gt

Re: [OMPI users] error performing MPI_Comm_spawn

2009-12-17 Thread Ralph Castain
Will be in the 1.4 nightly tarball generated later tonight... Thanks again Ralph On Dec 17, 2009, at 4:07 AM, Marcia Cristina Cera wrote: > very good news > I will wait carefully for the release :) > > Thanks, Ralph > márcia. > > On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain wrote: > A

Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Nicolas Bock
Ok, I'll give it a try. Thanks, nick On Thu, Dec 17, 2009 at 12:44, Ralph Castain wrote: > In case you missed it, this patch should be in the 1.4 nightly tarballs - > feel free to test and let me know what you find. > > Thanks > Ralph > > On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote: > > Th

Re: [OMPI users] [visit-developers] /usr/bin/ld: cannot find -lrdmacm on 9184

2009-12-17 Thread tom fogal
Simon Su writes: > Hi Tom, > > A "hello world" MPI program also won't compile if > librdmacm-devel-1.0.8-5.el5 is not installed. I have asked the person > who maintains the openmpi package how it was compiled. My guess > is librdmacm-devel-1.0.8-5.el5 may need to be added as a dependency > packa

Re: [OMPI users] MPI_Comm_spawn lots of times

2009-12-17 Thread Ralph Castain
In case you missed it, this patch should be in the 1.4 nightly tarballs - feel free to test and let me know what you find. Thanks Ralph On Dec 2, 2009, at 10:06 PM, Nicolas Bock wrote: > That was quick. I will try the patch as soon as you release it. > > nick > > > On Wed, Dec 2, 2009 at 21:

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min, I've had a chance to use the openmpi-mpirun script. Though unfortunately I do not have openmpi available at the moment (customer where I'm at runs RHEL4 with LAM/MPI) I've been able to see what quotes need to be used and this one seems to work on my end: bsub -e ERR -o OUT -n 16 './open

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen, Thanks a lot. Unfortunately I don't think I have mpirun.lsf. Dell only asked us to use openmpi-mpirun as a wrapper script. Cheers, Min Zhu -----Original Message----- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeroen Kleijer S

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min, Sorry for the mixup but it's been a while since I've actually used LSF. I've had a look at my notes and to use mpirun with LSF you should give something like this: bsub -a openmpi -n 16 "mpirun.lsf -x PATH -x LD_LIBRARY_PATH -x MPI_BUFFER_SIZE \" ulimit -s unlimited ; ./wrf.exe \" " (a

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Ashley, I made the change you mentioned in the openmpi-mpirun script, but wrf.exe is still not executed. Cheers, Min Zhu -----Original Message----- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ashley Pittman Sent: 17 December 2009 16:39 To: Open MPI Users Su

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, This time the OUT file is Sender: LSF System Subject: Job 667: Exited Job was submitted from host by user . Job was executed on host(s) <8*compute-01>, in queue , as user . <8*compute-12> was used as the home directory. was u

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Ashley Pittman
On Thu, 2009-12-17 at 14:40 +, Min Zhu wrote: > Here is the content of the openmpi-mpirun file, so maybe something needs to > be changed? > if [ x"${LSB_JOBFILENAME}" = x -o x"${LSB_HOSTS}" = x ]; then > usage > exit -1 > fi > > MYARGS=$* Shouldn't this be MYARGS=$@? It'll change the way
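The difference between $* and "$@" raised in this message can be seen in a small sketch. The functions below are hypothetical stand-ins and are not taken from the real openmpi-mpirun script, whose full text is not shown here. The sketch assumes the wrapper later expands MYARGS without quotes. Note also that in a plain assignment most shells join $@ into one string just as they do $*, so argument boundaries survive only when "$@" is expanded directly.

```shell
#!/bin/sh
# Hypothetical helper: reports how many arguments it received.
count_args() {
    echo "$#"
}

# Hypothetical wrapper variant that copies its arguments into a variable.
forward_star() {
    MYARGS=$*            # all arguments are joined into one string
    count_args $MYARGS   # unquoted expansion: re-split on whitespace
}

# Hypothetical wrapper variant that forwards the arguments directly.
forward_at() {
    count_args "$@"      # each original argument stays one word
}

# Three arguments go in: /bin/sh, -c, and the quoted command string.
star_count=$(forward_star /bin/sh -c 'ulimit -s unlimited; ./wrf.exe')
at_count=$(forward_at /bin/sh -c 'ulimit -s unlimited; ./wrf.exe')

echo "via MYARGS=\$*: $star_count arguments"   # 6: the quoted string was split apart
echo "via \"\$@\":     $at_count arguments"    # 3: boundaries preserved
```

If the wrapper does behave like forward_star, the quoted command string reaches /bin/sh as separate words, which would be consistent with the quoting problems reported in this thread.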

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min, Seems like the command ulimit was executed 16 times but after that (the "; ./wrf.exe") was ignored. Could you give the following a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun \"/bin/sh -c 'ulimit -s unlimited ; ./wrf.exe ' \" " Somewhere, quoting goes wrong and I'm trying to figure out
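One possible explanation for this symptom, assuming the words reach /bin/sh as separate arguments: sh -c treats only the next single argument as the command string, and any further words become $0, $1, and so on. The sketch below uses echo in place of ulimit and wrf.exe so that it runs anywhere; it illustrates the shell rule only and is not taken from the openmpi-mpirun script.

```shell
#!/bin/sh
# Unquoted: the command string is just "echo"; "hello" becomes $0 and
# "world" becomes $1, so echo prints an empty line.
split_out=$(sh -c echo hello world)

# Quoted: the whole string is the command, so echo sees both words.
quoted_out=$(sh -c 'echo hello world')

printf 'split:  [%s]\n' "$split_out"    # prints: split:  []
printf 'quoted: [%s]\n' "$quoted_out"   # prints: quoted: [hello world]
```

By the same rule, a command line that arrives as sh -c ulimit -s unlimited would run ulimit with no options, which only prints a limit and changes nothing.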

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen, Here is the OUT file, ERR file is empty. -- Sender: LSF System Subject: Job 662: Done Job was submitted from host by user . Job was executed on host(s) <8*compute-10>, in queue , as user .

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
Hi Min Did you get any type of error message in the ERR or OUT files? I don't have mpirun installed in the environment at the moment but giving the following: bsub -q interq -I "ssh /bin/sh -c 'ulimit -s unlimited ; /bin/hostname ' " seems to work for me, so I'm kind of curious what the error m

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeroen, Thanks for your reply. I tried the command bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " and wrf.exe was not executed. Cheers, Min Zhu -----Original Message----- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeroen Kleijer
It's just that the double quotes on the command line get parsed by LSF / bash (or whatever shell you use). If you wish to use it without the script, you can give this a try: bsub -e ERR -o OUT -n 16 "openmpi-mpirun /bin/sh -c 'ulimit -s unlimited; ./wrf.exe ' " This passes the whole string "openmpi-m

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeff, Your script method works for me. Thank you very much, Cheers, Min Zhu -----Original Message----- From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres Sent: 17 December 2009 14:56 To: Open MPI Users Subject: Re: [OMPI users] About openmpi-mpir

Re: [OMPI users] Debugging spawned processes

2009-12-17 Thread Ralph Castain
On Dec 17, 2009, at 3:54 AM, jody wrote: > yeah, now that you mention it, I remember (old brain here, as well) > But IIRC you created an OMPI version which was called 1.4a1r or something, > where I indeed could use this xterm. When I updated to 1.3.2, I sort > of forgot about it again... The 1.4

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeff Squyres
This might be something you need to talk to Platform about...? Another option would be to openmpi-mpirun a script that is just a few lines long: #!/bin/sh ulimit -s unlimited ./wrf.exe On Dec 17, 2009, at 9:40 AM, Min Zhu wrote: > Hi, Jeff, > > Thanks. For bsub -e ERR -o OUT -n 16 openmpi-m
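The script approach suggested here can be sketched as follows. The file name run_wrf.sh is a hypothetical choice, and the bsub line in the closing comment simply combines the options already used in this thread with that name; it has not been checked against an LSF installation.

```shell
#!/bin/sh
# Sketch: write the few-line wrapper script to a file.
# "run_wrf.sh" is a hypothetical name chosen for this example.
script=run_wrf.sh

cat > "$script" <<'EOF'
#!/bin/sh
# Raise the stack limit, then start WRF from the same shell so the
# new limit is inherited by wrf.exe.
ulimit -s unlimited
./wrf.exe
EOF

chmod +x "$script"

# Submission would then look something like this (not executed here):
#   bsub -e ERR -o OUT -n 16 openmpi-mpirun ./run_wrf.sh
```

Because the launcher only ever sees a single word (the script path), no nested quoting is needed on the bsub command line.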

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Jeff, Thanks. I tried bsub -e ERR -o OUT -n 16 openmpi-mpirun /bin/sh -c "ulimit -s unlimited; ./wrf.exe", and wrf.exe isn't executed. Here is the content of the openmpi-mpirun file, so maybe something needs to be changed? -- #!/bin/sh # # Copyr

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Jeff Squyres
On Dec 17, 2009, at 9:15 AM, Min Zhu wrote: > Thanks for your reply. Yes, your mpirun command works for me. But I need to > use bsub job scheduler. I wonder why > bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; > ./wrf.exe" doesn't work. Try with different quoting...?

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, Thanks for your reply. Yes, your mpirun command works for me. But I need to use bsub job scheduler. I wonder why bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" doesn't work. Cheers, Min Zhu -----Original Message----- From: users-boun...@open-mpi.org

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Romaric David
Min Zhu wrote: > Hi, I executed bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" Hello, Here we use: mpirun /bin/tcsh -c " limit stacksize unlimited ; /full/path/to/wrf.exe" Regards, R. David

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Hi, I executed bsub -e ERR -o OUT -n 16 openmpi-mpirun "/bin/sh -c ulimit -s unlimited; ./wrf.exe" Here are the results; it seems that wrf.exe didn't run. The ERR file is empty. Sender: LSF System Subject: Job 647: Done Job was subm

Re: [OMPI users] About openmpi-mpirun

2009-12-17 Thread Romaric David
Min Zhu wrote: > Dear all, I have a question to ask you. If I issue the command "bsub -n 16 openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes) cluster, the job fails to run due to a stack size problem. Openmpi-mpirun is a wrapper script written by Platform. If I want to add '/bin/sh -c ulimit -s

[OMPI users] About openmpi-mpirun

2009-12-17 Thread Min Zhu
Dear all, I have a question to ask you. If I issue the command "bsub -n 16 openmpi-mpirun ./wrf.exe" to my 128-core (16 nodes) cluster, the job fails to run due to a stack size problem. Openmpi-mpirun is a wrapper script written by Platform. If I want to add '/bin/sh -c ulimit -s unlimited' to the abov

[OMPI users] MPI-IO, providing buffers

2009-12-17 Thread Ricardo Reis
Hi all, I have a question. I'm starting to use MPI-IO and was wondering if I can use MPI_BUFFER_ATTACH to provide the necessary IO buffer (or will it use the array I'm passing to the MPI_Write...??) Many thanks, Ricardo Reis 'Non Serviam' PhD candidate @ Lasef Computational Fluid Dynam

Re: [OMPI users] error performing MPI_Comm_spawn

2009-12-17 Thread Marcia Cristina Cera
very good news I will wait carefully for the release :) Thanks, Ralph márcia. On Wed, Dec 16, 2009 at 10:56 PM, Ralph Castain wrote: > Ah crumb - I found the problem. Sigh. > > I actually fixed this in the trunk over 5 months ago when the problem first > surfaced in my own testing, but it n

Re: [OMPI users] Debugging spawned processes

2009-12-17 Thread jody
yeah, now that you mention it, I remember (old brain here, as well) But IIRC you created an OMPI version which was called 1.4a1r or something, where I indeed could use this xterm. When I updated to 1.3.2, I sort of forgot about it again... Another question though: You said "If it includes the -xte

Re: [OMPI users] Pointers for understanding failure messages on NetBSD

2009-12-17 Thread Kevin . Buckley
> You could confirm that it is the IPv6 loop by simply disabling IPv6 > support - configure with --disable-ipv6 and see if you still get the error > messages > > Thanks for continuing to pursue this! > Ralph > Yeah, but if you disable the IPv6 stuff then there's a completely different path taken