Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > I have no idea why your processes are crashing when run via Torque - are you > sure that the processes themselves crash? Are they segfaulting - if so, can > you use gdb to find out where? I have to admit I'm a newbie with gdb. I am trying to recompile my code as "ifort

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread PN
Dear Rolf, Thanks for your reply. I've created another PE and changed the submission script, explicitly specifying the hostname with "--host". However, the result is the same. # qconf -sp orte pe_name orte slots 8 user_lists NONE xuser_lists NONE start_proc_args

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > It is very hard to debug the problem with so little information. We > regularly run OMPI jobs on Torque without issue. Another small thing that I noticed. Not sure if it is relevant. When the job starts running there is an orte process. The args to this process are sli

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > > Information would be most helpful - the information we really need is > specified here: http://www.open-mpi.org/community/help/ Output of "ompi_info --all" is attached in a file. echo $LD_LIBRARY_PATH /usr/local/ompi-ifort/lib:/opt/intel/fce/10.1.018/lib:/opt/intel

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
2009/3/31 Ralph Castain : > It is very hard to debug the problem with so little information. We Thanks Ralph! I'm sorry my first post lacked enough specifics. I'll try my best to fill you guys in on as much debug info as I can. > regularly run OMPI jobs on Torque without issue. So do we. In fac

Re: [OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Ralph Castain
It is very hard to debug the problem with so little information. We regularly run OMPI jobs on Torque without issue. Are you getting an allocation from somewhere for the nodes? If so, are you using Moab to get it? Do you have a $PBS_NODEFILE in your environment? I have no idea why your pr
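
Ralph's question about `$PBS_NODEFILE` can be checked directly. Under Torque, the node file lists one host name per allocated slot, so a job granted `nodes=1:ppn=8` should show the same host eight times. The C sketch below counts slots and distinct hosts in such a file. The names `summarize_nodefile` and `struct nodefile_summary` are hypothetical, and the size limits are arbitrary assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 1024  /* arbitrary limit for this sketch */
#define MAX_NAME  256   /* assumes host names shorter than this */

/* Summary of a Torque-style node file: one line per allocated slot. */
struct nodefile_summary {
    int slots;  /* non-blank lines, i.e. allocated process slots */
    int hosts;  /* distinct host names */
};

/* Count slots and distinct hosts in an already-open node file. */
struct nodefile_summary summarize_nodefile(FILE *fp)
{
    /* static: too large for the stack; reset logically via s.hosts = 0 */
    static char seen[MAX_HOSTS][MAX_NAME];
    char line[MAX_NAME];
    struct nodefile_summary s = {0, 0};

    assert(fp != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';  /* strip the newline */
        if (line[0] == '\0')
            continue;                        /* ignore blank lines */
        s.slots++;
        int known = 0;
        for (int i = 0; i < s.hosts; i++) {
            if (strcmp(seen[i], line) == 0) {
                known = 1;
                break;
            }
        }
        if (!known && s.hosts < MAX_HOSTS) {
            strcpy(seen[s.hosts], line);     /* fits: same MAX_NAME size */
            s.hosts++;
        }
    }
    return s;
}
```

Inside a job, one might open the file with `fopen(getenv("PBS_NODEFILE"), "r")`, after checking that `getenv` did not return NULL. If the slot count differs from the `-np` value given to mpirun, that mismatch would be worth looking at first.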

[OMPI users] job runs with mpirun on a node but not if submitted via Torque.

2009-03-31 Thread Rahul Nabar
I've a strange OpenMPI/Torque problem while trying to run a job on our Opteron-SC-1435 based cluster: Each node has 8 cpus. If I go to a node and run it like so, the job works: mpirun -np 6 ${EXE_PATH}/${DACAPOEXE_PAR} ${ARGS} If I submit the same job through PBS/Torque, it starts running but

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 05:36:19PM -0400, Jeff Squyres wrote: > On Mar 31, 2009, at 5:25 PM, Kevin McManus wrote: > > >--- MCA component mtl:psm (m4 configuration macro) > >checking for MCA component mtl:psm compile mode... static > >checking --with-psm value... simple ok (unspecified) > >checking

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Jeff Squyres
On Mar 31, 2009, at 5:25 PM, Kevin McManus wrote: --- MCA component mtl:psm (m4 configuration macro) checking for MCA component mtl:psm compile mode... static checking --with-psm value... simple ok (unspecified) checking --with-psm-libdir value... sanity check ok (/usr/lib64) checking psm.h usab

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 04:59:00PM -0400, Jeff Squyres wrote: > My goal in having you try that statement in a standalone shell script > wasn't the success or failure of the uname command -- but rather to > figure out if something in that statement itself was causing the > syntax error. > > A

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread Rolf Vandevaart
On 03/31/09 14:50, Dave Love wrote: Rolf Vandevaart writes: However, I found that if I explicitly specify the "-machinefile $TMPDIR/machines", all 8 mpi processes were spawned within a single node, i.e. node0002. I had that sort of behaviour recently when the tight integration was broken on

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Jeff Squyres
My goal in having you try that statement in a standalone shell script wasn't the success or failure of the uname command -- but rather to figure out if something in that statement itself was causing the syntax error. Apparently it is not. There's an errant character elsewhere that is cau

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 10:11:17PM +0200, Bogdan Costescu wrote: > On Tue, 31 Mar 2009, Bogdan Costescu wrote: > > >'uname -X' is valid on Solaris, but not on Linux. > > Not good to reply to oneself, but I've looked at the archives and > realized that 'uname -X' comes from a message of the OP. M

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 09:57:02PM +0200, Bogdan Costescu wrote: > On Tue, 31 Mar 2009, Jeff Squyres wrote: > > >UNAME_REL=`(/bin/uname -X|grep Release|sed -e 's/.*= //')` > > Not sure what you want to achieve here... 'uname -X' is valid on > Solaris, but not on Linux. The OP has indicated alrea

Re: [OMPI users] Cannot build OpenMPI 1.3 with PGI pgf90 and Gnu gcc/g++.

2009-03-31 Thread Gus Correa
Hi Jeff, list Jeff: Thank you for your help and suggestions. Please correct my argument below if I am wrong. I am not sure yet if the problem is caused by libtool, because somehow it was not present in OpenMPI 1.2.8. Just as a comparison, the libtool commands on 1.2.8 and 1.3 are very similar,

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Bogdan Costescu
On Tue, 31 Mar 2009, Bogdan Costescu wrote: 'uname -X' is valid on Solaris, but not on Linux. Not good to reply to oneself, but I've looked at the archives and realized that 'uname -X' comes from a message of the OP. My guess is that the same source directory was used to build for Solaris p

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Bogdan Costescu
On Tue, 31 Mar 2009, Jeff Squyres wrote: UNAME_REL=`(/bin/uname -X|grep Release|sed -e 's/.*= //')` Not sure what you want to achieve here... 'uname -X' is valid on Solaris, but not on Linux. The OP has indicated already that he is running this on Linux (SLES) so the above line is supposed t
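
Bogdan's point is that `-X` is a Solaris-only flag for `uname`. For comparison, the POSIX `uname()` call, declared in `<sys/utsname.h>`, reports the OS name and release on both Linux and Solaris. At the shell level, `uname -r` is the portable counterpart. The helper name `os_release` below is hypothetical; this is only a sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Write "sysname release" (e.g. the kernel release on Linux) into buf
   using POSIX uname(). Returns 0 on success, -1 on failure or if the
   result does not fit in len bytes. */
int os_release(char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);
    if (uname(&u) == -1)
        return -1;
    int n = snprintf(buf, len, "%s %s", u.sysname, u.release);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```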

Re: [OMPI users] OpenMPI 1.3.1 + BLCR build problem

2009-03-31 Thread Josh Hursey
I think that the missing configure option might be the problem as well. The BLCR configure logic checks to see if you have enabled checkpoint/restart in Open MPI. If you haven't then it fails out of configure (probably should print a better error message - I'll put that on my todo list).

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread Dave Love
Rolf Vandevaart writes: >> However, I found that if I explicitly specify the "-machinefile >> $TMPDIR/machines", all 8 mpi processes were spawned within a single >> node, i.e. node0002. I had that sort of behaviour recently when the tight integration was broken on the installation we'd been give

Re: [OMPI users] OpenMPI 1.3.1 + BLCR build problem

2009-03-31 Thread Dave Love
M C writes: > --- MCA component crs:blcr (m4 configuration macro) > checking for MCA component crs:blcr compile mode... dso > checking --with-blcr value... sanity check ok (/opt/blcr) > checking --with-blcr-libdir value... sanity check ok (/opt/blcr/lib) > configure: WARNING: BLCR support request

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 01:37:22PM -0400, Jeff Squyres wrote: > On Mar 31, 2009, at 1:31 PM, Terry Dontje wrote: > > >Can you manually run UNAME_REL=`(/bin/uname -X|grep Release|sed -e > >'s/.*= //')` in your shell without error? > > > > Better would be to put this small script by itself: > > #!

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Kevin McManus
On Tue, Mar 31, 2009 at 01:31:12PM -0400, Terry Dontje wrote: > I was talking with Jeff Squyres about your issue and he thinks the > config.guess issue needs to be resolved first, even though your > specification of x86_64 seems to get you by. > > So, do you still see the unexpected "(" if you t

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Jeff Squyres
On Mar 31, 2009, at 1:31 PM, Terry Dontje wrote: Can you manually run UNAME_REL=`(/bin/uname -X|grep Release|sed -e 's/.*= //')` in your shell without error? Better would be to put this small script by itself: #! /bin/sh UNAME_REL=`(/bin/uname -X|grep Release|sed -e 's/.*= //')` echo got $UN

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread Rolf Vandevaart
On 03/31/09 11:43, PN wrote: Dear all, I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2 I have 2 compute nodes for testing, each node has a single quad core CPU. Here is my submission script and PE config: $ cat hpl-8cpu.sge #!/bin/bash # #$ -N HPL_8cpu_IB #$ -pe mpi-fu 8 #$ -cwd #$ -j y #$

Re: [OMPI users] Linux opteron infiniband sunstudio configure, problem

2009-03-31 Thread Terry Dontje
I was talking with Jeff Squyres about your issue and he thinks the config.guess issue needs to be resolved first, even though your specification of x86_64 seems to get you by. So, do you still see the unexpected "(" if you try and run config/config.guess directly? The original issue IIRC was

[OMPI users] OpenMPI 1.3.1 + BLCR build problem

2009-03-31 Thread M C
Hi guys, This is my first foray into the world of OpenMPI (MPICH 1, 2 and LAM so far), and I'm keen to test checkpointing using the BLCR kernel modules. I get the BLCR components to build just fine (v0.8.1), but the OpenMPI build fails with: % ./configure --with-blcr=/opt/blcr --with-blcr-lib

[OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread PN
Dear all, I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2 I have 2 compute nodes for testing, each node has a single quad core CPU. Here is my submission script and PE config: $ cat hpl-8cpu.sge #!/bin/bash # #$ -N HPL_8cpu_IB #$ -pe mpi-fu 8 #$ -cwd #$ -j y #$ -S /bin/bash #$ -V # cd /home/
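
To see how SGE actually spread the 8 slots across the two nodes, it can help to inspect the file named by `$PE_HOSTFILE` inside the job. The sketch below assumes the usual SGE layout of one line per host, `hostname slots queue processor_range`, and sums the slot column. `pe_hostfile_slots` is a hypothetical helper, not part of SGE or Open MPI.

```c
#include <assert.h>
#include <stdio.h>

/* Sum the slot column of an SGE $PE_HOSTFILE and report each host.
   Assumed line format: "hostname slots queue processor_range".
   Stores the number of host lines in *hosts_out; returns total slots. */
int pe_hostfile_slots(FILE *fp, int *hosts_out)
{
    char line[512], host[256];
    int slots, total = 0, hosts = 0;

    assert(fp != NULL && hosts_out != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Only the first two fields matter here. */
        if (sscanf(line, "%255s %d", host, &slots) == 2) {
            printf("%s: %d slot(s)\n", host, slots);
            total += slots;
            hosts++;
        }
    }
    *hosts_out = hosts;
    return total;
}
```

For the two quad-core nodes described here, one would expect two lines with 4 slots each. A single line carrying all 8 slots would mean the allocation itself placed the whole job on one host.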

Re: [OMPI users] Generic Type

2009-03-31 Thread Massimo Cafaro
Hi, unfortunately it's up to us to provide the starting address of the buffer and the number of elements to be received multiplied by the datatype extent. This kind of thing is dealt with automatically in the internals of collective communication operations. Massimo On 31/mar/09, at 14:00,
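
Massimo's point can be made concrete. When a gather is rebuilt from MPI_Irecv calls, each receive must be given the address of its own block. For rank i, that address is `recvbuf + i * recvcount * extent`, the same placement the MPI standard specifies for MPI_Gather. The sketch below shows that arithmetic. `gather_slot` is a hypothetical helper, and the extent is passed in as a plain number.

```c
#include <assert.h>
#include <stddef.h>

/* Address where rank i's contribution starts in a gather receive buffer:
   recvbuf + i * recvcount * extent, with extent measured in bytes.
   In MPI code, extent would be the MPI_Aint returned by
   MPI_Type_get_extent(recvtype, &lb, &extent). */
void *gather_slot(void *recvbuf, int i, int recvcount, ptrdiff_t extent)
{
    assert(recvbuf != NULL && i >= 0 && recvcount >= 0 && extent > 0);
    /* Cast to char * so the offset is counted in bytes, not elements. */
    return (char *)recvbuf + (ptrdiff_t)i * recvcount * extent;
}
```

In a wrapper like MPI_FT_Gather, the receive for rank i would then be posted as `MPI_Irecv(gather_slot(recvbuf, i, recvcount, extent), recvcount, recvtype, i, tag, comm, &reqs[i])`. Passing the bare `recvbuf` to every call would make all ranks overwrite the first block.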

Re: [OMPI users] Generic Type

2009-03-31 Thread Gabriele Fatigati
Thanks Massimo, now it works well. I erroneously thought that Irecv did this automatically using the recvtype field. 2009/3/31 Massimo Cafaro : > Hi, > > let me say that it is still not clear to me why you want to reimplement the > MPI_Gather supplied by an MPI implementation with your own version. >

Re: [OMPI users] Generic Type

2009-03-31 Thread Massimo Cafaro
Hi, let me say that it is still not clear to me why you want to reimplement the MPI_Gather supplied by an MPI implementation with your own version. You will never be able to attain the same level of performance using point-to-point communication, since MPI_Gather internally uses a binomia
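
To illustrate why a binomial tree helps, compare it with a linear gather. In a linear gather the root posts p - 1 receives one after another. In a binomial tree, each process forwards its accumulated data to a parent, and the whole operation completes in ceil(log2 p) rounds. The helpers below compute that tree shape, with ranks taken relative to the root. The names are hypothetical, and this is a generic sketch, not Open MPI's actual implementation.

```c
#include <assert.h>

/* Parent of relative rank r (> 0): clear its lowest set bit. */
int binomial_parent(int r)
{
    assert(r > 0);
    return r & (r - 1);
}

/* Number of communication rounds for p processes: ceil(log2(p)). */
int binomial_rounds(int p)
{
    int rounds = 0;
    assert(p > 0);
    while ((1 << rounds) < p)
        rounds++;
    return rounds;
}

/* Children of relative rank r: r + 2^k for every 2^k below r's lowest
   set bit (every 2^k < p for the root), as long as r + 2^k < p.
   Writes them to out[] in receive order and returns how many. */
int binomial_children(int r, int p, int *out)
{
    int n = 0;
    int limit = (r == 0) ? p : (r & -r);  /* lowest set bit of r */
    for (int mask = 1; mask < limit && r + mask < p; mask <<= 1)
        out[n++] = r + mask;
    return n;
}
```

With p = 8, the tree works out as follows:

- rank 0 receives from 1, 2 and 4;
- rank 4 receives from 5 and 6;
- rank 6 receives from 7.

The root therefore finishes after 3 rounds instead of 7 sequential receives. A relative rank maps back to a real rank as `(rel + root) % p`.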

Re: [OMPI users] Generic Type

2009-03-31 Thread Gabriele Fatigati
Mm, Open MPI functions like MPI_Irecv do pointer arithmetic over the recv buffer using the type info in ompi_datatype_t, I suppose. I'm trying to write a wrapper to MPI_Gather using Irecv functions: int MPI_FT_Gather(void*sendbuf, int sendcount, MPI_Datatype sendtype, void*recvbuff,

Re: [OMPI users] [OMPI devel] mpirun: symbol lookup error: /usr/local/lib/openmpi/mca_plm_lsf.so: undefined symbol: lsb_init

2009-03-31 Thread Alessandro Surace
Hi Jeff, Yes, I've installed LSF, and liblsf and libbat are found by configure, as you can see in the previous attachment and here: /opt/lsf/7.0/linux2.6-glibc2.3-x86/lib -rw-r--r-- 1 root 10007 1771182 Sep 24 2008 libbat.a -rw-r--r-- 1 root 10007 31278 Nov 23 2007 libbat.jsdl.a -rwxr-xr-x 1