[OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-06 Thread Michael Schulz

Hi,

I have the same problem described by some other users: I can't
compile anything if I'm using Open MPI built with the Intel compiler.


> ompi_info --all
Segmentation fault

OpenSUSE 10.3
Kernel: 2.6.22.9-0.4-default
Intel P4

Configure-Flags: CC=icc, CXX=icpc, F77=ifort, F90=ifort

Intel-Compiler: both, C and Fortran 10.0.025

Is there any known solution?
Thanks,

Michael



Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-06 Thread Åke Sandgren
On Tue, 2007-11-06 at 10:28 +0100, Michael Schulz wrote:
> Hi,
> 
> I have the same problem described by some other users: I can't
> compile anything if I'm using Open MPI built with the Intel compiler.
> 
>  > ompi_info --all
> Segmentation fault
> 
> OpenSUSE 10.3
> Kernel: 2.6.22.9-0.4-default
> Intel P4
> 
> Configure-Flags: CC=icc, CXX=icpc, F77=ifort, F90=ifort
> 
> Intel-Compiler: both, C and Fortran 10.0.025
> 
> Is there any known solution?

I had the same problem with Pathscale.
Try this; I think it is the solution I found.

diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
--- site/opal/runtime/opal_init.c       2007-10-20 03:00:35.0 +0200
+++ amd64_ubuntu606-psc/opal/runtime/opal_init.c        2007-10-23 16:12:15.0 +0200
@@ -169,7 +169,7 @@
     }
 
     /* register params for opal */
-    if (OPAL_SUCCESS != opal_register_params()) {
+    if (OPAL_SUCCESS != (ret = opal_register_params())) {
         error = "opal_register_params";
         goto return_error;
     }

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck

2007-11-06 Thread hpe...@infonie.fr
Just a thought,

Behaviour can be unpredictable if you use MPI_Bsend or MPI_Ibsend on your
sender side, because nothing is checked with regard to the allocated buffer.

MPI_Send or MPI_Isend should be used instead.

Regards

Herve








[OMPI users] OpenMPI - can you switch off threads?

2007-11-06 Thread Mostyn Lewis

I'm trying to build a uDAPL Open MPI from last Friday's SVN, using the
QLogic/QuickSilver/SilverStorm 4.1.0.0.1 software. I can get it built
and it works within a single machine. With IB between 2 machines it fails
near the termination of a job. QLogic says they don't have a threaded
uDAPL (libpthread is in the traceback).

How do you (can you?) configure pthreads away altogether?

Mostyn


Re: [OMPI users] OpenMPI - can you switch off threads?

2007-11-06 Thread Andrew Friedley
All thread support is disabled by default in Open MPI; the uDAPL BTL is
not thread safe, nor does it make use of a threaded uDAPL implementation.
For completeness, thread support is controlled by the
--enable-mpi-threads and --enable-progress-threads options to the
configure script.
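If you want to be explicit about it at build time, a configure invocation along these lines should keep both kinds of thread support off (a sketch only; /opt/openmpi-nothreads is a placeholder prefix, and the --disable forms are simply the negations of the --enable flags mentioned above, so they only restate the defaults):

  ./configure --prefix=/opt/openmpi-nothreads \
      --disable-mpi-threads --disable-progress-threads
  make all install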


The reference you're seeing to libpthread.so.0 is a side effect of the
way we print backtraces when crashes occur, and can be ignored.


How exactly does your MPI program fail?  Make sure you take a look at 
http://www.open-mpi.org/community/help/ and provide all relevant 
information.


Andrew

Mostyn Lewis wrote:

> I'm trying to build a uDAPL Open MPI from last Friday's SVN, using the
> QLogic/QuickSilver/SilverStorm 4.1.0.0.1 software. I can get it built
> and it works within a single machine. With IB between 2 machines it fails
> near the termination of a job. QLogic says they don't have a threaded
> uDAPL (libpthread is in the traceback).
>
> How do you (can you?) configure pthreads away altogether?
>
> Mostyn


Re: [OMPI users] OpenMPI - can you switch off threads?

2007-11-06 Thread Mostyn Lewis

Andrew,

Failure looks like:


> + mpirun --prefix /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron -np 8 -machinefile H ./a.out
> Process 0 of 8 on s1470
> Process 1 of 8 on s1470
> Process 4 of 8 on s1469
> Process 2 of 8 on s1470
> Process 7 of 8 on s1469
> Process 5 of 8 on s1469
> Process 6 of 8 on s1469
> Process 3 of 8 on s1470
> 30989:a.out *->0 (f=noaffinity,0,1,2,3)
> 30988:a.out *->0 (f=noaffinity,0,1,2,3)
> 30990:a.out *->0 (f=noaffinity,0,1,2,3)
> 30372:a.out *->0 (f=noaffinity,0,1,2,3)
> 30991:a.out *->0 (f=noaffinity,0,1,2,3)
> 30370:a.out *->0 (f=noaffinity,0,1,2,3)
> 30369:a.out *->0 (f=noaffinity,0,1,2,3)
> 30371:a.out *->0 (f=noaffinity,0,1,2,3)
>  get ASYNC ERROR = 6
> [s1469:30369] *** Process received signal ***
> [s1469:30369] Signal: Segmentation fault (11)
> [s1469:30369] Signal code: Address not mapped (1)
> [s1469:30369] Failing at address: 0x110
> [s1469:30369] [ 0] /lib64/libpthread.so.0 [0x2b528ceefc10]
> [s1469:30369] [ 1] /lib64/libdapl.so(dapl_llist_next_entry+0x25) [0x2b528fba5df5]
> [s1469:30369] *** End of error message ***



> and in /var/log/messages I see:
>
> Nov  5 14:46:00 s1469 sshd[30363]: Accepted publickey for mostyn from 10.173.132.37 port 36211 ssh2
> Nov  5 14:46:25 s1469 kernel: TVpd: !ERROR! Async Event:TAVOR_EQE_TYPE_CQ_ERR: (CQ Access Error) cqn:641
> Nov  5 14:46:25 s1469 kernel: a.out[30374]: segfault at 0110 rip 2b528fba5df5 rsp 410010b0 error 4
> 
> This is reproducible.
>
> Is this OpenMPI or your libdapl that's doing this, do you think?
> 
> + ompi_info
>
> Open MPI: 1.3a1svn11022007
> Open MPI SVN revision: svn11022007
> Open RTE: 1.3a1svn11022007
> Open RTE SVN revision: svn11022007
> OPAL: 1.3a1svn11022007
> OPAL SVN revision: svn11022007
> Prefix: /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron
> Configured architecture: x86_64-unknown-linux-gnu
> Configure host: s1471
> Configured by: root
> Configured on: Fri Nov  2 16:20:29 PDT 2007
> Configure host: s1471
> Built by: mostyn
> Built on: Fri Nov  2 16:30:07 PDT 2007
> Built host: s1471
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Sparse Groups: no
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: yes
> mpirun default --prefix: no
> MPI I/O support: yes
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.3)
> MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.3)
> MCA paffinity: linux (MCA v1.0, API v1.1, Component v1.3)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.3)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.3)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.3)
> MCA installdirs: env (MCA v1.0, API v1.0, Component v1.3)
> MCA installdirs: config (MCA v1.0, API v1.0, Component v1.3)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.1, Component v1.3)
> MCA coll: inter (MCA v1.0, API v1.1, Component v1.3)
> MCA coll: self (MCA v1.0, API v1.1, Component v1.3)
> MCA coll: sm (MCA v1.0, API v1.1, Component v1.3)
> MCA coll: tuned (MCA v1.0, API v1.1, Component v1.3)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.3)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.3)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.3)
> MCA pml: cm (MCA v1.0, API v1.0, Component v1.3)
> MCA pml: dr (MCA v1.0, API v1.0, Component v1.3)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.3)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.3)
> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.3)
> MCA btl: self (MCA v1.0, API v1.0.

Re: [OMPI users] OpenMPI - can you switch off threads?

2007-11-06 Thread Andrew Friedley



Mostyn Lewis wrote:

> Andrew,
>
> Failure looks like:
>
> + mpirun --prefix /tools/openmpi/1.3a1r16632_svn/infinicon/gcc64/4.1.2/udapl/suse_sles_10/x86_64/opteron -np 8 -machinefile H ./a.out
> Process 0 of 8 on s1470
> Process 1 of 8 on s1470
> Process 4 of 8 on s1469
> Process 2 of 8 on s1470
> Process 7 of 8 on s1469
> Process 5 of 8 on s1469
> Process 6 of 8 on s1469
> Process 3 of 8 on s1470
> 30989:a.out *->0 (f=noaffinity,0,1,2,3)
> 30988:a.out *->0 (f=noaffinity,0,1,2,3)
> 30990:a.out *->0 (f=noaffinity,0,1,2,3)
> 30372:a.out *->0 (f=noaffinity,0,1,2,3)
> 30991:a.out *->0 (f=noaffinity,0,1,2,3)
> 30370:a.out *->0 (f=noaffinity,0,1,2,3)
> 30369:a.out *->0 (f=noaffinity,0,1,2,3)
> 30371:a.out *->0 (f=noaffinity,0,1,2,3)
> get ASYNC ERROR = 6


I thought this might be coming from the uDAPL BTL, but I don't see where 
in the code this could possibly be printed from.



> [s1469:30369] *** Process received signal ***
> [s1469:30369] Signal: Segmentation fault (11)
> [s1469:30369] Signal code: Address not mapped (1)
> [s1469:30369] Failing at address: 0x110
> [s1469:30369] [ 0] /lib64/libpthread.so.0 [0x2b528ceefc10]
> [s1469:30369] [ 1] /lib64/libdapl.so(dapl_llist_next_entry+0x25) [0x2b528fba5df5]
> [s1469:30369] *** End of error message ***
>
> and in /var/log/messages I see:
>
> Nov  5 14:46:00 s1469 sshd[30363]: Accepted publickey for mostyn from 10.173.132.37 port 36211 ssh2
> Nov  5 14:46:25 s1469 kernel: TVpd: !ERROR! Async Event:TAVOR_EQE_TYPE_CQ_ERR: (CQ Access Error) cqn:641
> Nov  5 14:46:25 s1469 kernel: a.out[30374]: segfault at 0110 rip 2b528fba5df5 rsp 410010b0 error 4



This makes me wonder if you're using the right DAT libraries.  Take a 
look at your dat.conf (it's usually found in /etc) and make sure that it 
is configured properly for the QLogic stuff and does NOT contain any 
lines for any other stuff (like OFED-based interfaces).  Usually each 
line contains a path to a specific library to use for a particular 
interface; make sure it's the library you want.  You might have to 
contact your uDAPL vendor for help on that.



> This is reproducible.
>
> Is this OpenMPI or your libdapl that's doing this, do you think?


I can't be sure -- every uDAPL implementation seems to interpret the 
spec differently (or completely change or leave out some functionality), 
making it hell to provide portable uDAPL support.  And currently the 
uDAPL BTL has seen little/no testing outside of Sun's and OFED's uDAPL.


What kind of interface adapters are you using?  Sounds like some kind of 
IB hardware; if possible I recommend using the OFED (openib BTL) or PSM 
(PSM MTL) interfaces instead of uDAPL.


Andrew




Re: [OMPI users] OpenMPI - can you switch off threads?

2007-11-06 Thread Mostyn Lewis

Andrew,

Thanks for looking. These machines are Sun X2200s and, looking at the OUI of
the card, it's a generic Sun Mellanox HCA.
This is SuSE SLES10 SP1 and the QuickSilver (SilverStorm) 4.1.0.0.1 software
release.

02:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode) (rev a0)
HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
  Primary image is valid, unknown source
  Secondary image is valid, unknown source

  Vital Product Data
Product Name: Lion cub
P/N: 375-3382-01
E/C: A1
S/N: 1388FMH-0728200266
Freq/Power: N/A
Checksum: Ok
Date Code: N/A

s1471:/proc/iba/mt23108/1/port2 # cat info 
Port 2 Info

   PortState: Active   PhysState: LinkUpDownDefault: Polling
   LID:0x0392  LMC: 0
   Subnet: 0xfe80  GUID: 0x0003ba000100430e
   SMLID:  0x0001   SMSL:  0   RespTimeout :  33 ms  SubnetTimeout:  536 ms
   M_KEY:  0x  Lease: 0 s   Protect: Readonly
   MTU:   Active:2048  Supported:2048  VL Stall: 0
   LinkWidth: Active:  4x  Supported:1-4x  Enabled:1-4x
   LinkSpeed: Active:   2.5Gb  Supported:   2.5Gb  Enabled:   2.5Gb
   VLs:   Active: 4+1  Supported: 4+1  HOQLife: 4096 ns
   Capability 0x02010048: CR CM SL Trap
   Violations: M_Key: 0 P_Key: 0 Q_Key: 0
   ErrorLimits: Overrun: 15 LocalPhys: 15  DiagCode: 0x
   P_Key Enforcement: In: Off Out: Off  FilterRaw: In: Off Out: Off

s1471:/proc/iba/mt23108/1/port2 # cat /etc/dat.conf
#
# DAT 1.1 configuration file
#
# Each entry should have the following fields:
#
#  \ 
# 

#
# [ICS VERSION STRING: @(#) ./config/dat.conf.64 4_1_0_0_1G [10/22/07 19:25]

# Following are examples of valid entries:
#Hca  u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "
#Hca0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 " " "
#Hca1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost1 " " "
#Hca0Port1 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib1" " "
#Hca0Port2 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 "InfiniHost0 ib2" " "
#===
InfiniHost0 u1.1 nonthreadsafe default /lib64/libdapl.so ri.1.1 " " " "

QLogic now say they can reproduce it.

However, as we use the SilverStorm stuff a lot with many compilers, and for
things such as IB transport for the Lustre filesystem, we try to stick with
not too many flavors of IB/MPI; but we also sometimes use OFED and QLogic's
OFED for their PathScale cards. We also throw in Scali, MVAPICH and MPICH -
so we have a real mix to handle.

Regarding the lack of mvapi support in Open MPI, there's just uDAPL left for
the likes of SilverStorm :-(

Thanks for looking,
Mostyn



Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-06 Thread Jeff Squyres

On Nov 6, 2007, at 4:28 AM, Michael Schulz wrote:


> I have the same problem described by some other users: I can't
> compile anything if I'm using Open MPI built with the Intel compiler.
>
> > ompi_info --all
> Segmentation fault
>
> OpenSUSE 10.3
> Kernel: 2.6.22.9-0.4-default
> Intel P4
>
> Configure-Flags: CC=icc, CXX=icpc, F77=ifort, F90=ifort
>
> Intel-Compiler: both, C and Fortran 10.0.025


There's no real reason this should be happening.  Do you know if Intel  
10.0.025 is fully supported on OpenSUSE 10.3 / P4?  We've seen  
scenarios where compilers on unsupported platforms make Open MPI (and  
other MPIs) fail in interesting / unexplainable ways.


If it is, can you send the information described here:

http://www.open-mpi.org/community/help/

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-06 Thread Jeff Squyres

On Nov 6, 2007, at 4:42 AM, Åke Sandgren wrote:


> I had the same problem with Pathscale.


There is a known outstanding problem with the Pathscale compiler.  I am  
still waiting for a solution from their engineers (we don't know yet  
whether it's an OMPI issue or a Pathscale issue, but my [biased] money  
is on a Pathscale issue :-) -- it doesn't happen with any other  
compiler).



> Try this; I think it is the solution I found.
>
> diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
> --- site/opal/runtime/opal_init.c       2007-10-20 03:00:35.0 +0200
> +++ amd64_ubuntu606-psc/opal/runtime/opal_init.c        2007-10-23 16:12:15.0 +0200
> @@ -169,7 +169,7 @@
>      }
>
>      /* register params for opal */
> -    if (OPAL_SUCCESS != opal_register_params()) {
> +    if (OPAL_SUCCESS != (ret = opal_register_params())) {
>          error = "opal_register_params";
>          goto return_error;
>      }


I don't see why this change would make any difference in terms of a  
segv...?


I see that ret is an uninitialized variable in the error case (which  
I'll fix -- thanks for pointing it out :-) ) -- but I don't see how  
that would fix a segv.  Am I missing something?


--
Jeff Squyres
Cisco Systems




[OMPI users] Job does not quit even when the simulation dies

2007-11-06 Thread Teng Lin

Hi,


I just realized I have a job that has been running for a long time, while  
some of the nodes have already died. Is there any way to ask the other nodes to quit?



[kyla-0-1.local:09741] mca_btl_tcp_frag_send: writev failed with  
errno=104
[kyla-0-1.local:09742] mca_btl_tcp_frag_send: writev failed with  
errno=104


The FAQ does mention that it is related to:
"Connection reset by peer: These types of errors usually occur after  
MPI_INIT has completed, and typically indicate that an MPI process has  
died unexpectedly (e.g., due to a seg fault). The specific error  
message indicates that a peer MPI process tried to write to the  
now-dead MPI process and failed."


Thanks,
Teng


Re: [OMPI users] machinefile and rank

2007-11-06 Thread Jeff Squyres
Unfortunately, not yet.  I believe that this kind of functionality is  
slated for the v1.3 series -- is that right Ralph/Voltaire?



On Nov 5, 2007, at 11:22 AM, Karsten Bolding wrote:


Hello

I'm using a machinefile like:
n03
n04
n03
n03
n03
n02
n01
..
..
..

the order of the entries is determined by an external program for load
balancing reasons. When the job is started, the ranks do not correspond
to the entries in the machinefile. Is there a way to force entry one in
the machinefile to get rank=0, the second entry rank=1, etc.?


Karsten


--
--
Karsten Bolding        Bolding & Burchard Hydrodynamics
Strandgyden 25         Phone: +45 64422058
DK-5466 Asperup        Fax:   +45 64422068
Denmark                Email: kars...@bolding-burchard.com

http://www.findvej.dk/Strandgyden25,5466,11,3
--



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Slightly OT: mpi job terminates

2007-11-06 Thread Jeff Squyres
You might want to run your app through a memory-checking debugger to  
see if anything obvious shows up.


Also, check to see that your core-file size limit is greater than zero (i.e.,  
make it "unlimited").  Then run again and see if you can get core files;  
if your app is silently dumping core, these would give you  
a clue as to what is going on.
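For example, from a bash shell something like the following would lift the core limit before the run and let you inspect any core file that shows up (a sketch only, reusing the mpirun command from the message quoted below; where a core file lands and what it is named depend on the system):

  ulimit -c unlimited
  mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096
  # if a node leaves a core file behind, look at the backtrace:
  gdb bin/getm_prod_IFORT.0096x0096 core

Note that the limit may also need to be raised on the remote nodes (e.g., in the shell startup files there), since the MPI processes are started on those nodes rather than in the local shell.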


These are the areas where I typically start with parallel debugging;  
hopefully this is at least somewhat helpful...



On Nov 1, 2007, at 2:59 PM, Karsten Bolding wrote:

This is not OpenMPI specific - but maybe somebody on the list can give a hint.

I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096


everything starts OK and the simulation carries on for 2+ hours of
wall clock time - then suddenly, without a trace in the logfile:

   19:48:46.172 n=1800
   2003-09-01 05:06:00: reading 2D boundary data ...
   19:49:21.710 n=1900
   19:49:50.490 n=2000

or in any system logfiles, the simulation stops and all related
processes on the nodes stop.

If I re-run, the simulation does not stop at the same time.

Does anybody have a clue where I should search?

I use a 4 machine/dual P/dual core cluster connected via GBit/s  
ethernet.


Karsten

PS: If I use MPICH I get the same problem.


--
--
Karsten Bolding        Bolding & Burchard Hydrodynamics
Strandgyden 25         Phone: +45 64422058
DK-5466 Asperup        Fax:   +45 64422068
Denmark                Email: kars...@bolding-burchard.com

http://www.findvej.dk/Strandgyden25,5466,11,3
--



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Node assignment using openmpi for multiple simulations in the same submission script in PBS (GROMACS)

2007-11-06 Thread Jeff Squyres

On Nov 2, 2007, at 11:02 AM, himanshu khandelia wrote:


This question is about the use of a simulation package called GROMACS.

PS: On our cluster (quad-core nodes), GROMACS does not scale well
beyond 4 cpus. So, I wish to run two different simulations, while
requesting 2 nodes (1 simulation on each node) to best exploit the
policies of our MAUI scheduler.

So, I am requesting 2 4-cpu nodes on a cluster using PBS. I want to
run a separate simulation on each 4-cpu node. However, on 2 nodes, the
speed of each simulation decreases (by 50 to 100%) compared to a
simulation which runs in a job that requests only one node. I am
guessing this is because Open MPI fails to assign all CPUs of the same
node to one simulation? Instead, CPUs from different nodes are being
used to run each simulation. This is what I have in the PBS script:

1.

mpirun -np 4  my-gromacs-executable-for-simulation-1 -np 4  &
mpirun -np 4  my-gromacs-executable-for-simulation-2 -np 4  &


In this case, Open MPI does not realize that you have executed 2  
mpiruns and therefore assigns both the first and second job to the  
same 4 processors.  Hence, they run at half speed (or slower) because  
they're competing for the same CPUs.



# (THE GROMACS EXECUTABLE DOES REQUIRE A REPEAT REQUEST FOR THE NUMBER
OF PROCESSORS)

wait


OPENMPI does have a mechanism whereby one can assign specific
processes to specific nodes

http://www.open-mpi.org/faq/?category=running#mpirun-scheduling

So, I also tried all of the following in the PBS script where the
--bynode or the --byslot option is used

2.

mpirun -np 4  --bynode my-gromacs-executable-for-simulation-1  -np 4 &
mpirun -np 4  --bynode my-gromacs-executable-for-simulation-2  -np 4 &
wait


3.

mpirun -np 4  --byslot  my-gromacs-executable-for-simulation-1 -np 4 &
mpirun -np 4  --byslot  my-gromacs-executable-for-simulation-2 -np 4 &
wait


But these methods also result in similar performance losses.


The same thing will happen here -- OMPI is unaware of the 2 mpiruns,  
and therefore schedules on the same first 4 nodes or slots.


It sounds like you really want to run two torque jobs, not one.  Is  
there a reason you're not doing that?


Failing that, you could do an end-run around the Open MPI torque  
support and use the rsh launcher to precisely control where you launch  
jobs, but that's a bunch of trouble and not really how we intended the  
system to be used.
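One lighter-weight possibility, different from the rsh end-run described above, would be to pin each mpirun to a single node with mpirun's --host option (a sketch only, untested here; n01 and n02 are placeholder hostnames that would have to come from $PBS_NODEFILE, and listing a host four times is meant to give four slots on it -- whether this interacts cleanly with the Torque allocation is worth verifying):

  mpirun -np 4 --host n01,n01,n01,n01 my-gromacs-executable-for-simulation-1 -np 4 &
  mpirun -np 4 --host n02,n02,n02,n02 my-gromacs-executable-for-simulation-2 -np 4 &
  wait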




So how does one assign the cpus properly using mpirun if running
different simulations in the same PBS job  ??

Thank you for the help,

-Himanshu



--
Jeff Squyres
Cisco Systems