[OMPI users] OpenMPI bug?

2008-06-12 Thread Gabriele Fatigati
Hi,
I have installed OpenMPI 1.2.6, using gcc with bounds checking. But when I
compile an MPI program, I repeatedly get the same error:

../opal/include/opal/sys/amd64/atomic.h:89:Address in memory:0x8 ..
0xb
../opal/include/opal/sys/amd64/atomic.h:89:Size: 4 bytes
../opal/include/opal/sys/amd64/atomic.h:89:Element size: 1 bytes
../opal/include/opal/sys/amd64/atomic.h:89:Number of elements:   4
../opal/include/opal/sys/amd64/atomic.h:89:Created at:
class/opal_object.c, line 52
../opal/include/opal/sys/amd64/atomic.h:89:Storage class:static
../opal/include/opal/sys/amd64/atomic.h:89:Bounds error: attempt to
reference memory overrunning the end of an object.
../opal/include/opal/sys/amd64/atomic.h:89:  Pointer value: 0x8, Size: 8

Setting the environment variable to "-never-fatal", the compile phase ends
successfully. But at runtime I still get the error above, many times,
and the program fails with "undefined status".

Is this an OpenMPI bug?





-- 
Gabriele Fatigati

CINECA Systems & Technologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it Tel: +39 051 6171722

g.fatig...@cineca.it


Re: [OMPI users] OpenMPI bug?

2008-06-12 Thread Gabriele Fatigati
I found that the error originates from this line of code:

static opal_atomic_lock_t class_lock = { { OPAL_ATOMIC_UNLOCKED } };

in class/opal_object.c, line 52

and generates the bounds error in this code block:

static inline int opal_atomic_cmpset_64( volatile int64_t *addr,
                                         int64_t oldval, int64_t newval)
{
   unsigned char ret;
   __asm__ __volatile (
                   SMPLOCK "cmpxchgq %1,%2   \n\t"
                           "sete %0  \n\t"
                   : "=qm" (ret)
                   : "q"(newval), "m"(*((volatile long*)addr)), "a"(oldval)   // <-- HERE
                   : "memory");

   return (int)ret;
}

in /opal/include/opal/sys/amd64/atomic.h, at line 89
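
In other words, the checker tracks class_lock as a 4-byte object (four
1-byte elements), while opal_atomic_cmpset_64() dereferences its address as
a 64-bit value, i.e. an 8-byte access ("Pointer value: 0x8, Size: 8"). A
minimal standalone sketch of just that size mismatch (toy_atomic_lock_t and
toy sizes are made-up stand-ins, not the actual OPAL definitions):

/* Toy illustration of the size mismatch the bounds checker reports;
 * the types here are simplified stand-ins, not the real OPAL ones. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    volatile int32_t lock;   /* a 4-byte object, like the reported class_lock */
} toy_atomic_lock_t;

static toy_atomic_lock_t class_lock = { 0 };

int main(void)
{
    /* Object the checker knows about: 4 bytes ... */
    printf("object size: %zu bytes\n", sizeof(class_lock));
    /* ... access performed by the 64-bit compare-and-swap: 8 bytes, hence
     * "attempt to reference memory overrunning the end of an object". */
    printf("access size: %zu bytes\n", sizeof(int64_t));
    return 0;
}

So the report seems to be about the 64-bit inline-assembly access touching
more bytes than the declared lock object, rather than about anything in the
application code itself.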

The environment variable mentioned earlier is GCC_BOUNDS_OPTS.

Thanks in advance.


2008/6/12 Gabriele Fatigati :

> Hi,
> I have installed OpenMPI 1.2.6, using gcc with bounds checking. But when I
> compile an MPI program, I repeatedly get the same error:
>
> ../opal/include/opal/sys/amd64/atomic.h:89:Address in memory:0x8 ..
> 0xb
> ../opal/include/opal/sys/amd64/atomic.h:89:Size: 4
> bytes
> ../opal/include/opal/sys/amd64/atomic.h:89:Element size: 1
> bytes
> ../opal/include/opal/sys/amd64/atomic.h:89:Number of elements:   4
> ../opal/include/opal/sys/amd64/atomic.h:89:Created at:
> class/opal_object.c, line 52
> ../opal/include/opal/sys/amd64/atomic.h:89:Storage class:static
> ../opal/include/opal/sys/amd64/atomic.h:89:Bounds error: attempt to
> reference memory overrunning the end of an object.
> ../opal/include/opal/sys/amd64/atomic.h:89:  Pointer value: 0x8, Size: 8
>
> Setting the environment variable to "-never-fatal", the compile phase ends
> successfully. But at runtime I still get the error above, many times,
> and the program fails with "undefined status".
>
> Is this an OpenMPI bug?
>
>
>
>
>
> --
> Gabriele Fatigati
>
> CINECA Systems & Technologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it Tel: +39 051 6171722
>
> g.fatig...@cineca.it




-- 
Gabriele Fatigati

CINECA Systems & Technologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it Tel: +39 051 6171722

g.fatig...@cineca.it


Re: [OMPI users] Problem with getting started

2008-06-12 Thread Manuel Freiberger
Hello again!

I made a few more tests, calling mpirun with different parameters. I have also
included the output from
  mpirun -debug-daemons -hostfile myhosts -np 2 mpi-test.exe


Daemon [0,0,1] checking in as pid 14725 on host bla.bla.bla.50
Daemon [0,0,2] checking in as pid 6375 on host bla.bla.bla.72
[fbmtpc31:14725] [0,0,1] orted: received launch callback
[clockwork:06375] [0,0,2] orted: received launch callback
Rank 0 is going to send
Rank 1 is waiting for answer
Rank 0 OK!
^C  <-- program hung, so I killed it
Killed by signal 2.
mpirun: killing job...

[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received kill_local_procs
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received kill_local_procs
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at 
line 1166
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 
90
[fbmtpc31:14724] ERROR: A daemon on node bla.bla.bla.72 failed to start as 
expected.
[fbmtpc31:14724] ERROR: There may be more information available from
[fbmtpc31:14724] ERROR: the remote shell (see above).
[fbmtpc31:14724] ERROR: The daemon exited unexpectedly with status 255.
mpirun noticed that job rank 0 with PID 14727 on node bla.bla.bla.50 exited on 
signal 15 (Terminated).
1 additional process aborted (not shown)
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received exit
-

This is the output of ompi_info on the first PC (OpenSUSE 10.2 machine):
                Open MPI: 1.2.6
   Open MPI SVN revision: r17946
                Open RTE: 1.2.6
   Open RTE SVN revision: r17946
                    OPAL: 1.2.6
       OPAL SVN revision: r17946
                  Prefix: /usr
 Configured architecture: i686-pc-linux-gnu
           Configured by: root
           Configured on: Wed Jun 11 17:44:46 CEST 2008
          Configure host: fbmtpc31
                Built by: manuel
                Built on: Wed Jun 11 17:50:25 CEST 2008
              Built host: fbmtpc31
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: no
      Fortran90 bindings: no
 Fortran90 bindings size: na
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: none
  Fortran77 compiler abs: none
      Fortran90 compiler: none
  Fortran90 compiler abs: none
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: no
     Fortran90 profiling: no
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.6)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.6)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.6)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.6)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.6)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.6)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.6)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.6)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.6)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.6)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.6)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.6)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.6)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.6)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.6)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.6)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.6)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.6)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.6)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.6)
                 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.6)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.6)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.6)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.6)
              MCA errmg

Re: [OMPI users] Problem with getting started [solved]

2008-06-12 Thread Manuel Freiberger
Hi!

Ok, I found the problem. I reinstalled OMPI on both PCs, but this time only
locally in the user's home directory. Now the sample code works perfectly.
I'm not sure where the error really was located. It could be that it was a
problem with the Gentoo installation, because OMPI is still marked unstable in
portage (~x86 keyword).

Best regards,
Manuel

On Wednesday 11 June 2008 18:52, Manuel Freiberger wrote:
> Hello everybody!
>
> First of all I wanted to point out that I'm a beginner regarding openMPI and
> all I'm trying to achieve is to get a simple program working on two PCs.
> So far I've installed openMPI 1.2.6 on two PCs (one running OpenSUSE 10.2,
> the other one Gentoo).
> I set up two identical users on both systems and made sure that I can make
> an SSH connection between them using private/public key authentication.
>
> Next I ran the command
>   mpirun -np 2 --hostfile myhosts uptime
> which gave the result
>   6:41pm  up 1 day  5:16,  4 users,  load average: 0.00, 0.07, 0.17
>  18:43:45 up  7:36,  6 users,  load average: 0.00, 0.02, 0.05
> so I concluded that MPI should work in principle.
>
> Next I tried the following code which I copied from Boost.MPI:
>  snip
> #include <mpi.h>
> #include <iostream>
>
> int main(int argc, char* argv[])
> {
>   MPI_Init(&argc, &argv);
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   if (rank == 0)
>   {
> std::cout << "Rank 0 is going to send" << std::endl;
> int value = 17;
> int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
> if (result == MPI_SUCCESS)
>   std::cout << "Rank 0 OK!" << std::endl;
>   }
>   else if (rank == 1)
>   {
> std::cout << "Rank 1 is waiting for answer" << std::endl;
> int value;
> MPI_Status status;
> int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
> &status);
> if (result == MPI_SUCCESS && value == 17)
>   std::cout << "Rank 1 OK!" << std::endl;
>   }
>   MPI_Finalize();
>   return 0;
> }
>  snap
>
> Starting a parallel job using
>   mpirun -np 2 --hostfile myhosts mpi-test
> I get the answer
>   Rank 0 is going to send
>   Rank 1 is waiting for answer
>   Rank 0 OK!
> and then the program locks up. So the strange thing is that obviously the
> recv() command is blocking, which is what I do not understand.
>
> Could anybody provide some hints, where I should start looking for the
> mistake? Any help is welcome!
>
> Best regards,
> Manuel

-- 
Manuel Freiberger
Institute of Medical Engineering
Graz University of Technology
manuel.freiber...@tugraz.at


[OMPI users] parallel AMBER & PBS issue

2008-06-12 Thread Arturas Ziemys

Hi,

We have a dual-CPU Xeon cluster running Red Hat. I have compiled openMPI 1.2.6
with g95, as well as AMBER (a scientific program doing parallel molecular
simulations; Fortran 77 & 90). Both compilations seem to be fine. However,
AMBER runs successfully from the command prompt with "mpiexec -np x ",
but under the PBS batch system it fails to run in parallel and only runs on a
single CPU. I get errors like:


[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x3900
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e) 
[0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 
[0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 
[0x400190f6]

[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8] 
/home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9] 
/home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10] 
/home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11] 
/home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4) 
[0x42015574]
[Morpheus06:02155] [13] 
/home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]

[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited 
on signal 11 (Segmentation fault).

5 additional processes aborted (not shown)

If I decide to supply a machine file ($PBS_NODEFILE), it fails with:

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as 
expected.

[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c 
at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as 
expected.

[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1198

--
mpiexec was unable to cleanly terminate the daemons for this job. 
Returned value Timeout instead of ORTE_SUCCESS.

--

Help, please.

--

Arturas Ziemys, PhD
 School of Health Information Sciences
 University of Texas Health Science Center at Houston
 7000 Fannin, Suite 880
 Houston, TX 77030
 Phone: (713) 500-3975
 Fax:   (713) 500-3929  



Re: [OMPI users] Problem with getting started [solved]

2008-06-12 Thread Rainer Keller
Hi,
are you sure it was not a firewall issue on the SuSE 10.2 machine?
If there are any connections from the Gentoo machine trying to access the
orted on the SuSE machine, check /var/log/firewall.

For the time being, try stopping the firewall (as root) with
/etc/init.d/SuSEfirewall2_setup stop
and test whether it works ,-]

With best regards,
Rainer


On Thursday, 12 June 2008, Manuel Freiberger wrote:
> Hi!
>
> Ok, I found the problem. I reinstalled OMPI on both PCs, but this time only
> locally in the user's home directory. Now the sample code works perfectly.
> I'm not sure where the error really was located. It could be that it was a
> problem with the Gentoo installation, because OMPI is still marked unstable
> in portage (~x86 keyword).
>
> Best regards,
> Manuel
>
> On Wednesday 11 June 2008 18:52, Manuel Freiberger wrote:
> > Hello everybody!
> >
> > First of all I wanted to point out that I'm a beginner regarding openMPI
> > and all I'm trying to achieve is to get a simple program working on two PCs.
> > So far I've installed openMPI 1.2.6 on two PCs (one running OpenSUSE
> > 10.2, the other one Gentoo).
> > I set up two identical users on both systems and made sure that I can
> > make an SSH connection between them using private/public key
> > authentication.
> >
> > Next I ran the command
> >   mpirun -np 2 --hostfile myhosts uptime
> > which gave the result
> >   6:41pm  up 1 day  5:16,  4 users,  load average: 0.00, 0.07, 0.17
> >  18:43:45 up  7:36,  6 users,  load average: 0.00, 0.02, 0.05
> > so I concluded that MPI should work in principle.
> >
> > Next I tried the following code which I copied from Boost.MPI:
> >  snip
> > #include <mpi.h>
> > #include <iostream>
> >
> > int main(int argc, char* argv[])
> > {
> >   MPI_Init(&argc, &argv);
> >   int rank;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >   if (rank == 0)
> >   {
> > std::cout << "Rank 0 is going to send" << std::endl;
> > int value = 17;
> > int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
> > if (result == MPI_SUCCESS)
> >   std::cout << "Rank 0 OK!" << std::endl;
> >   }
> >   else if (rank == 1)
> >   {
> > std::cout << "Rank 1 is waiting for answer" << std::endl;
> > int value;
> > MPI_Status status;
> > int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
> >   &status);
> > if (result == MPI_SUCCESS && value == 17)
> >   std::cout << "Rank 1 OK!" << std::endl;
> >   }
> >   MPI_Finalize();
> >   return 0;
> > }
> >  snap
> >
> > Starting a parallel job using
> >   mpirun -np 2 --hostfile myhosts mpi-test
> > I get the answer
> >   Rank 0 is going to send
> >   Rank 1 is waiting for answer
> >   Rank 0 OK!
> > and then the program locks up. So the strange thing is that obviously the
> > recv() command is blocking, which is what I do not understand.
> >
> > Could anybody provide some hints, where I should start looking for the
> > mistake? Any help is welcome!
> >
> > Best regards,
> > Manuel



-- 

Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
 HLRS  Tel: ++49 (0)711-685 6 5858
 Nobelstrasse 19  Fax: ++49 (0)711-685 6 5832
 70550 Stuttgart  email: kel...@hlrs.de
 Germany  AIM/Skype: rusraink


[OMPI users] Brief mail services outage today

2008-06-12 Thread Tim Mattox
Hi Open MPI users and developers,
The mailman service hosted by the Open Systems Lab will be
upgraded this afternoon at 2pm Eastern, and thus will be
unavailable for about an hour.  See our sysadmin's notice:


The new version (2.1.10) of mailman was released on Apr 21, 2008.
It includes a lot of good features and many bug fixes.

The OSL would like to upgrade the current mailman (2.1.9) to get
the benefit of the new version.
The upgrade would be at 2:00PM (E.T.) on June 12, 2008.

The mailman, sendmail, and some mailman admin/info website
services would NOT be available during the following time period.
- 11:00am-12:00pm Pacific US time
- 12:00pm-1:00pm Mountain US time
- 1:00pm-2:00pm Central US time
- 2:00pm-3:00pm Eastern US time
- 7:00pm-8:00pm GMT

Please let me know if you have any concerns or questions about this upgrade.



Note: I do not know if svn checkin e-mails will be queued during that time or
if they will be lost.  So, if you have something important you are checking in
to svn, you might avoid doing so during that hour today.
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/


[OMPI users] weird problem with passing data between nodes

2008-06-12 Thread zach
I have a weird problem that shows up when I use LAM or OpenMPI, but not MPICH.

I have a parallelized code working on a really large matrix. It
partitions the matrix column-wise and ships the columns off to processors,
so any given processor is working on a matrix with the same number of
rows as the original but a reduced number of columns. Each processor
needs to send a single column vector from its own matrix to the adjacent
processor, and vice versa, as part of the algorithm.

I have found that, depending on the number of rows of the matrix (i.e.
the size of the vector being sent with MPI_Send and MPI_Recv), the
simulation will hang. Only when I reduce this dimension below a certain
maximum will the sim run properly. I have also found that this magic
number differs depending on the system I am using, e.g. my home
quad-core box or the remote cluster.

As I mentioned, I have not had this issue with MPICH. I would like to
understand why it is happening rather than just defect over to MPICH
to get by.
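
In case it helps to pin down what I mean, here is a stripped-down sketch of
the exchange pattern (made-up sizes and ring-style neighbour logic, not my
actual code). My understanding is that a blocking MPI_Send is only guaranteed
to return early while the library can buffer the message internally, so if
both neighbouring ranks call MPI_Send before MPI_Recv, the exchange can hang
once the vector grows past that buffering limit, which would match the
size-dependent behaviour I see. The second variant, using MPI_Sendrecv, does
not rely on buffering:

/* Stripped-down sketch of the boundary-column exchange (made-up sizes and
 * neighbour logic, not the actual simulation code). */
#include <mpi.h>
#include <stdlib.h>

#define NROWS 100000   /* vector length = number of matrix rows (made up) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendcol = malloc(NROWS * sizeof(double));
    double *recvcol = malloc(NROWS * sizeof(double));
    int right = (rank + 1) % size;          /* neighbours in a ring, for brevity */
    int left  = (rank + size - 1) % size;

#ifdef DEADLOCK_PRONE
    /* Pattern that can hang above a size threshold: every rank sends first
     * and only then posts its receive. */
    MPI_Send(sendcol, NROWS, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Recv(recvcol, NROWS, MPI_DOUBLE, left,  0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
#else
    /* Combined send/receive lets the library pair the transfers safely,
     * independent of message size. */
    MPI_Sendrecv(sendcol, NROWS, MPI_DOUBLE, right, 0,
                 recvcol, NROWS, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif

    free(sendcol);
    free(recvcol);
    MPI_Finalize();
    return 0;
}

With the first variant, the size at which things go wrong would depend on the
MPI library's internal buffering threshold, which might also explain why the
magic number differs between MPICH, LAM/OpenMPI, and my two systems.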

Any help would be appreciated!
zach