[OMPI users] OpenMPI bug?
Hi,
I have installed Open MPI 1.2.6, using gcc with bounds checking. But when I compile an MPI program, I get the same error many times:

../opal/include/opal/sys/amd64/atomic.h:89:Address in memory: 0x8 .. 0xb
../opal/include/opal/sys/amd64/atomic.h:89:Size: 4 bytes
../opal/include/opal/sys/amd64/atomic.h:89:Element size: 1 bytes
../opal/include/opal/sys/amd64/atomic.h:89:Number of elements: 4
../opal/include/opal/sys/amd64/atomic.h:89:Created at: class/opal_object.c, line 52
../opal/include/opal/sys/amd64/atomic.h:89:Storage class: static
../opal/include/opal/sys/amd64/atomic.h:89:Bounds error: attempt to reference memory overrunning the end of an object.
../opal/include/opal/sys/amd64/atomic.h:89:  Pointer value: 0x8, Size: 8

Setting the environment variable to "-never-fatal", the compile phase ends successfully. But at runtime I still get the error above, many times, and the program fails with "undefined status".

Is this an Open MPI bug?

--
Gabriele Fatigati

CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it    Tel: +39 051 6171722
g.fatig...@cineca.it
Re: [OMPI users] OpenMPI bug?
I found that the error starts at this line of code in class/opal_object.c, line 52:

static opal_atomic_lock_t class_lock = { { OPAL_ATOMIC_UNLOCKED } };

and the bounds error is generated in this code block in opal/include/opal/sys/amd64/atomic.h, at line 89:

static inline int opal_atomic_cmpset_64( volatile int64_t *addr,
                                         int64_t oldval, int64_t newval)
{
   unsigned char ret;
   __asm__ __volatile (
                       SMPLOCK "cmpxchgq %1,%2   \n\t"
                               "sete     %0      \n\t"
                       : "=qm" (ret)
                       : "q"(newval), "m"(*((volatile long*)addr)), "a"(oldval)   // <-- HERE (line 89)
                       : "memory");

   return (int)ret;
}

The environment variable I mentioned previously is GCC_BOUNDS_OPTS.

Thanks in advance.

2008/6/12 Gabriele Fatigati:
> [...]

--
Gabriele Fatigati

CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it    Tel: +39 051 6171722
g.fatig...@cineca.it
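For reference, here is a minimal, self-contained sketch (not Open MPI code) of the same lock-prefixed cmpxchgq compare-and-swap idiom, for gcc on x86-64. In the sketch the operand really is an 8-byte variable, so the access stays in bounds; the report above instead shows an 8-byte reference ("Pointer value: 0x8, Size: 8") into a 4-byte static object (class_lock), which appears to be what the bounds checker objects to in the "m"(*((volatile long*)addr)) operand. All names below are placeholders.

/* Hedged sketch only: a plain 64-bit compare-and-swap for x86-64/gcc.
 * Unlike the flagged Open MPI line, the memory operand here is a true
 * 8-byte object, so a bounds checker has nothing to complain about. */
#include <stdint.h>
#include <stdio.h>

static inline int cmpset_64(volatile int64_t *addr, int64_t oldval, int64_t newval)
{
    unsigned char ret;
    __asm__ __volatile__("lock; cmpxchgq %3,%1 \n\t"
                         "sete %0"
                         : "=q"(ret), "+m"(*addr), "+a"(oldval)
                         : "q"(newval)
                         : "memory", "cc");
    return (int)ret;
}

int main(void)
{
    volatile int64_t lock = 0;                        /* genuine 8-byte target */
    printf("0 -> 1: %d\n", cmpset_64(&lock, 0, 1));   /* succeeds, prints 1 */
    printf("0 -> 2: %d\n", cmpset_64(&lock, 0, 2));   /* fails (lock is now 1), prints 0 */
    return 0;
}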
Re: [OMPI users] Problem with getting started
Hello again!

I ran a few more tests, calling mpirun with different parameters. I have also included the output of

  mpirun -debug-daemons -hostfile myhosts -np 2 mpi-test.exe

Daemon [0,0,1] checking in as pid 14725 on host bla.bla.bla.50
Daemon [0,0,2] checking in as pid 6375 on host bla.bla.bla.72
[fbmtpc31:14725] [0,0,1] orted: received launch callback
[clockwork:06375] [0,0,2] orted: received launch callback
Rank 0 is going to send
Rank 1 is waiting for answer
Rank 0 OK!
^C   <-- program hung so killed it
Killed by signal 2.
mpirun: killing job...
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received kill_local_procs
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received kill_local_procs
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[fbmtpc31:14724] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[fbmtpc31:14724] ERROR: A daemon on node bla.bla.bla.72 failed to start as expected.
[fbmtpc31:14724] ERROR: There may be more information available from
[fbmtpc31:14724] ERROR: the remote shell (see above).
[fbmtpc31:14724] ERROR: The daemon exited unexpectedly with status 255.
mpirun noticed that job rank 0 with PID 14727 on node bla.bla.bla.50 exited on signal 15 (Terminated).
1 additional process aborted (not shown)
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received message from [0,0,0]
[fbmtpc31:14725] [0,0,1] orted_recv_pls: received exit

This is the output of ompi_info on the first PC (the OpenSUSE 10.2 machine):

Open MPI: 1.2.6
Open MPI SVN revision: r17946
Open RTE: 1.2.6
Open RTE SVN revision: r17946
OPAL: 1.2.6
OPAL SVN revision: r17946
Prefix: /usr
Configured architecture: i686-pc-linux-gnu
Configured by: root
Configured on: Wed Jun 11 17:44:46 CEST 2008
Configure host: fbmtpc31
Built by: manuel
Built on: Wed Jun 11 17:50:25 CEST 2008
Built host: fbmtpc31
C bindings: yes
C++ bindings: yes
Fortran77 bindings: no
Fortran90 bindings: no
Fortran90 bindings size: na
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: none
Fortran77 compiler abs: none
Fortran90 compiler: none
Fortran90 compiler abs: none
C profiling: yes
C++ profiling: yes
Fortran77 profiling: no
Fortran90 profiling: no
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.6)
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.6)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.6)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.6)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.6)
MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.6)
MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.6)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.6)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.6)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.6)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.6)
MCA io: romio (MCA v1.0, API v1.0, Component v1.2.6)
MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.6)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.6)
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.6)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.6)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.6)
MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.6)
MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.6)
MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.6)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.6)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.6)
MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.6)
MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.6)
MCA errmg
Re: [OMPI users] Problem with getting started [solved]
Hi!

OK, I found the problem. I reinstalled OMPI on both PCs, but this time only locally in the user's home directory. Now the sample code works perfectly.
I'm not sure where the error really was located. It could be that it was a problem with the Gentoo installation, because OMPI is still marked unstable in portage (~x86 keyword).

Best regards,
Manuel

On Wednesday 11 June 2008 18:52, Manuel Freiberger wrote:
> Hello everybody!
>
> First of all I wanted to point out that I'm a beginner regarding Open MPI, and
> all I try to achieve is to get a simple program working on two PCs.
> So far I've installed Open MPI 1.2.6 on two PCs (one running OpenSUSE 10.2,
> the other one Gentoo).
> I set up two identical users on both systems and made sure that I can make
> an SSH connection between them using private/public key authentication.
>
> Next I ran the command
>   mpirun -np 2 --hostfile myhosts uptime
> which gave the result
>   6:41pm up 1 day 5:16, 4 users, load average: 0.00, 0.07, 0.17
>   18:43:45 up 7:36, 6 users, load average: 0.00, 0.02, 0.05
> so I concluded that MPI should work in principle.
>
> Next I tried the following code, which I copied from Boost.MPI:
> snip
> #include <mpi.h>
> #include <iostream>
>
> int main(int argc, char* argv[])
> {
>   MPI_Init(&argc, &argv);
>   int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   if (rank == 0)
>   {
>     std::cout << "Rank 0 is going to send" << std::endl;
>     int value = 17;
>     int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>     if (result == MPI_SUCCESS)
>       std::cout << "Rank 0 OK!" << std::endl;
>   }
>   else if (rank == 1)
>   {
>     std::cout << "Rank 1 is waiting for answer" << std::endl;
>     int value;
>     MPI_Status status;
>     int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
>     if (result == MPI_SUCCESS && value == 17)
>       std::cout << "Rank 1 OK!" << std::endl;
>   }
>   MPI_Finalize();
>   return 0;
> }
> snap
>
> Starting a parallel job using
>   mpirun -np 2 --hostfile myhosts mpi-test
> I get the answer
>   Rank 0 is going to send
>   Rank 1 is waiting for answer
>   Rank 0 OK!
> and then the program locks up. So the strange thing is that obviously the
> recv() command is blocking, which is what I do not understand.
>
> Could anybody provide some hints on where I should start looking for the
> mistake? Any help is welcome!
>
> Best regards,
> Manuel

--
Manuel Freiberger
Institute of Medical Engineering
Graz University of Technology
manuel.freiber...@tugraz.at
[OMPI users] parallel AMBER & PBS issue
Hi,

We have a dual-CPU Xeon cluster running Red Hat. I have compiled Open MPI 1.2.6 with g95, and then AMBER (a scientific program for parallel molecular simulations; Fortran 77 & 90). Both compilations seem to be fine. However, while AMBER runs successfully from the command prompt with "mpiexec -np x ", under the PBS batch system it fails to run in parallel and runs on only a single CPU. I get errors like:

[Morpheus06:02155] *** Process received signal ***
[Morpheus06:02155] Signal: Segmentation fault (11)
[Morpheus06:02155] Signal code: Address not mapped (1)
[Morpheus06:02155] Failing at address: 0x3900
[Morpheus06:02155] [ 0] /lib/tls/libpthread.so.0 [0x401ad610]
[Morpheus06:02155] [ 1] /lib/tls/libc.so.6 [0x420eb85e]
[Morpheus06:02155] [ 2] /lib/tls/libc.so.6(__cxa_finalize+0x7e) [0x42029eae]
[Morpheus06:02155] [ 3] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x40018325]
[Morpheus06:02155] [ 4] /home/aziemys/bin/openmpi/lib/libmpi_f90.so.0 [0x400190f6]
[Morpheus06:02155] [ 5] /lib/ld-linux.so.2 [0x4000c894]
[Morpheus06:02155] [ 6] /lib/tls/libc.so.6(exit+0x70) [0x42029c20]
[Morpheus06:02155] [ 7] /home/aziemys/bin/amber9/exe/sander.MPI [0x82beb63]
[Morpheus06:02155] [ 8] /home/aziemys/bin/amber9/exe/sander.MPI(_g95_exit_4+0x2c) [0x82bd648]
[Morpheus06:02155] [ 9] /home/aziemys/bin/amber9/exe/sander.MPI(mexit_+0x9f) [0x817cd03]
[Morpheus06:02155] [10] /home/aziemys/bin/amber9/exe/sander.MPI(MAIN_+0x3639) [0x80e8e51]
[Morpheus06:02155] [11] /home/aziemys/bin/amber9/exe/sander.MPI(main+0x2d) [0x82bb471]
[Morpheus06:02155] [12] /lib/tls/libc.so.6(__libc_start_main+0xe4) [0x42015574]
[Morpheus06:02155] [13] /home/aziemys/bin/amber9/exe/sander.MPI(sinh+0x49) [0x80697a1]
[Morpheus06:02155] *** End of error message ***
mpiexec noticed that job rank 0 with PID 2150 on node Morpheus06 exited on signal 11 (Segmentation fault).
5 additional processes aborted (not shown)

If I supply a machine file ($PBS_NODEFILE) explicitly, it fails with:

Host key verification failed.
Host key verification failed.
[Morpheus06:02107] ERROR: A daemon on node Morpheus09 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[Morpheus06:02107] ERROR: A daemon on node Morpheus07 failed to start as expected.
[Morpheus06:02107] ERROR: There may be more information available from
[Morpheus06:02107] ERROR: the remote shell (see above).
[Morpheus06:02107] ERROR: The daemon exited unexpectedly with status 255.
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[Morpheus06:02107] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--
mpiexec was unable to cleanly terminate the daemons for this job. Returned value Timeout instead of ORTE_SUCCESS.
--

Help, please.

--
Arturas Ziemys, PhD
School of Health Information Sciences
University of Texas Health Science Center at Houston
7000 Fannin, Suite 880
Houston, TX 77030
Phone: (713) 500-3975
Fax:   (713) 500-3929
Re: [OMPI users] Problem with getting started [solved]
Hi,

Are you sure it was not a firewall issue on the SUSE 10.2 machine? If connections from the Gentoo machine trying to reach the orted on the SUSE box are being blocked, you should see them in /var/log/firewall.

For the time being, try stopping the firewall (as root) with

  /etc/init.d/SuSEfirewall2_setup stop

and test whether it works ,-]

With best regards,
Rainer

On Thursday, 12 June 2008, Manuel Freiberger wrote:
> Hi!
>
> OK, I found the problem. I reinstalled OMPI on both PCs, but this time only
> locally in the user's home directory. Now the sample code works perfectly.
> [...]

--
Dipl.-Inf. Rainer Keller    http://www.hlrs.de/people/keller
  HLRS                      Tel: ++49 (0)711-685 6 5858
  Nobelstrasse 19           Fax: ++49 (0)711-685 6 5832
  70550 Stuttgart           email: kel...@hlrs.de
  Germany                   AIM/Skype: rusraink
[OMPI users] Brief mail services outage today
Hi Open MPI users and developers,

The Mailman service hosted by the Open Systems Lab will be upgraded this afternoon at 2pm Eastern, and thus will be unavailable for about an hour. See our sysadmin's notice:

  The new version (2.1.10) of Mailman was released on Apr 21, 2008. It includes
  a lot of good features, and many bugs are fixed. The OSL would like to upgrade
  the current Mailman (2.1.9) to get the benefit of the new version. The upgrade
  will be at 2:00PM (E.T.) on June 12, 2008.

  The Mailman, sendmail, and some Mailman admin/info website services will NOT
  be available during the following time period:
   - 11:00am-12:00pm Pacific US time
   - 12:00pm-1:00pm Mountain US time
   - 1:00pm-2:00pm Central US time
   - 2:00pm-3:00pm Eastern US time
   - 7:00pm-8:00pm GMT

  Please let me know if you have any concerns or questions about this upgrade.

Note: I do not know whether svn checkin e-mails will be queued during that time or whether they will be lost. So if you have something important you are checking in to svn, you might avoid doing so during that hour today.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/
[OMPI users] weird problem with passing data between nodes
I have a weird problem that shows up when I use LAM or Open MPI but not MPICH.

I have a parallelized code working on a really large matrix. It partitions the matrix column-wise and ships the pieces off to processors, so any given processor is working on a matrix with the same number of rows as the original but a reduced number of columns. As part of the algorithm, each processor needs to send a single column vector entry from its own matrix to the adjacent processor, and vice versa.

I have found that depending on the number of rows of the matrix - that is, the size of the vector being sent using MPI_Send and MPI_Recv - the simulation will hang. Only when I reduce this dimension below a certain maximum number does the simulation run properly. I have also found that this magic number differs depending on the system I am using, e.g. my home quad-core box or a remote cluster.

As I mentioned, I have not had this issue with MPICH. I would like to understand why it is happening rather than just defect over to MPICH to get by.

Any help would be appreciated!

zach
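For what it's worth, here is a minimal, self-contained sketch of the kind of neighbour exchange described above; the actual code is not shown in the post, so the send-before-receive ordering, the names, and the sizes below are assumptions. A blocking MPI_Send is only guaranteed to return independently of the receiver when the message is small enough for the library to buffer it (the "eager" limit, which differs between MPI implementations and systems); if both partners post their sends first, the exchange can deadlock once the vector grows past that limit, which would match a size-dependent, system-dependent hang. A combined MPI_Sendrecv (or non-blocking MPI_Isend/MPI_Irecv) avoids this regardless of message size.

/* Hedged sketch of a pairwise column exchange; placeholder names and
 * sizes, not taken from the original program. */
#include <mpi.h>
#include <stdlib.h>

#define NROWS 100000   /* hypothetical column length (number of matrix rows) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *my_col    = calloc(NROWS, sizeof(double));
    double *their_col = calloc(NROWS, sizeof(double));
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* Unsafe pattern (assumed, not shown in the post): every rank sends
     * first, then receives.  This works only while the message fits in
     * the MPI library's internal buffers and can hang otherwise:
     *
     *   MPI_Send(my_col,    NROWS, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
     *   MPI_Recv(their_col, NROWS, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     */

    /* Deadlock-free exchange, independent of message size: */
    MPI_Sendrecv(my_col,    NROWS, MPI_DOUBLE, right, 0,
                 their_col, NROWS, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(my_col);
    free(their_col);
    MPI_Finalize();
    return 0;
}

(Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out; each rank exchanges one column with each neighbour in both directions.)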