I'm unaware of any CentOS-OMPI bug, and I've been using CentOS throughout the 6.x series running OMPI 1.6.x and above. I can't speak to the older versions of CentOS or the older versions of OMPI.

On Jul 19, 2014, at 8:14 PM, Lane, William <william.l...@cshs.org> wrote:

> Yes, there is a second HPC Sun Grid Engine cluster on which I've run
> this openMPI test code dozens of times on upwards of 400 slots
> through SGE using qsub and qrsh, but that was with a much older
> version of openMPI (1.3.3, I believe). On that particular cluster the
> open-files hard and soft limits were an issue.
>
> I have also noticed that a new (as of July 2014) CentOS openMPI bug
> has been reported that occurs when CentOS is upgraded from 6.2 to 6.3.
> I'm not sure whether that bug applies to this situation, though.
>
> This particular problem occurs whether I submit jobs through SGE (via
> qrsh or qsub) or run them outside of SGE, which leads me to believe it
> is an openMPI and/or CentOS issue.
>
> -Bill Lane
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Saturday, July 19, 2014 3:21 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>
> Not for this test case size. You should be just fine with the default
> values.
>
> If I understand you correctly, you've run this app at scale before on
> another cluster without problems?
>
> On Jul 19, 2014, at 1:34 PM, Lane, William <william.l...@cshs.org> wrote:
>
>> Ralph,
>>
>> It's hard to imagine it's the openMPI code, because I've tested this
>> code extensively on another cluster with 400 nodes and never had any
>> problems. But I'll try the hello_c example in any case. Is it still
>> recommended to raise the open files soft and hard limits to 4096? Or
>> might even larger values be necessary?
>>
>> Thank you for your help.
>>
>> -Bill Lane
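
On the open-files question: one way to verify what limits each rank actually inherits (which is what matters, rather than what an interactive shell reports) is to launch a tiny checker under mpirun. This is an illustrative sketch, not code from the thread; the file name check_nofile.c and everything in it are assumptions:

    #include <stdio.h>
    #include <sys/resource.h>

    /* Print the soft and hard RLIMIT_NOFILE values this process sees.
     * RLIM_INFINITY will show up as a very large number. */
    int main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
        return 0;
    }

Compiled with mpicc (or plain gcc) and run with mpirun -np <N>, it shows whether remotely started processes inherit lower limits than a login shell on the same node.
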
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Saturday, July 19, 2014 8:07 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>
>> That's a pretty old OMPI version, and we don't really support it any
>> longer. However, I can provide some advice:
>>
>> * Have you tried running the simple "hello_c" example we provide? This
>>   would at least tell you if the problem is in your app, which is what
>>   I'd expect given your description.
>>
>> * Try using gdb (or pick your debugger) to look at the corefile and
>>   see where it is failing.
>>
>> I'd also suggest updating OMPI to the 1.6.5 or 1.8.1 versions, but I
>> doubt that's the issue behind this problem.
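
For reference, a minimal smoke test in the spirit of the hello_c example mentioned above (a from-memory sketch, not the exact examples/hello_c.c that ships with Open MPI):

    #include <stdio.h>
    #include <mpi.h>

    /* Each rank reports in; if this also dies with signal 11 above
     * 28 ranks, the problem is below the application code. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello, world, I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Build and run with, e.g., mpicc hello_c.c -o hello_c && mpirun -np 32 ./hello_c.
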
>>
>> On Jul 19, 2014, at 1:05 AM, Lane, William <william.l...@cshs.org> wrote:
>>
>>> I'm getting consistent errors of the form:
>>>
>>> "mpirun noticed that process rank 3 with PID 802 on node csclprd3-0-8
>>> exited on signal 11 (Segmentation fault)."
>>>
>>> whenever I request more than 28 slots. These errors occur even when I
>>> run mpirun locally on a compute node that has 32 slots (8 cores, 16
>>> with hyperthreading).
>>>
>>> When I request fewer than 28 slots I have no problems whatsoever.
>>>
>>> OS: CentOS release 6.3 (Final)
>>>
>>> openMPI information:
>>>
>>> Package: Open MPI mockbu...@c6b8.bsys.dev.centos.org Distribution
>>> Open MPI: 1.5.4
>>> Open MPI SVN revision: r25060
>>> Open MPI release date: Aug 18, 2011
>>> Open RTE: 1.5.4
>>> Open RTE SVN revision: r25060
>>> Open RTE release date: Aug 18, 2011
>>> OPAL: 1.5.4
>>> OPAL SVN revision: r25060
>>> OPAL release date: Aug 18, 2011
>>> Ident string: 1.5.4
>>> Prefix: /usr/lib64/openmpi
>>> Configured architecture: x86_64-unknown-linux-gnu
>>> Configure host: c6b8.bsys.dev.centos.org
>>> Configured by: mockbuild
>>> Configured on: Fri Jun 22 06:42:03 UTC 2012
>>> Configure host: c6b8.bsys.dev.centos.org
>>> Built by: mockbuild
>>> Built on: Fri Jun 22 06:46:48 UTC 2012
>>> Built host: c6b8.bsys.dev.centos.org
>>> C bindings: yes
>>> C++ bindings: yes
>>> Fortran77 bindings: yes (all)
>>> Fortran90 bindings: yes
>>> Fortran90 bindings size: small
>>> C compiler: gcc
>>> C compiler absolute: /usr/bin/gcc
>>> C compiler family name: GNU
>>> C compiler version: 4.4.6
>>> C++ compiler: g++
>>> C++ compiler absolute: /usr/bin/g++
>>> Fortran77 compiler: gfortran
>>> Fortran77 compiler abs: /usr/bin/gfortran
>>> Fortran90 compiler: gfortran
>>> Fortran90 compiler abs: /usr/bin/gfortran
>>> C profiling: yes
>>> C++ profiling: yes
>>> Fortran77 profiling: yes
>>> Fortran90 profiling: yes
>>> C++ exceptions: no
>>> Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)
>>> Sparse Groups: no
>>> Internal debug support: no
>>> MPI interface warnings: no
>>> MPI parameter check: runtime
>>> Memory profiling support: no
>>> Memory debugging support: no
>>> libltdl support: yes
>>> Heterogeneous support: no
>>> mpirun default --prefix: no
>>> MPI I/O support: yes
>>> MPI_WTIME support: gettimeofday
>>> Symbol vis. support: yes
>>> MPI extensions: affinity example
>>> FT Checkpoint support: no (checkpoint thread: no)
>>> MPI_MAX_PROCESSOR_NAME: 256
>>> MPI_MAX_ERROR_STRING: 256
>>> MPI_MAX_OBJECT_NAME: 64
>>> MPI_MAX_INFO_KEY: 36
>>> MPI_MAX_INFO_VAL: 256
>>> MPI_MAX_PORT_NAME: 1024
>>> MPI_MAX_DATAREP_STRING: 128
>>> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA memory: linux (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA carto: file (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA timer: linux (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: inter (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: self (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA pml: bfo (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA pml: csum (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA pml: v (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA btl: ofud (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA btl: self (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA odls: default (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ras: cm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ras: loadleveler (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA rml: oob (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: cm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: direct (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: linear (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: radix (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA routed: slave (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA plm: rshd (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: env (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: slave (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA ess: tool (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.5.4)
>>> MCA notifier: command (MCA v2.0, API v1.0, Component v1.5.4)
>>> MCA notifier: smtp (MCA v2.0, API v1.0, Component v1.5.4)
>>> MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.5.4)
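
One way to localize a segfault like the one reported above, when no usable corefile is at hand, is a SIGSEGV handler built on glibc's execinfo (the same facility behind the backtrace MCA component in the listing). This is an illustrative sketch, not code from the thread; note that backtrace_symbols_fd is not strictly async-signal-safe, which is usually an acceptable trade-off in a crash handler:

    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    /* On SIGSEGV, write a raw backtrace to stderr, then re-raise the
     * signal with the default action so mpirun still reports signal 11. */
    static void segv_handler(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);

        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(void)
    {
        signal(SIGSEGV, segv_handler);
        /* In the real app, install the handler right after MPI_Init;
         * here we fault deliberately to show the output. */
        volatile int *p = 0;
        *p = 42;
        return 0;
    }

Compile with gcc -g -rdynamic so function names survive into the backtrace. Alternatively, with core dumps enabled (ulimit -c unlimited), the same information is available from gdb's bt command on the corefile, per the advice earlier in the thread.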