[OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-01 Thread Lane, William
I'm having problems running OpenMPI jobs (using a hostfile) on an HPC cluster running ROCKS on CentOS 6.3. I'm running OpenMPI outside of Sun Grid Engine (i.e., the jobs are not submitted to SGE). The program being run is a LAPACK benchmark. The command line I'm using to run the jobs is:

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-01 Thread Ralph Castain
The warning about binding to memory is due to not having numactl-devel installed on the system. The job would still run, but we are warning you that we cannot bind memory to the same domain as the core where we bind the process. This can cause poor performance, but it is not fatal. I forget the name of the pa

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-01 Thread Lane, William
Ralph, Here's the associated hostfile:

#openMPI hostfile for csclprd3
#max slots prevents oversubscribing csclprd3-0-9
csclprd3-0-0 slots=12 max-slots=12
csclprd3-0-1 slots=6 max-slots=6
csclprd3-0-2 slots=6 max-slots=6
csclprd3-0-3 slots=6 max-slots=6
csclprd3-0-4 slots=6 max-slots=6
csclprd3-0-
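For readers who want to check how many slots a hostfile like this one advertises, here is a minimal Python sketch. It assumes the common "hostname key=value ..." layout with "#" comments, and it uses only the first three complete entries from the message above. The function name parse_hostfile is hypothetical and is not part of Open MPI.

```python
def parse_hostfile(text):
    """Parse hostfile text into a list of (hostname, options) pairs.

    Assumption: each non-comment line is 'hostname key=value ...'
    and every value is an integer (e.g. slots=6 max-slots=6).
    """
    hosts = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *fields = line.split()
        opts = {}
        for field in fields:
            key, _, value = field.partition("=")
            opts[key] = int(value)
        hosts.append((name, opts))
    return hosts


# Only the complete entries quoted in the message; the rest is truncated.
sample = """\
#openMPI hostfile for csclprd3
csclprd3-0-0 slots=12 max-slots=12
csclprd3-0-1 slots=6 max-slots=6
csclprd3-0-2 slots=6 max-slots=6
"""

hosts = parse_hostfile(sample)
total_slots = sum(opts.get("slots", 0) for _, opts in hosts)
print(total_slots)
```

Hosts listed without a slots value contribute 0 to this sum; how Open MPI itself treats such entries is not covered by this sketch.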

Re: [OMPI users] OpenMPI 1.8.2 problems on CentOS 6.3

2015-04-01 Thread Ralph Castain
Bingo - you said the magic word. This is a terminology issue. When we say "core", we mean the old definition of "core", not "hyperthreads". If you want to use HTs as your base processing unit and bind to them, then you need to specify --bind-to hwthread. That warning should then go away. We don't
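As an illustration of the advice above, the following Python sketch assembles an mpirun command line that binds to hardware threads. The hostfile name, process count, and program path are hypothetical placeholders, since the original command line is not shown in the thread. The --report-bindings flag is added here as an assumption that printing the resulting bindings would be useful; it is not something the thread itself recommends.

```python
import shlex


def build_mpirun_command(hostfile, nprocs, program, bind_to="hwthread"):
    """Return an argv list for mpirun with an explicit binding policy.

    All argument values are supplied by the caller; nothing here is
    taken from the original poster's (unshown) command line.
    """
    return [
        "mpirun",
        "--hostfile", hostfile,
        "-np", str(nprocs),
        "--bind-to", bind_to,      # "hwthread" treats hyperthreads as the binding unit
        "--report-bindings",       # assumption: print bindings for inspection
        program,
    ]


# Hypothetical values for illustration only.
cmd = build_mpirun_command("hostfile.txt", 24, "./lapack_benchmark")
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to subprocess.run without shell quoting concerns.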

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> 2. Unable to resolve: can you be more specific on this?
This was my mistake. I used "xxx.yyy.zzz" instead of "localhost" in the startup options for orterun. (More precisely the GUI did it, but I knew that code.) No idea how 1.6.5 managed to get around the fact that not even "dig xxx.yyy.zzz" can

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Ralph Castain
I know 1.8.4 is better than 1.6.5 in some regards, but I obviously can't say if we fixed the specific bug you're referring to in your software. As you know, thread bugs are really hard to nail down. That event_base_loop warning could be flagging a known problem in the openib module during inter-pr

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Thomas Klimpel
> You might double-check by running with "--mca btl ^openib" to see if that is the source of the warning
The warning appears always, independent of the interconnect, and even when running with "--mca btl ^openib".
> Does it only crash when you pause it? Or does it crash while normally running?

Re: [OMPI users] 1.8.4 behaves completely different from 1.6.5

2015-04-01 Thread Ralph Castain
Would it be possible to get a backtrace from one of the crashes? It would be especially helpful if you can add --enable-debug to the OMPI config.

On Wed, Apr 1, 2015 at 1:09 PM, Thomas Klimpel wrote:
> > You might double-check by running with "--mca btl ^openib" to see if
> that is the source o