Re: [OMPI users] new map-by-obj has a problem

2014-02-28 Thread tmishima
Hi Ralph, I understood what you meant. I often use float for our application. float c = (float)(unsigned int a - unsigned int b) could be a very huge number if a < b. So I always carefully cast to int from unsigned int when I subtract them. I didn't know/mind int d = (unsigned int a - unsigned in
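A minimal C sketch of the effect described in this message. The function names `diff_naive` and `diff_signed` are hypothetical and do not come from the Open MPI source. The second function widens to `long long` before subtracting, which is a variant of the "cast to int first" habit mentioned above, not the poster's exact code.

```c
/* Naive version: the subtraction is done in unsigned arithmetic, so it
 * wraps around to a large positive value whenever a < b, and that
 * large value is what gets converted to float. */
float diff_naive(unsigned int a, unsigned int b)
{
    return (float)(a - b);
}

/* Variant (an assumption, not the poster's code): convert each operand
 * to a wider signed type before subtracting, so a < b yields a
 * negative result instead of a wrapped one. */
float diff_signed(unsigned int a, unsigned int b)
{
    return (float)((long long)a - (long long)b);
}
```

With a = 1 and b = 2, `diff_naive` returns a large positive value while `diff_signed` returns -1.0f.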

Re: [OMPI users] new map-by-obj has a problem

2014-02-28 Thread tmishima
Yes, indeed. In the future, when we have many, many cores in a machine, we will have to take care of overrun of num_procs. Tetsuya > Cool - easily modified. Thanks! > > Of course, you understand (I'm sure) that the cast does nothing to protect the code from blowing up if we overrun the var. I

Re: [OMPI users] new map-by-obj has a problem

2014-02-28 Thread Ralph Castain
Cool - easily modified. Thanks! Of course, you understand (I'm sure) that the cast does nothing to protect the code from blowing up if we overrun the var. In other words, if the unsigned var has wrapped, then casting it to int won't help - you'll still get a negative integer, and the code will
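Since a cast cannot undo a wrap that has already happened, one common alternative is to detect the overflow before it occurs. The following is a sketch under that assumption only. `add_checked` is a hypothetical helper, not a function from the Open MPI code base.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: add `extra` to an unsigned counter.
 * Returns false and leaves *out untouched if the sum would exceed
 * UINT_MAX, instead of silently wrapping around. */
bool add_checked(unsigned int count, unsigned int extra, unsigned int *out)
{
    if (extra > UINT_MAX - count) {
        return false;   /* the addition would wrap */
    }
    *out = count + extra;
    return true;
}
```

A caller would test the return value and report an error on overflow, so that no later cast ever sees a wrapped value.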

Re: [OMPI users] new map-by-obj has a problem

2014-02-28 Thread tmishima
Hi Ralph, I'm a little bit late to your release. I found a minor mistake in byobj_span: an integer casting problem. --- rmaps_rr_mappers.30892.c 2014-03-01 08:31:50 +0900 +++ rmaps_rr_mappers.c 2014-03-01 08:33:22 +0900 @@ -689,7 +689,7 @@ } /* compute how many objs need an extra pro

Re: [OMPI users] OpenMPI job initializing problem

2014-02-28 Thread Gus Correa
Hi Beichuan, To add to what Ralph said, the RHEL OpenMPI package probably wasn't built with PBS Pro support either. Besides, OMPI 1.5.4 (RHEL version) is old. ** You will save yourself time and grief if you read the installation FAQs before you install from the source tarball: http://www.

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Le 28/02/2014 21:30, Gus Correa a écrit : > Hi Brice > > The (pdf) output of lstopo shows one L1d (16k) for each core, > and one L1i (64k) for each *pair* of cores. > Is this wrong? It's correct. AMD uses this "dual-core compute unit" where L2 and L1i are shared but L1d isn't. > BTW, if there are

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Reuti
Am 28.02.2014 um 21:23 schrieb Brice Goglin: > OK, the problem is that node14's BIOS reports invalid NUMA info. It properly > detects 2 sockets with 16-cores each. But it reports 2 NUMA nodes total, > instead of 2 per socket (4 total). And hwloc warns because the cores > contained in these NUMA

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Gus Correa
On 02/28/2014 03:32 AM, Brice Goglin wrote: Le 28/02/2014 02:48, Ralph Castain a écrit : Remember, hwloc doesn't actually "sense" hardware - it just parses files in the /proc area. So if something is garbled in those files, hwloc will report errors. Doesn't mean anything is wrong with the hard

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
OK, the problem is that node14's BIOS reports invalid NUMA info. It properly detects 2 sockets with 16 cores each. But it reports 2 NUMA nodes total, instead of 2 per socket (4 total). And hwloc warns because the cores contained in these NUMA nodes are incompatible with sockets: socket0 contains 0-
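One plausible reading of "incompatible" here is that a NUMA node's core set partially overlaps a socket's core set, so neither can be placed inside the other in a tree-shaped topology. The C sketch below illustrates that check on plain bitmasks. `cpusets_nest` is a hypothetical helper, not part of the hwloc API, and any masks passed to it are illustrative values, not node14's actual core lists.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: two CPU sets (one bit per core) fit into a tree
 * if they are disjoint or if one fully contains the other. A partial
 * overlap is the "incompatible" case. */
bool cpusets_nest(uint64_t a, uint64_t b)
{
    uint64_t common = a & b;
    return common == 0 || common == a || common == b;
}
```

For example, masks 0x0F and 0xFF nest (the first is inside the second), while 0x3C and 0x0F do not, because they share cores 2-3 without either containing the other.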

Re: [OMPI users] OrangeFS ROMIO support

2014-02-28 Thread Ralph Castain
Did you see the note I forwarded to you about SLES issues? Not sure if that is on your side or ours On Feb 28, 2014, at 12:05 PM, Latham, Robert J. wrote: > On Wed, 2014-02-26 at 17:14 -0600, Edgar Gabriel wrote: >> that was my fault, I did not follow up the time, got probably side >> tracked b

Re: [OMPI users] OrangeFS ROMIO support

2014-02-28 Thread Latham, Robert J.
On Wed, 2014-02-26 at 17:14 -0600, Edgar Gabriel wrote: > that was my fault, I did not follow up the time, got probably side > tracked by something. Anyway, I suspect that you actually have the > patch, otherwise the current Open MPI trunk and the 1.7 release series > would not have the patch after

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Ralph Castain
You might also want to check the BIOS rev level on node14, Gus - as Brice suggested, it could be that the board came with the wrong firmware. On Feb 28, 2014, at 11:55 AM, Gus Correa wrote: > Hi Brice and Ralph > > Many thanks for helping out with this! > > Yes, you are right about node15 bei

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Gus Correa
Hi Brice and Ralph, Many thanks for helping out with this! Yes, you are right about node15 being OK. Node15 was a red herring, as along with node14 it was part of the same job that failed. However, after a closer look, I noticed that the failure reported by hwloc was indeed in node14. I attach both

Re: [OMPI users] OpenMPI job initializing problem

2014-02-28 Thread Ralph Castain
Almost certainly, the redhat package wasn't built with matching infiniband support and so we aren't picking it up. I'd suggest downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the latest 1.6 tarball if you want the stable release, and build it yourself so you *know* it was built fo

[OMPI users] OpenMPI job initializing problem

2014-02-28 Thread Beichuan Yan
Hi there, I am running jobs on clusters with InfiniBand connections. They installed OpenMPI v1.5.4 via the Red Hat 6 yum package. My problem is that although my jobs get queued and started by PBS Pro quickly, most of the time they don't really run (occasionally they really run) and give error info

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Ralph Castain
On Feb 28, 2014, at 12:32 AM, Brice Goglin wrote: > Le 28/02/2014 02:48, Ralph Castain a écrit : >> Remember, hwloc doesn't actually "sense" hardware - it just parses files in >> the /proc area. So if something is garbled in those files, hwloc will report >> errors. Doesn't mean anything is wr

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Le 28/02/2014 02:48, Ralph Castain a écrit : > Remember, hwloc doesn't actually "sense" hardware - it just parses files in > the /proc area. So if something is garbled in those files, hwloc will report > errors. Doesn't mean anything is wrong with the hardware at all. For the record, that's not

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-28 Thread Brice Goglin
Hello Gus, I'll need the tarball generated by gather-topology on node14 to debug this. node15 doesn't have any issue. We've seen issues on AMD machines because of buggy BIOS reporting incompatible Socket and NUMA info. If node14 doesn't have the same BIOS version as other nodes, that could explain

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-28 Thread Bernd Dammann
On 2/27/14 14:06 PM, Noam Bernstein wrote: On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote: Bernd Dammann wrote: Using the workaround '--bind-to-core' only makes sense for those jobs that allocate full nodes, but the majority of our jobs don't do that. Why? We still use this option

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-28 Thread Bernd Dammann
On 2/27/14 16:47 PM, Dave Love wrote: Bernd Dammann writes: Hi, I found this thread from before Christmas, and I wondered what the status of this problem is. We have been experiencing the same problems since our upgrade to Scientific Linux 6.4, kernel 2.6.32-431.1.2.el6.x86_64, and OpenMPI 1.6.5. Users

Re: [OMPI users] new map-by-obj has a problem

2014-02-28 Thread Ralph Castain
Please take a look at https://svn.open-mpi.org/trac/ompi/ticket/4317 On Feb 27, 2014, at 8:13 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, I can't operate our cluster for a few days, sorry. > > But now, I'm narrowing down the cause by browsing the source code. > > My best guess is