Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread tmishima
Hi Ralph, I can't operate our cluster for a few days, sorry. But now, I'm narrowing down the cause by browsing the source code. My best guess is line 529: opal_hwloc_base_get_obj_by_type will reset the object pointer to the first one when you move on to the next node. 529

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread Ralph Castain
I'm having trouble seeing why it is failing, so I added some more debug output. Could you run the failure case again with -mca rmaps_base_verbose 10? Thanks Ralph On Feb 27, 2014, at 6:11 PM, tmish...@jcity.maeda.co.jp wrote: > > > Just checking the difference, not so significant meaning... >

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread tmishima
Just checking the difference, not so significant meaning... Anyway, I guess it's due to the behavior when the slot count is missing (regarded as slots=1) and it's oversubscribed unintentionally. I'm going out now, so I can't verify it quickly. If I provide the correct slot counts, it will work, I g
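
The guess above can be illustrated with a small sketch. It assumes, as the message suggests, that a host listed without a slot count is treated as slots=1; the function name and dictionaries are hypothetical, and this is not Open MPI's actual mapping code.

```python
def is_oversubscribed(num_procs, slots_per_host):
    """Return True when more ranks are requested than slots are known."""
    return num_procs > sum(slots_per_host.values())

# Hosts passed via -host without slot counts: assumed to count as 1 slot each.
default_slots = {"node05": 1, "node06": 1}

# Slot counts matching the 8 cores per node reported in this thread.
explicit_slots = {"node05": 8, "node06": 8}

print(is_oversubscribed(8, default_slots))   # True: 8 ranks, only 2 slots
print(is_oversubscribed(8, explicit_slots))  # False: 8 ranks, 16 slots
```

If this hypothesis holds, supplying explicit slot counts would avoid the unintended oversubscription.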

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Ralph Castain
On Feb 27, 2014, at 4:39 PM, Gus Correa wrote: > Thank you, Ralph! > > I did a bit more of homework, and found out that all jobs that had > the hwloc error involved one specific node (node14). > > The "report bindings" output in those jobs' stderr show > that node14 systematically failed to bi

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread Ralph Castain
"restore" in what sense? On Feb 27, 2014, at 4:10 PM, tmish...@jcity.maeda.co.jp wrote: > > > Hi Ralph, this is just for your information. > > I tried to restore previous orte_rmaps_rr_byobj. Then I gets the result > below with this command line: > > mpirun -np 8 -host node05,node06 -report-b

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Gus Correa
Thank you, Ralph! I did a bit more of homework, and found out that all jobs that had the hwloc error involved one specific node (node14). The "report bindings" output in those jobs' stderr show that node14 systematically failed to bind the processes to the cores, while other nodes on the same jo

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread tmishima
Hi Ralph, this is just for your information. I tried to restore the previous orte_rmaps_rr_byobj. Then I get the result below with this command line: mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog Data

Re: [OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Ralph Castain
The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having trouble with those data/instruction cache breakdowns. I don't know why it wouldn't have shown up before, however, as this looks to be happening when we first try to assemble the topology. To check that, what happens if you ju

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread tmishima
Each node has 2 sockets with 4 cores per socket, for a total of 4 x 2 = 8 cores. Here is the output of lstopo. [mishima@manage round_robin]$ rsh node05 Last login: Tue Feb 18 15:10:15 from manage [mishima@node05 ~]$ lstopo Machine (32GB) NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB) L2 L#0

Re: [OMPI users] new map-by-obj has a problem

2014-02-27 Thread Ralph Castain
Hmmm..what does your node look like again (sockets and cores)? On Feb 27, 2014, at 3:19 PM, tmish...@jcity.maeda.co.jp wrote: > > Hi Ralph, I'm afraid to say your new "map-by obj" causes another problem. > > I have overload message with this command line as shown below: > > mpirun -np 8 -host

[OMPI users] new map-by-obj has a problem

2014-02-27 Thread tmishima
Hi Ralph, I'm afraid to say your new "map-by obj" causes another problem. I get an overload message with this command line, as shown below: mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog
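
As a rough sanity check on the overload message, the arithmetic behind this command line can be sketched as follows. This is only an illustration: the function names are hypothetical, the per-node layout (2 sockets with 4 cores each) is the one reported elsewhere in this thread, and the sketch assumes each rank needs exactly `pe` free cores. It does not model Open MPI's actual mapper.

```python
def cores_needed(num_procs: int, pe: int) -> int:
    """Cores required when every rank is given `pe` processing elements."""
    return num_procs * pe

def cores_available(num_nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Total physical cores across identical nodes."""
    return num_nodes * sockets_per_node * cores_per_socket

needed = cores_needed(num_procs=8, pe=2)   # -np 8 with -map-by socket:pe=2
available = cores_available(2, 2, 4)       # node05 and node06, 2 sockets x 4 cores

print(needed, available, needed <= available)  # 16 16 True
```

Under these assumptions the job fits the two nodes exactly, which is why an overload message is unexpected here.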

[OMPI users] hwloc error in topology.c in OMPI 1.6.5

2014-02-27 Thread Gus Correa
Dear OMPI pros This seems to be a question in the nowhere land between OMPI and hwloc. However, it appeared as an OMPI error, hence it may be OK to ask the question in this list. *** A user here got this error (or warning?) message today: + mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/ec

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread George Bosilca
The C/R framework is generic, so once the CRIU support is working I expect all previous C/R modules (BLCR and user-level) to be fully functional again. George. On Feb 27, 2014, at 19:15 , Ralph Castain wrote: > It is being restored, using the new CRIU support in the latest Linux kernel > >

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread George Bosilca
You are right, the support stopped with 1.6 so there is no support for C/R in 1.7. However, an effort to reinstate the C/R support in the trunk (and potentially in the 1.9) is ongoing. George. On Feb 27, 2014, at 19:11 , Maxime Boissonneault wrote: > I heard that c/r support in OpenMPI was

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread Ralph Castain
It is being restored, using the new CRIU support in the latest Linux kernel On Feb 27, 2014, at 10:11 AM, Maxime Boissonneault wrote: > I heard that c/r support in OpenMPI was being dropped after version 1.6.x. Is > this not still the case ? > > Maxime Boissonneault > > Le 2014-02-27 13:09,

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread Maxime Boissonneault
I heard that c/r support in OpenMPI was being dropped after version 1.6.x. Is this not still the case ? Maxime Boissonneault Le 2014-02-27 13:09, George Bosilca a écrit : Both were supported at some point. I'm not sure if any is still in a workable state in the trunk today. However, there is a

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread George Bosilca
Both were supported at some point. I’m not sure if any is still in a workable state in the trunk today. However, there is an ongoing effort to reinstate the coordinated approach. George. On Feb 27, 2014, at 18:50 , basma a.azeem wrote: > i have a question about the checkpoint/restart facili

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread Ralph Castain
coordinated - all the procs do it collectively On Feb 27, 2014, at 9:50 AM, basma a.azeem wrote: > i have a question about the checkpoint/restart facility of BLCR with OPEN MPI > , does the checkpoint/restart solution as a whole can be considered as a > coordinated or uncoordinated approach >

[OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread basma a . azeem
I have a question about the checkpoint/restart facility of BLCR with Open MPI: can the checkpoint/restart solution as a whole be considered a coordinated or an uncoordinated approach?

[OMPI users] OpenIB Cannot Allocate Memory error

2014-02-27 Thread Brock Palen
I have some users who are reporting errors with OpenIB on Mellanox gear. It tends to affect larger jobs (64 - 256 cores); it is not reliably reproducible, but happens with regularity. Example error below: The nodes have 64GB of memory and the IB driver is set with: options mlx4_core pfctx=0 pfcrx=0 log_num

Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-02-27 Thread Edgar Gabriel
On 2/27/2014 9:44 AM, Dave Love wrote: > Edgar Gabriel writes: > >> so we had ROMIO working with PVFS2 (not OrangeFS, which is however >> registered as PVFS2 internally). We have one cluster which uses >> OrangeFS, on that machine however we used OMPIO, not ROMIO. > > [What's OMPIO, and should w

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Dave Love
[I don't know what thread this is without References: or citation.] Bernd Dammann writes: > Hi, > > I found this thread from before Christmas, and I wondered what the > status of this problem is. We experience the same problems since our > upgrade to Scientific Linux 6.4, kernel 2.6.32-431.1.2.

Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-02-27 Thread Dave Love
Edgar Gabriel writes: > so we had ROMIO working with PVFS2 (not OrangeFS, which is however > registered as PVFS2 internally). We have one cluster which uses > OrangeFS, on that machine however we used OMPIO, not ROMIO. [What's OMPIO, and should we want it?] This is another vote for working 1.6.

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread John Hearns
Noam, cpusets are a very good idea, not only for CPU binding but also for isolating 'badly behaved' applications. If an application starts using huge amounts of memory, kill it, collapse the cpuset, and it is gone: a nice clean way to manage jobs.

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Ralph Castain
On Feb 27, 2014, at 5:06 AM, Noam Bernstein wrote: > On Feb 27, 2014, at 2:36 AM, Patrick Begou > wrote: > >> Bernd Dammann wrote: >>> Using the workaround '--bind-to-core' does only make sense for those jobs, >>> that allocate full nodes, but the majority of our jobs don't do that. >> Why ?

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Noam Bernstein
On Feb 27, 2014, at 2:36 AM, Patrick Begou wrote: > Bernd Dammann wrote: >> Using the workaround '--bind-to-core' does only make sense for those jobs, >> that allocate full nodes, but the majority of our jobs don't do that. > Why ? > We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenF

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2014-02-27 Thread Patrick Begou
Bernd Dammann wrote: Using the workaround '--bind-to-core' does only make sense for those jobs, that allocate full nodes, but the majority of our jobs don't do that. Why ? We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenFOAM and other applications to attach each process on its core

Re: [OMPI users] Binding to Core Warning

2014-02-27 Thread Saliya Ekanayake
Thank you. Anyway, your email contains a good amount of info. Saliya On Wed, Feb 26, 2014 at 7:48 PM, Ralph Castain wrote: > I did one "chapter" of it on Jeff's blog and probably should complete it. > Definitely need to update the FAQ for the new options. > > Sadly, outside of that and the mpiru