Hi Ralph, I can't operate our cluster for a few days, sorry.
But now, I'm narrowing down the cause by browsing the source code.
My best guess is line 529: opal_hwloc_base_get_obj_by_type resets the
object pointer to the first one when you move on to the next node.
I'm having trouble seeing why it is failing, so I added some more debug output.
Could you run the failure case again with -mca rmaps_base_verbose 10?
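For example, reusing the command line from your earlier mail (that's the case
I'm assuming here), something like:

mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2 -display-map -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog

and send along the full output.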
Thanks
Ralph
On Feb 27, 2014, at 6:11 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
> Just checking the difference, nothing significant...
>
Just checking the difference, nothing significant...
Anyway, I guess it's due to the behavior when the slot count is missing
(it is then treated as slots=1) and the node gets oversubscribed unintentionally.
I'm going out now, so I can't verify it quickly. If I provide the
correct slot counts, it will work, I guess.
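By "correct slot counts" I mean something like this hostfile (just an example
for our 8-core nodes; the file name is arbitrary):

node05 slots=8
node06 slots=8

mpirun -np 8 -hostfile myhosts -report-bindings -map-by socket:pe=2 -display-map ~/mis/openmpi/demos/myprog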
On Feb 27, 2014, at 4:39 PM, Gus Correa wrote:
> Thank you, Ralph!
>
> I did a bit more homework and found out that all the jobs that had
> the hwloc error involved one specific node (node14).
>
> The "report bindings" output in those jobs' stderr shows
> that node14 systematically failed to bind the processes to the cores,
"restore" in what sense?
On Feb 27, 2014, at 4:10 PM, tmish...@jcity.maeda.co.jp wrote:
>
>
> Hi Ralph, this is just for your information.
>
> I tried to restore the previous orte_rmaps_rr_byobj. Then I get the result
> below with this command line:
>
> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
Thank you, Ralph!
I did a bit more homework and found out that all the jobs that had
the hwloc error involved one specific node (node14).
The "report bindings" output in those jobs' stderr shows
that node14 systematically failed to bind the processes to the cores,
while other nodes on the same job
Hi Ralph, this is just for your information.
I tried to restore the previous orte_rmaps_rr_byobj. Then I get the result
below with this command line:
mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
-display-map -bind-to core:overload-allowed ~/mis/openmpi/demos/myprog
Data
The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having trouble
with those data/instruction cache breakdowns. I don't know why it wouldn't have
shown up before, however, as this looks to be happening when we first try to
assemble the topology. To check that, what happens if you ju
They have 4 cores per socket and 2 sockets, so 4 x 2 = 8 cores each.
Here is the output of lstopo.
[mishima@manage round_robin]$ rsh node05
Last login: Tue Feb 18 15:10:15 from manage
[mishima@node05 ~]$ lstopo
Machine (32GB)
NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (6144KB)
L2 L#0
Hmmm..what does your node look like again (sockets and cores)?
On Feb 27, 2014, at 3:19 PM, tmish...@jcity.maeda.co.jp wrote:
>
> Hi Ralph, I'm afraid your new "map-by obj" causes another problem.
>
> I get an overload message with the command line shown below:
>
> mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
Hi Ralph, I'm afraid your new "map-by obj" causes another problem.
I get an overload message with the command line shown below:
mpirun -np 8 -host node05,node06 -report-bindings -map-by socket:pe=2
-display-map ~/mis/openmpi/demos/myprog
Dear OMPI pros
This seems to be a question in the nowhere land between OMPI and hwloc.
However, it appeared as an OMPI error, hence it may be OK to ask the
question in this list.
***
A user here got this error (or warning?) message today:
+ mpiexec -np 64 $HOME/echam-aiv_ldeo_6.1.00p1/bin/ec
The C/R framework is generic, so once the CRIU support is working I expect all
previous C/R modules (BLCR and user-level) to be fully functional again.
George.
On Feb 27, 2014, at 19:15 , Ralph Castain wrote:
> It is being restored, using the new CRIU support in the latest Linux kernel
>
>
You are right: the support stopped with 1.6, so there is no C/R support in
1.7. However, an effort to reinstate the C/R support in the trunk (and
potentially in 1.9) is ongoing.
George.
On Feb 27, 2014, at 19:11 , Maxime Boissonneault
wrote:
> I heard that c/r support in OpenMPI was
It is being restored, using the new CRIU support in the latest Linux kernel
On Feb 27, 2014, at 10:11 AM, Maxime Boissonneault
wrote:
> I heard that c/r support in OpenMPI was being dropped after version 1.6.x. Is
> this not still the case ?
>
> Maxime Boissonneault
>
> On 2014-02-27 13:09,
I heard that c/r support in OpenMPI was being dropped after version
1.6.x. Is this not still the case ?
Maxime Boissonneault
On 2014-02-27 13:09, George Bosilca wrote:
Both were supported at some point. I'm not sure if either is still in a
workable state in the trunk today. However, there is an ongoing effort to
reinstate the coordinated approach.
Both were supported at some point. I'm not sure if either is still in a workable
state in the trunk today. However, there is an ongoing effort to reinstate the
coordinated approach.
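A quick way to check whether a given build still carries any C/R components is
something like this (what it prints depends on how the build was configured):

ompi_info | grep -i crs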
George.
On Feb 27, 2014, at 18:50 , basma a.azeem wrote:
> I have a question about the checkpoint/restart facility of BLCR with Open MPI
coordinated - all the procs do it collectively
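With the BLCR support in the 1.6 series, the coordinated checkpoint was driven
from the mpirun side, roughly like this (a sketch - the PID and snapshot name
are placeholders):

mpirun -np 4 -am ft-enable-cr ./my_app
ompi-checkpoint <pid_of_mpirun>
ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt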
On Feb 27, 2014, at 9:50 AM, basma a.azeem wrote:
> I have a question about the checkpoint/restart facility of BLCR with Open MPI:
> can the checkpoint/restart solution as a whole be considered a coordinated
> or an uncoordinated approach?
>
I have a question about the checkpoint/restart facility of BLCR with Open MPI:
can the checkpoint/restart solution as a whole be considered a coordinated or
an uncoordinated approach?
I have some users reporting errors with OpenIB on Mellanox gear. It tends to
affect larger jobs (64 - 256 cores); it is not consistently reproducible, but
it happens with regularity. Example error below:
The nodes have 64GB of memory and the IB driver is set with:
options mlx4_core pfctx=0 pfcrx=0 log_num
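For reference, the memory mlx4 can register works out to roughly
(2^log_num_mtt) x (2^log_mtts_per_seg) x page_size, and the values actually in
effect can be read back with (assuming the module parameters are exposed in sysfs):

cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg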
On 2/27/2014 9:44 AM, Dave Love wrote:
> Edgar Gabriel writes:
>
>> so we had ROMIO working with PVFS2 (not OrangeFS, which is however
>> registered as PVFS2 internally). We have one cluster which uses
>> OrangeFS, on that machine however we used OMPIO, not ROMIO.
>
> [What's OMPIO, and should we want it?]
[I don't know what thread this is without References: or citation.]
Bernd Dammann writes:
> Hi,
>
> I found this thread from before Christmas, and I wondered what the
> status of this problem is. We have been experiencing the same problems since
> our upgrade to Scientific Linux 6.4, kernel 2.6.32-431.1.2.
Edgar Gabriel writes:
> so we had ROMIO working with PVFS2 (not OrangeFS, which is however
> registered as PVFS2 internally). We have one cluster which uses
> OrangeFS, on that machine however we used OMPIO, not ROMIO.
[What's OMPIO, and should we want it?]
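[I gather that, if a build includes the component at all, it can at least be
tried explicitly - e.g. mpirun --mca io ompio ... - and ompi_info should list
which io components are available; that's an assumption on my part, not
something I've tested.]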
This is another vote for a working 1.6.
Noam, cpusets are a very good idea.
Not only for CPU binding but for isolating badly behaved applications.
If an application starts using huge amounts of memory - kill it, collapse
the cpuset and it is gone - a nice clean way to manage jobs.
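A bare-bones cpuset through the cgroup filesystem looks something like this
(paths assume a cgroup-v1 cpuset mount; "job42" is just a placeholder name):

mkdir /sys/fs/cgroup/cpuset/job42
echo 0-7 > /sys/fs/cgroup/cpuset/job42/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/job42/cpuset.mems
echo <pid> > /sys/fs/cgroup/cpuset/job42/tasks

Kill the tasks, rmdir the directory, and the cpuset is gone.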
On Feb 27, 2014, at 5:06 AM, Noam Bernstein wrote:
> On Feb 27, 2014, at 2:36 AM, Patrick Begou
> wrote:
>
>> Bernd Dammann wrote:
>>> Using the workaround '--bind-to-core' only makes sense for those jobs
>>> that allocate full nodes, but the majority of our jobs don't do that.
>> Why ?
On Feb 27, 2014, at 2:36 AM, Patrick Begou
wrote:
> Bernd Dammann wrote:
>> Using the workaround '--bind-to-core' only makes sense for those jobs
>> that allocate full nodes, but the majority of our jobs don't do that.
> Why ?
> We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenFOAM and other
Bernd Dammann wrote:
Using the workaround '--bind-to-core' only makes sense for those jobs
that allocate full nodes, but the majority of our jobs don't do that.
Why ?
We still use this option in OpenMPI (1.6.x, 1.7.x) with OpenFOAM and other
applications to pin each process to its core.
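For example ("./solver" is just a stand-in for the actual application):

mpirun -np 16 --bind-to-core --bycore ./solver        (1.6.x syntax)
mpirun -np 16 --bind-to core --map-by core ./solver   (1.7.x syntax)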
Thank you. Anyway, your email contains a good amount of info.
Saliya
On Wed, Feb 26, 2014 at 7:48 PM, Ralph Castain wrote:
> I did one "chapter" of it on Jeff's blog and probably should complete it.
> Definitely need to update the FAQ for the new options.
>
> Sadly, outside of that and the mpiru