Daniel,
thanks for the logs.
Another workaround is to
mpirun --mca coll ^hcoll ...
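If you want to confirm hcoll is really excluded, bumping the coll framework
verbosity should show which components end up being selected; the level below is
just a guess at something reasonably chatty:
mpirun --mca coll ^hcoll --mca coll_base_verbose 10 ...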
I was able to reproduce the issue, and surprisingly it occurs only if
the coll_ml module is loaded *before* the hcoll module.
/* this is not the case on my system, so I had to hack my
mca_base_component_path i
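For reference, that kind of component-path override looks roughly like this;
the directories are placeholders and have to be adapted to the local install:
mpirun --mca mca_base_component_path /tmp/hacked_components:/opt/openmpi/lib/openmpi ...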
You should probably update to OMPI 1.8.6, as we spent some time in the 1.8
series refreshing the LSF support.
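It is also worth double checking that your build was configured with LSF
support in the first place; something along these lines should list the LSF
components if they are present:
ompi_info | grep -i lsf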
On Wed, Jun 24, 2015 at 3:04 PM, Rahul Pisharody wrote:
Hello all,
I'm trying to launch a job with OpenMPI using the LSF Scheduler.
However, when I execute the job, I get the following error:
ORTE_ERROR_LOG: The specified application failed to start in file
plm_lsf_module.c at line 305
lsb_launch failed: 0
I'm using OpenMPI 1.6.4
The LSF version
Running OpenMPI 1.8.4, one application running on 16 cores of a single node
takes over an hour, compared to just 7 minutes for MPICH. If I use
--mca btl vader,sm,self it runs in the same 7 minutes as MPICH. If I throw in
the tcp and openib btls it also runs quickly, so it seems to just not be
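For concreteness, the fast case was an invocation along these lines; the
process count matches the 16 cores above and the binary name is a placeholder:
mpirun -np 16 --mca btl vader,sm,self ./my_app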
Greetings, Open MPI users and system administrators.
In response to user feedback, Open MPI is changing how its releases will be
numbered.
In short, Open MPI will no longer be released using an "odd/even" cadence
corresponding to "feature development" and "super stable" releases. Instead,
each
I think trying with --mca btl ^sm makes a lot of sense and may solve the
problem. I also noted that we are having trouble with the topology of
several of the nodes - we see only one socket, non-HT, where you say we
should see two sockets, HT-enabled. In those cases, the locality is
"unknown" - gi
Bill,
Were you able to get a core file and analyze the stack with gdb?
I suspect the error occurs in mca_btl_sm_add_procs, but this is just my best
guess.
If this is correct, can you check the value of
mca_btl_sm_component.num_smp_procs?
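Something like this should print it straight from the core file, assuming the
sm btl was built with debug symbols; the binary and core file names below are
placeholders:
gdb -batch -ex bt -ex 'print mca_btl_sm_component.num_smp_procs' ./your_app core.12345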
As a workaround, can you try
mpirun --mca btl ^sm ...
I
Gilles,
All the blades only have two-core Xeons (without hyperthreading) populating
both their sockets. All the x3550 nodes have hyperthreading-capable Xeons and
Sandybridge server CPUs. It's possible hyperthreading has been disabled on some
of these nodes, though. The 3-0-n nodes are all IBM x
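If there is any doubt about which nodes actually have hyperthreading enabled,
hwloc's lstopo (assuming it is installed on the nodes) shows what each node
reports for sockets and hardware threads:
lstopo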
Gilles,
Attached are the two output logs.
Thanks,
Daniel
On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
Daniel,
I double checked this and I cannot make any sense of these logs.
If coll_ml_priority is zero, then I do not see any way
ml_coll_hier_barrier_setup can be invoked.
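As a sanity check, ompi_info should report the configured value; the grep is
just to trim the output:
ompi_info --all | grep coll_ml_priority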
Could you pl