On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
Thanks for your support! Nope, no core, just the "orte has lost"...
Dear list - the problem is _not_ related to Open MPI. I compiled MVAPICH2 and I
get communication errors, too. This is probably a hardware problem.
Sor
Even with just 100 nodes, about half of the jobs fail (50/50). Failing jobs
produce _no output_ and _no core dump_, only the "orte has lost..." message.
Running on >=350 nodes: almost all jobs are failing, but some jobs succeeded
(similar output: only "orte has lost..." for failing jobs and the expected
outp
MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelber
at
some later point.
Any hints? PSM? Does some kernel limit need to be increased? Wrong network/routing
(which should not happen with --mca oob_tcp_if_include eth0)?
MfG/Sincerely
Stefan Friedel
--
IWR * 4.317 * INF205 * 69120 Heidelberg
T +49 6221 5414404 * F +49 6221 5414427
stefan.frie...@iwr.uni-hei
are these references inside the *.la files?
Thanks for any hints.
MfG/Sincerely,
Stefan Friedel
--
IWR * 523 * INF 368 * 69120 Heidelberg
T +49 6221 548240 * F +49 6221 545224
stefan.frie...@iwr.uni-heidelberg.de
://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
As I wrote: I'm aware of these FAQ entries, but you can't set the log_num_mtt
parameter if you're using a Debian/vanilla kernel: the mlx4_core module
does not offer this parameter.
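
For illustration, here is a minimal Python sketch of how one might check
whether a given mlx4_core build exposes the parameter at all, and what the
FAQ's formula (max registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg *
page size) would yield. The helper names read_module_param and
max_registerable_bytes are hypothetical, and the sketch assumes the usual
/sys/module/<module>/parameters/<name> sysfs layout and a 4096-byte page size.

```python
from pathlib import Path


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return a kernel module parameter as an int, or None if it is not exposed.

    Assumes the standard sysfs layout <sysfs_root>/<module>/parameters/<name>.
    """
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Missing file (parameter not offered by this module build),
        # unreadable file, or non-numeric content.
        return None


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registerable memory per the FAQ formula:
    2^log_num_mtt * 2^log_mtts_per_seg * page_size.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    num_mtt = read_module_param("mlx4_core", "log_num_mtt")
    per_seg = read_module_param("mlx4_core", "log_mtts_per_seg")
    if num_mtt is None or per_seg is None:
        print("mlx4_core does not expose log_num_mtt/log_mtts_per_seg here")
    else:
        gib = max_registerable_bytes(num_mtt, per_seg) / 2**30
        print(f"max registerable memory: {gib:.1f} GiB")
```

On a kernel whose mlx4_core module lacks the parameter, the first branch is
what one would expect to see; the arithmetic helper can still be used on its
own to size the values on a system where the parameter is available.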
MfG/Sincerely,
Stefan Friedel
--
IWR
hint?
MfG/Sincerely,
Stefan Friedel
--
IWR * 523 * INF 368 * 69120 Heidelberg
T +49 6221 548240 * F +49 6221 545224
stefan.frie...@iwr.uni-heidelberg.de