Open MPI 1.10.2
cuda.h and cuda_runtime_api.h exist in /usr/local/cuda-6.5/include.
Using the configure option ./configure --with-cuda does not find cuda.h or
cuda_runtime_api.h.
Using the configure option ./configure --with-cuda=/usr/local/cuda-6.5 does
not find cuda.h or cuda_runtime_api.h e
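A minimal way to narrow this down, sketched on the assumption that the headers really live under /usr/local/cuda-6.5/include and that configure leaves a config.log in the build directory:

  # confirm the headers are where configure is being pointed
  ls /usr/local/cuda-6.5/include/cuda.h /usr/local/cuda-6.5/include/cuda_runtime_api.h
  # re-run configure, then check which compile test failed and why
  ./configure --with-cuda=/usr/local/cuda-6.5 2>&1 | tee configure.out
  grep -n cuda.h config.log configure.out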
Hi Matt,
On 01/15/2016 03:53 PM, Matt Thompson wrote:
There is a chance in the future I might want/need to query an environment
variable in a Fortran program, namely to figure out what switch a currently
running process is on (via SLURM_TOPOLOGY_ADDR in my case) and perhaps make a
"per-switch" c
On Thu, Jan 21, 2016 at 4:07 AM, Dave Love wrote:
>
> Jeff Hammond writes:
>
> > Just using Intel compilers, OpenMP and MPI. Problem solved :-)
> >
> > (I work for Intel and the previous statement should be interpreted as a
> > joke,
>
> Good!
>
> > although Intel OpenMP and MPI interoperate as
On Jan 21, 2016, at 7:40 AM, Eva wrote:
>
> Thanks Jeff.
>
> >>1. Can you create a small example to reproduce the problem?
>
> >>2. The TCP and verbs-based transports use different thresholds and
> >>protocols, and can sometimes bring to light errors in the application
> >>(e.g., the applica
Thanks Jeff.
>>1. Can you create a small example to reproduce the problem?
>>2. The TCP and verbs-based transports use different thresholds and
protocols, and can sometimes bring to light errors in the application
(e.g., the application is making assumptions that just happen to be true
for TCP, b
Matt Thompson writes:
> All,
>
> I'm not too sure if this is an MPI issue, a Fortran issue, or something
> else but I thought I'd ask the MPI gurus here first since my web search
> failed me.
>
> There is a chance in the future I might want/need to query an environment
> variable in a Fortran pro
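A quick way to see what that variable actually holds on each task, sketched on the assumption that the job runs under Slurm with the topology/tree plugin configured (otherwise SLURM_TOPOLOGY_ADDR may simply be unset):

  # print the switch path Slurm reports for every task in a 2-node, 4-task job
  srun -N 2 --ntasks-per-node=2 sh -c 'echo "$(hostname): $SLURM_TOPOLOGY_ADDR"'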
twu...@goodyear.com writes:
> In the past (v 1.6.4-) we used mpirun args of
>
> --mca mpi_paffinity_alone 1 --mca btl openib,tcp,sm,self
>
> with lsf 7.0.6, and this was enough to make cores not be oversubscribed when
> submitting 2 or more jobs to the same node.
[I'm puzzled by that. It should
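For reference, a sketch of the 1.8/1.10-era equivalent, where mpi_paffinity_alone has been superseded by explicit binding options (defaults differ between versions, so the actual placement is worth checking with --report-bindings; ./a.out is a placeholder for the real executable):

  mpirun --bind-to core --report-bindings \
         --mca btl openib,tcp,sm,self ./a.out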
Jeff Hammond writes:
> Just using Intel compilers, OpenMP and MPI. Problem solved :-)
>
> (I work for Intel and the previous statement should be interpreted as a
> joke,
Good!
> although Intel OpenMP and MPI interoperate as well as any
> implementations of which I am aware.)
Better than MPC (
[Catching up...]
Rob Latham writes:
> Do you use any of the other ROMIO file system drivers? If you don't
> know if you do, or don't know what a ROMIO file system driver is, then
> it's unlikely you are using one.
>
> What if you use a driver and it's not on the list? First off, let me
> know
Can you create a small example to reproduce the problem?
The TCP and verbs-based transports use different thresholds and protocols, and
can sometimes bring to light errors in the application (e.g., the application
is making assumptions that just happen to be true for TCP, but not necessarily
fo
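A simple way to exercise both transports with the same binary (./a.out is a placeholder for the real application):

  # TCP only
  mpirun -np 4 --mca btl tcp,self ./a.out
  # verbs/openib
  mpirun -np 4 --mca btl openib,sm,self ./a.out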
That could be a bug in openib, Open MPI and/or your application.
For example, a memory corruption could go unnoticed with tcp, but might
cause openib to hang.
You can start by running your program under a memory debugger
(valgrind, ddt or other) and confirming your application works fine.
You can also up
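A sketch of that memory-debugger run, assuming valgrind is installed on the compute nodes and using ./a.out as a placeholder for the real application:

  # one valgrind log per process (%p expands to the pid)
  mpirun -np 4 valgrind --leak-check=full --track-origins=yes \
         --log-file=valgrind.%p.log ./a.out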
Gilles,
Actually, there are some more strange things.
With the same environment and MPI version, I wrote a simple program
using the same communication logic as my hanging program.
The simple program works without hanging.
So what are the possible reasons? I can try them one by one.
Or can I debug
You can try a more recent version of Open MPI;
1.10.2 was released recently, or try with a nightly snapshot of master.
If all of these still fail, can you post a trimmed version of your program so
we can investigate?
Cheers,
Gilles
Eva wrote:
>Gilles,
>
>>>Can you try to
>>>mpirun --mca btl t
Gilles,
>>Can you try to
>>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
>>and confirm it works fine with TCP *and* without eager ?
I have tried this and it works.
So what should I do next?
2016-01-21 16:25 GMT+08:00 Eva :
> Thanks Gilles.
> it works fine on tcp
> So I use this to
Can you try to
mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
and confirm it works fine with TCP *and* without eager ?
Cheers,
Gilles
On 1/21/2016 5:25 PM, Eva wrote:
Thanks Gilles.
it works fine on tcp
So I use this to disable eager:
-mca btl_openib_use_eager_rdma 0 -mca btl_open
Thanks Gilles.
it works fine on tcp
So I use this to disable eager:
-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
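For reference, the complete command line those flags correspond to, with ./a.out standing in for the real application:

  mpirun -np 4 --mca btl_openib_use_eager_rdma 0 \
         --mca btl_openib_max_eager_rdma 0 ./a.out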
2016-01-21 13:10 GMT+08:00 Eva :
> I run with two machines, 2 process per node: process0, process1, process2,
> process3.
> After some random rounds of communicat
and by the way, you did
mpirun --mca btl_tcp_eager_limit 56
in order to disable eager mode, right ?
--mca btl_tcp_rndv_eager_limit 0
does something different
Cheers,
Gilles
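One way to double-check which eager-related parameters exist and what they are currently set to, assuming an ompi_info recent enough to accept --level (1.7 and later):

  ompi_info --param btl tcp --level 9 | grep -i eager
  ompi_info --param btl openib --level 9 | grep -i eager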
On 1/21/2016 2:10 PM, Eva wrote:
I run with two machines, 2 process per node: process0, process1,
process2, process3.
Af
Hi,
can you post a trimmed version of your program so we can reproduce and
analyze the hang ?
Cheers,
Gilles
On 1/21/2016 2:10 PM, Eva wrote:
I run with two machines, 2 process per node: process0, process1,
process2, process3.
After some random rounds of communications, the communication ha
I run with two machines, 2 process per node: process0, process1, process2,
process3.
After some random rounds of communications, the communication hangs. When I
debugged the program, I found:
process1 sent a message to process2;
process2 received the message from process1 and then start to receiv
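A common way to see exactly where each rank is blocked is to attach a debugger to the stuck processes; a sketch, with <pid> as a placeholder for the process id of one hung rank on a node:

  # dump all thread backtraces without an interactive session
  gdb -batch -ex 'thread apply all bt' -p <pid>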