Ramps up over time; we had a bunch of locked-up nodes over the weekend and have
traced it back to this.
Let me see if I can share more details.
I will review with everyone tomorrow and get back to you.
Rolf vandeVaart wrote:
Hi Steven,
Thanks for the report. Very little has changed between 1.8.5 and 1.8.6 within
the CUDA-aware-specific code, so I am perplexed. It is also interesting that you
do not see the issue with 1.8.5 and CUDA 7.0.
You mentioned that it is hard to share the code on this, but maybe you could
share how
Hi Nick,
No. You have to use mpirun in this case. You need to ask for a larger batch
allocation than the initial mpirun requires. You do need to ask for a batch
allocation, though. Also note that mpirun doesn't currently work with nativized
Slurm. It's on my to-do list to fix.
Howard
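For illustration, a minimal sketch of the kind of manager program this advice applies to (not code from this thread; the worker executable name "./worker" is a placeholder). The idea is to request more slots from the batch system than the initial mpirun uses, e.g. start mpirun with a single process inside an allocation large enough for the spawned workers as well:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Spawn 3 workers into the unused slots of the batch allocation.
       "./worker" is a placeholder for whatever binary gets spawned. */
    MPI_Comm intercomm;
    int errcodes[3];
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 3, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &intercomm, errcodes);

    int nworkers;
    MPI_Comm_remote_size(intercomm, &nworkers);
    printf("manager: spawned %d workers\n", nworkers);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}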
Saliya,
On Tue, Jun 30, 2015 at 10:50 AM, Saliya Ekanayake wrote:
> Hi,
>
> I am experiencing a bottleneck with the allgatherv routine in one of our
> programs and wonder how it works internally. Could you please share some
> details on this?
>
Open MPI has a tunable approach to all the collecti
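To give a feel for what those tunable algorithms look like, here is a rough sketch of one classic choice, a ring-style allgatherv built from point-to-point calls. This is not Open MPI's actual implementation (the tuned component selects among several algorithms based on message and communicator size); it assumes a contiguous datatype and is only meant as intuition:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Ring allgatherv sketch: after p-1 steps every rank holds every block. */
static void ring_allgatherv(const void *sendbuf, int sendcount, MPI_Datatype dtype,
                            void *recvbuf, const int *recvcounts, const int *displs,
                            MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Aint lb, extent;
    MPI_Type_get_extent(dtype, &lb, &extent);
    char *rbuf = (char *)recvbuf;

    /* Place our own contribution (assumes a contiguous datatype). */
    memcpy(rbuf + (MPI_Aint)displs[rank] * extent, sendbuf,
           (size_t)sendcount * extent);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* In step s we forward the block that originated s ranks behind us
       and receive the next older block from the left neighbor. */
    for (int s = 0; s < size - 1; s++) {
        int send_block = (rank - s + size) % size;
        int recv_block = (rank - s - 1 + size) % size;
        MPI_Sendrecv(rbuf + (MPI_Aint)displs[send_block] * extent,
                     recvcounts[send_block], dtype, right, 0,
                     rbuf + (MPI_Aint)displs[recv_block] * extent,
                     recvcounts[recv_block], dtype, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Small demo: rank i contributes i+1 ints (assumes <= 64 ranks). */
    int counts[64], displs[64], total = 0;
    for (int i = 0; i < size; i++) { counts[i] = i + 1; displs[i] = total; total += counts[i]; }

    int sendbuf[64], recvbuf[64 * 65 / 2];
    for (int i = 0; i < counts[rank]; i++) sendbuf[i] = rank;

    ring_allgatherv(sendbuf, counts[rank], MPI_INT, recvbuf, counts, displs, MPI_COMM_WORLD);
    if (rank == 0) printf("gathered %d ints\n", total);

    MPI_Finalize();
    return 0;
}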
Hi All,
Looks like we have found a large memory leak. It is very difficult to share code
on this, but here are some details:
1.8.5 w/ CUDA 7.0 — no memory leak
1.8.5 w/ CUDA 6.5 — no memory leak
1.8.6 w/ CUDA 7.0 — large memory leak
MVAPICH2 2.1 GDR — no issue on eith
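Not from the original report (the actual reproducer was not shared), but a rough sketch of the kind of loop one could use to watch for this: two ranks repeatedly exchange GPU buffers through the CUDA-aware path while printing the host's max RSS, so growth shows up across iterations:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Device buffers handed straight to MPI (requires a CUDA-aware build).
       Assumes exactly 2 ranks. */
    const size_t nbytes = 1 << 20;
    void *dsend = NULL, *drecv = NULL;
    cudaMalloc(&dsend, nbytes);
    cudaMalloc(&drecv, nbytes);

    int peer = rank ^ 1;
    for (int i = 0; i < 100000; i++) {
        MPI_Sendrecv(dsend, (int)nbytes, MPI_BYTE, peer, 0,
                     drecv, (int)nbytes, MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0 && i % 10000 == 0) {
            struct rusage ru;
            getrusage(RUSAGE_SELF, &ru);
            printf("iter %d: max RSS %ld kB\n", i, ru.ru_maxrss);
        }
    }

    cudaFree(dsend);
    cudaFree(drecv);
    MPI_Finalize();
    return 0;
}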
Howard,
I have one more question. Is it possible to use MPI_Comm_spawn when
launching an Open MPI job with aprun? I'm getting this error when I try:
nradclif@kay:/lus/scratch/nradclif> aprun -n 1 -N 1 ./manager
[nid00036:21772] [[14952,0],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c
at lin
Hi,
I am experiencing a bottleneck with the allgatherv routine in one of our
programs and wonder how it works internally. Could you please share some
details on this?
I found this paper [1] from Gropp discussing an efficient implementation.
Is this similar to what we get in Open MPI?
[1]
http://
Hi Thomas,
As far as I know, MPI does _not_ guarantee asynchronous progress
(unlike OpenSHMEM), because it would require some implementations to
start a progress thread.
Jeff has a nice blog post regarding this:
http://blogs.cisco.com/performance/mpi-progress
I was surprised to see this behavior i
On 06/29/15 17:25, Nathan Hjelm wrote:
This is not a configuration issue. On 1.8.x and master we use two-sided
communication to emulate one-sided. Since we do not currently have
async progress, this requires the target to call into MPI to progress RMA
communication.
This will change in 2.x. I w
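As a concrete illustration of that last point (a sketch, not code from the thread): with the emulated one-sided path, an origin's passive-target epoch may not complete until the target makes some MPI call, so a target that is busy in a long compute phase can stall the origin unless it periodically pokes the library, e.g. with MPI_Iprobe:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Run with 2 ranks. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = 0;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Origin: put a value into rank 1's window. With one-sided
           emulated over two-sided, the unlock may not return until
           rank 1 enters the MPI library. */
        int one = 1;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
        printf("origin: put completed\n");
    } else if (rank == 1) {
        /* Target: a long "compute" phase with no MPI calls would delay
           rank 0. Calling MPI_Iprobe now and then drives the progress
           engine so the emulated put can complete sooner. */
        for (int i = 0; i < 100; i++) {
            usleep(10000);               /* stand-in for real work */
            int flag;
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, MPI_STATUS_IGNORE);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}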