Noam Bernstein <noam.bernst...@nrl.navy.mil> writes:

> On Dec 18, 2013, at 10:32 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:
>
>> Noam Bernstein <noam.bernst...@nrl.navy.mil> writes:
>> 
>>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (a lockup in
>>> some collective communication), but now I'm wondering whether I should
>>> just test 1.6.5.
>> 
>> What bug, exactly?  As you mentioned vasp, is it specifically affecting
>> that?
>
> Yes - I never characterized it fully, but we attached with gdb to every
> single vasp running process, and all were stuck in the same
> call to MPI_allreduce() every time. It's only happening on rather large
> jobs, so it's not the easiest setup to debug.

Maybe that's a different problem.  I know they tried multiple versions
of vasp, which had different failures.  Actually, I just remembered that
the version I examined with padb was built with the intel compiler but
run under a gcc-built openmpi (I know...), though builds of vasp with
gcc failed too.  I
don't know if that was taken up with the developers.

I guess this isn't the place to discuss vasp, unless it's helping to pin
down an ompi problem, but people might benefit from notes of problems in
the archive.
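
For the archive's sake, one more note on why "every rank stuck in the
same MPI_Allreduce" is diagnostic: the usual application-level cause is
a mismatch in collective calls across ranks, so if all ranks genuinely
make matching calls and it still hangs, an MPI-internal problem becomes
the more plausible suspect.  A minimal sketch of the application-level
failure mode, purely illustrative and nothing to do with vasp's actual
code:

/* Sketch: if one rank skips (or mismatches) a collective call, every
 * other rank blocks forever, and a debugger attached to any of them
 * shows the same MPI_Allreduce frame -- the symptom described above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        /* Ranks 1..N-1 enter the collective; rank 0 never does, so
         * these calls can never complete. */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    printf("rank %d past the collective\n", rank);
    MPI_Finalize();
    return 0;
}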

> If I can reproduce the problem with 1.6.5, and I can confirm that it's always 
> locking up in the same call to mpi_allreduce, and all processes are stuck 
> in the same call, is there interest in looking into a possible mpi issue?  

I'd have thought so from the point of view of those of us running 1.6
for compatibility with the RHEL6 openmpi.

Thanks for the info, anyhow.

Incidentally, if vasp is built to use ompi's alltoallv -- I understand
it also has its own implementation of that, or something similar --
<http://www.open-mpi.org/community/lists/users/2013/10/22804.php> may be
relevant, if you haven't seen it.
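
For anyone finding this in the archive, MPI_Alltoallv is the
variable-count all-to-all exchange.  A small standalone sketch of a
call, with hypothetical counts and buffers rather than vasp's actual
communication pattern:

/* Each rank r sends (r+1) ints to every peer, so each rank receives
 * (i+1) ints from rank i: per-peer counts differ, which is what
 * distinguishes MPI_Alltoallv from plain MPI_Alltoall. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs, total_recv = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *sdispls    = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *rdispls    = malloc(nprocs * sizeof(int));

    for (int i = 0; i < nprocs; i++) {
        sendcounts[i] = rank + 1;          /* same count to each peer */
        sdispls[i]    = i * (rank + 1);
        recvcounts[i] = i + 1;             /* but varying per sender */
        rdispls[i]    = total_recv;
        total_recv   += i + 1;
    }

    int *sendbuf = malloc(nprocs * (rank + 1) * sizeof(int));
    int *recvbuf = malloc(total_recv * sizeof(int));
    for (int i = 0; i < nprocs * (rank + 1); i++)
        sendbuf[i] = rank;

    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT,
                  MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(sdispls);
    free(recvcounts); free(rdispls);
    MPI_Finalize();
    return 0;
}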
