On Wed, Jun 27, 2012 at 02:30:11PM -0400, Jeff Squyres wrote:
> On Jun 27, 2012, at 2:25 PM, Martin Siegert wrote:
> 
> >> http://www.open-mpi.org/~jsquyres/unofficial/openmpi-1.6.1ticket3131r26612M.tar.bz2
> > 
> > Thanks! I tried this and, indeed, the program (I tested quantum espresso,
> > pw.x, so far) no longer hangs.
> 
> Good!  We're doing a bit more definitive testing here (took a little while to 
> figure out how to do that, but we're in process of doing that now...) before 
> we let this go out into the wild.
> 
> > Then I went one step further and benchmarked the following three cases:
> > 
> > 1) pw.x compiled with openmpi-1.3.3
> > 2) pw.x compiled with openmpi-1.4.3 and
> >   btl_openib_flags = 305
> >   btl_openib_eager_limit = 65536
> >   in etc/openmpi-mca-params.conf
> > 3) pw.x compiled with openmpi-1.6.1ticket3131r26612M
> > 
> > These are the results as time (in seconds) per iteration; smaller is better:
> > 1) 33.11
> > 2) 28.23
> > 3) 34.81
> > 
> > That's rather disappointing, isn't it?
> 
> 
> Yes, it is.  But #2 is not really comparable with #1 and #3.  It's quite
> possible that with newer IB hardware, the eager limit should be bumped
> up by default.
> 
> I leave this to Mellanox to figure out...

Good point ... I should run all three cases with the eager limit set to
65536.
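
A minimal sketch of how such a like-for-like rerun could be launched, passing
the eager limit on the mpirun command line via --mca instead of editing each
installation's etc/openmpi-mca-params.conf. The install prefixes, the
"./pw.x" path, and the helper name build_mpirun_command are hypothetical
placeholders; pw.x's own arguments are omitted, and each run would need the
pw.x binary built against the matching Open MPI.

```python
import shlex

# Hypothetical install prefixes; substitute the real location of each build.
PREFIXES = {
    "openmpi-1.3.3": "/opt/openmpi-1.3.3",
    "openmpi-1.4.3": "/opt/openmpi-1.4.3",
    "openmpi-1.6.1ticket3131r26612M": "/opt/openmpi-1.6.1ticket3131r26612M",
}

# The one setting that should be identical across all three runs.
COMMON_MCA = {"btl_openib_eager_limit": "65536"}


def build_mpirun_command(prefix, nprocs, program, mca=None):
    """Return the argv list for one benchmark run."""
    settings = COMMON_MCA if mca is None else mca
    argv = [f"{prefix}/bin/mpirun"]
    for name, value in settings.items():
        # "--mca <name> <value>" sets an MCA parameter for this run only.
        argv += ["--mca", name, str(value)]
    argv += ["-np", str(nprocs), program]
    return argv


for label, prefix in PREFIXES.items():
    # Placeholder program path: use the pw.x compiled with this Open MPI.
    print(label, "->", shlex.join(build_mpirun_command(prefix, 32, "./pw.x")))
```

Setting the parameter per run keeps the three installations' config files
untouched, so the comparison does not depend on which file was edited last.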

However, there is another issue that may affect the performance of the 1.6.1
version. I see a LOT of the following messages on stderr:

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host:    b413
  Device:        mlx4_0
  Function:      openib_reg_mr()
  Errno says:    Cannot allocate memory (errno=12)

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[b414:15870] 168 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[b414:15870] 131 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 8 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 209 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 144 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
...

The strange thing is that this job used 32 processors (cores), so I have
no idea what the "168 more processes", etc., are referring to (there is
nothing in /var/log/messages about this).
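
As a quick sanity check on those counts, here is a small sketch that sums the
aggregated notices shown so far. The function name tally_help_notices and the
regular expression are illustrative and not part of Open MPI. The six notices
alone add up to 661 reports from a 32-process job, which suggests (an
inference, not something I have confirmed) that the counter tallies repeated
reports rather than distinct processes.

```python
import re

# Aggregated notices copied from the stderr output above, with the
# mail client's line wrapping undone.
STDERR_SAMPLE = """\
[b414:15870] 168 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[b414:15870] 131 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 8 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 209 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 144 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
"""

# Matches both "N more processes have sent" and "1 more process has sent".
NOTICE = re.compile(
    r"(\d+) more process(?:es)? ha(?:ve|s) sent help message (\S+) / (\S+)"
)


def tally_help_notices(text):
    """Sum the aggregated counts per (help file, topic) pair."""
    totals = {}
    for count, filename, topic in NOTICE.findall(text):
        key = (filename, topic)
        totals[key] = totals.get(key, 0) + int(count)
    return totals


totals = tally_help_notices(STDERR_SAMPLE)
for (filename, topic), total in totals.items():
    print(f"{filename} / {topic}: {total} aggregated reports")
```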

The messages do not appear to be fatal. Nevertheless, do you know what
causes them?
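
One thing that can be ruled out cheaply on a compute node is the locked-memory
limit, since memory registered with the HCA has to be pinned. This is an
assumption about a possible cause, not a diagnosis; nothing so far shows that
the limit is the problem. A minimal sketch that decodes the errno from the
message and prints the limit seen by the current process (the helper
describe_limit is illustrative):

```python
import errno
import os
import resource

# errno=12 in the openib message is ENOMEM.
print(errno.errorcode[12], "-", os.strerror(12))


def describe_limit(value):
    """Render an rlimit value, which is in bytes for RLIMIT_MEMLOCK."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


# Soft and hard limits on memory this process may lock (pin).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("max locked memory: soft =", describe_limit(soft),
      "hard =", describe_limit(hard))
```

The limit inside a batch job can differ from the one in an interactive login
shell, so the check is only meaningful when run the same way the MPI
processes are started.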

Cheers,
Martin
