On Wed, Jun 27, 2012 at 02:30:11PM -0400, Jeff Squyres wrote:
> On Jun 27, 2012, at 2:25 PM, Martin Siegert wrote:
>
> >> http://www.open-mpi.org/~jsquyres/unofficial/openmpi-1.6.1ticket3131r26612M.tar.bz2
> >
> > Thanks! I tried this and, indeed, the program (I tested quantum espresso,
> > pw.x, so far) no longer hangs.
>
> Good! We're doing a bit more definitive testing here (took a little while to
> figure out how to do that, but we're in process of doing that now...) before
> we let this go out into the wild.
>
> > Then I went one step further and benchmarked the following three cases:
> >
> > 1) pw.x compiled with openmpi-1.3.3
> > 2) pw.x compiled with openmpi-1.4.3 and
> >      btl_openib_flags = 305
> >      btl_openib_eager_limit = 65536
> >    in etc/openmpi-mca-params.conf
> > 3) pw.x compiled with openmpi-1.6.1ticket3131r26612M
> >
> > These are the results time (in seconds) per iteration - smaller is better:
> > 1) 33.11
> > 2) 28.23
> > 3) 34.81
> >
> > That's rather disappointing, isn't it?
>
> Yes, it is. But #2 is not really comparable with #1 and #3. It's quite
> possible that with newer IB hardware, the eager limit should be bumped
> up by default.
>
> I leave this to Mellanox to figure out...

Good point ... I should run all three cases with the eager limit set to
65536.

However, there is another issue that may affect the performance of the
1.6.1 version. I see a LOT of the following messages on stderr:

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host:    b413
  Device:        mlx4_0
  Function:      openib_reg_mr()
  Errno says:    Cannot allocate memory (errno=12)

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[b414:15870] 168 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[b414:15870] 131 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 8 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 209 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[b414:15870] 144 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
...

The strange thing is that this job used only 32 processors (cores), so I
have no idea what the "168 more processes", etc., are referring to. There
is nothing in /var/log/messages about this either.

The messages do not appear to be fatal. Nevertheless, do you know what
causes these error messages?

Cheers,
Martin
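
P.S. To put a number on how often these messages are reported, the
aggregated counts in a captured stderr file could be totalled with
something like the Python sketch below. It assumes the aggregate lines have
exactly the format shown above. The function name is made up for
illustration, and the sketch does not settle whether the counts refer to
distinct processes or to repeated messages from the same ranks.

```python
import re

# Matches the aggregate lines shown above, e.g.
# "[b414:15870] 168 more processes have sent help message
#  help-mpi-btl-openib.txt / mem-reg-fail"
# Assumption: the line format is exactly as it appears in this mail.
AGGREGATE_RE = re.compile(
    r"(\d+) more process(?:es)? ha(?:ve|s) sent help message (\S+) / (\S+)"
)


def tally_help_messages(text):
    """Sum the reported counts per (help file, topic) pair.

    Returns a dict mapping (help_file, topic) to the total of all
    "N more process(es)" counts found in the given text.
    """
    totals = {}
    for count, help_file, topic in AGGREGATE_RE.findall(text):
        key = (help_file, topic)
        totals[key] = totals.get(key, 0) + int(count)
    return totals
```

For example, with stderr saved to a file (the name job.err is hypothetical),
tally_help_messages(open("job.err").read()) returns one total per help
topic. The totals do not include the first, fully printed occurrence of
each message.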