I have some users who are reporting errors with OpenIB on Mellanox gear. It 
tends to affect larger jobs (64 - 256 cores). It is not reliably reproducible, 
but it happens with regularity. An example error is at the end of this message.

The nodes have 64GB of memory and the IB driver is set with:
options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=24 log_mtts_per_seg=1

Which, if I read it right, should let one register 128GB.
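
For reference, here is a rough sketch of the arithmetic behind that 128GB figure. It assumes the commonly cited formula (number of MTT segments times MTTs per segment times the page size) and a 4 KiB page size; neither assumption is confirmed for these nodes, and the function name is just for illustration.

```python
# Sketch: estimate the maximum registrable memory for mlx4_core.
# Assumptions: the commonly cited formula
#   (2**log_num_mtt) * (2**log_mtts_per_seg) * page_size
# and a 4 KiB page size. Check both against your own system.

def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Return the estimated upper bound on registrable memory in bytes."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size

# Values from the modprobe options line above.
limit = max_registrable_bytes(log_num_mtt=24, log_mtts_per_seg=1)
print(f"{limit / 2**30:.0f} GiB")  # prints "128 GiB"
```

That is twice the 64GB of physical memory on each node, so on paper the MTT settings should not be the limiting factor.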

We did make one change. We saw codes taking huge performance hits, with 
khugepaged consuming 100% CPU. We found that we could get the expected 
performance if we disabled memory defrag for huge pages but left transparent 
huge pages enabled:

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[never] 
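
If anyone wants to compare settings across nodes, a minimal sketch like the following reads the active values. It assumes the RHEL 6 sysfs path shown above, that an "enabled" file sits next to "defrag", and that the active choice is the token in square brackets; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Path taken from the cat command above; assumed to be RHEL 6 specific.
THP_DIR = Path("/sys/kernel/mm/redhat_transparent_hugepage")

def selected_option(text):
    """Return the bracketed token, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None

def thp_settings(base=THP_DIR):
    """Read the active 'enabled' and 'defrag' values; None if a file is unreadable."""
    result = {}
    for name in ("enabled", "defrag"):
        try:
            result[name] = selected_option((base / name).read_text())
        except OSError:
            # File missing or not readable (e.g. a different kernel layout).
            result[name] = None
    return result

print(thp_settings())
```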

Is this possibly related?  We didn't have reports before then.  Has anyone seen 
anything similar?


--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host:    mlx4_0
  Device:        openib_reg_mr
  Function:      Cannot allocate memory()
  Errno says:    [several lines of unprintable binary data]

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[nyx5641.engin.umich.edu:30080] 99407 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages
[nyx5641.engin.umich.edu:30080] 54493 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 1 more process has sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76831 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76800 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 1 more process has sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76834 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 104597 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 94309 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 96283 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 88849 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 87245 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr]
 error posting receive descriptors to shared receive queue 2 (6 from 107)
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr]
 error posting receive descriptors to shared receive queue 2 (0 from 106)
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr]
 error posting receive descriptors to shared receive queue 2 (0 from 105)
[nyx5641.engin.umich.edu:30080] 4868 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 557 more processes have sent help message 
help-mpi-btl-openib.txt / mem-reg-fail



Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985


