I have some users who are reporting errors with OpenIB on Mellanox gear. It tends to affect larger jobs (64 - 256 cores). It is not reliably reproducible, but it happens with regularity. Example error below:
The nodes have 64GB of memory and the IB driver is set with:

options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=24 log_mtts_per_seg=1

which, if I read it right, should let one register 128GB.

We did make one change: we saw codes taking huge performance hits, with khugepaged consuming 100% CPU. We found that we could get the expected performance if we disabled memory defrag for huge pages but left transparent huge pages enabled:

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[never]

Is this possibly related? We didn't have reports before then. Has anyone seen anything similar?

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host:    mlx4_0
  Device:        openib_reg_mr
  Function:      Cannot allocate memory()
  Errno says:    [unprintable binary data omitted]

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
[nyx5641.engin.umich.edu:30080] 99407 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nyx5641.engin.umich.edu:30080] 54493 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76831 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76800 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 1 more process has sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 76834 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 104597 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 94309 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 96283 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 88849 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 87245 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr] error posting receive descriptors to shared receive queue 2 (6 from 107)
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr] error posting receive descriptors to shared receive queue 2 (0 from 106)
[nyx5694.engin.umich.edu][[55235,1],50][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3707:mca_btl_openib_post_srr] error posting receive descriptors to shared receive queue 2 (0 from 105)
[nyx5641.engin.umich.edu:30080] 4868 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail
[nyx5641.engin.umich.edu:30080] 557 more processes have sent help message help-mpi-btl-openib.txt / mem-reg-fail

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985
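
P.S. For anyone who wants to check my reading of the 128GB figure, here is a rough sketch of the arithmetic. It assumes the commonly cited formula for mlx4_core (registrable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size) and a 4 KiB page size; neither is confirmed above, and the function name is only illustrative.

```python
# Sketch only: upper bound on registrable memory implied by the mlx4_core
# MTT module parameters. The formula and the 4 KiB page size are assumptions,
# not something verified on these nodes (check with `getconf PAGE_SIZE`).
PAGE_SIZE = 4096  # bytes, assumed


def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Return (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size in bytes."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


# Values taken from the modprobe options line above.
limit = max_registrable_bytes(log_num_mtt=24, log_mtts_per_seg=1)
print(limit // 2**30, "GiB")  # prints "128 GiB"
```

If that formula holds, the limit is twice the 64GB of physical memory, so the MTT settings by themselves should not be what is running out.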