On Oct 29, 2021, at 07:39, Julien Rey via lustre-discuss 
<[email protected]> wrote:

Hello,

This may not be directly related to Lustre, but here's what I get when I try to 
mount our Lustre filesystem on one of our compute nodes running CentOS 7:


Oct 29 14:30:20 gpu-node8 kernel: SLUB: Unable to allocate memory on node -1 (gfp=0x8050)

There doesn't look to be anything "wrong" here: -1 means "no specific node", 
and the GFP mask 0x8050 is __GFP_ZERO | __GFP_IO | __GFP_WAIT for this kernel.
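
For anyone who wants to check that decoding, here is a minimal Python sketch. The bit values in the table are an assumption based on include/linux/gfp.h in 3.10-era kernels; they differ in other kernel versions, and the helper name decode_gfp is made up for illustration.

```python
# Assumed bit values for a 3.10-era kernel (include/linux/gfp.h).
# Only the three flags named above are listed; other kernels use other values.
GFP_BITS_3_10 = {
    0x10: "__GFP_WAIT",
    0x40: "__GFP_IO",
    0x8000: "__GFP_ZERO",
}


def decode_gfp(mask, table=GFP_BITS_3_10):
    """Return the flag names set in mask, highest bit first.

    Bits that are not in the table are appended as one hex remainder,
    so an incomplete table is visible rather than silently ignored.
    """
    names = [name for bit, name in sorted(table.items(), reverse=True) if mask & bit]
    known = sum(bit for bit in table if mask & bit)
    leftover = mask & ~known
    if leftover:
        names.append(hex(leftover))
    return " | ".join(names)


print(decode_gfp(0x8050))  # __GFP_ZERO | __GFP_IO | __GFP_WAIT
```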

The one time I saw problems like this, all the DIMMs had been installed on one 
socket of a dual-socket NUMA motherboard, so no memory was available on the 
other socket, but only some allocations failed.
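
One way to check for that situation is to look at the per-node memory totals the kernel exposes under /sys/devices/system/node/. The sketch below assumes the usual "Node N MemTotal: ... kB" line format of those meminfo files; the function names are hypothetical, and `numactl --hardware` reports the same information if it is installed.

```python
import glob
import re

# Matches lines such as "Node 0 MemTotal:       16316416 kB"
_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s+kB")


def parse_node_meminfo(text):
    """Return {node_id: {field: kB}} for the MemTotal and MemFree fields."""
    result = {}
    for m in _LINE.finditer(text):
        node, field, kb = int(m.group(1)), m.group(2), int(m.group(3))
        if field in ("MemTotal", "MemFree"):
            result.setdefault(node, {})[field] = kb
    return result


def empty_nodes(per_node):
    """Return the sorted IDs of NUMA nodes that report no memory at all."""
    return sorted(n for n, f in per_node.items() if f.get("MemTotal", 0) == 0)


if __name__ == "__main__":
    text = ""
    for path in glob.glob("/sys/devices/system/node/node*/meminfo"):
        with open(path) as fh:
            text += fh.read()
    per_node = parse_node_meminfo(text)
    for node in sorted(per_node):
        print("node", node, per_node[node])
    print("nodes without memory:", empty_nodes(per_node))
```

A non-empty list on the last line would point to the one-socket DIMM layout described above.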

Cheers, Andreas

Oct 29 14:30:20 gpu-node8 kernel:  cache: dm_rq_target_io, object size: 136, buffer size: 136, default order: 0, min order: 0
Oct 29 14:30:20 gpu-node8 kernel:  node 1: slabs: 2, objs: 60, free: 0
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3097:0:(niobuf.c:994:ptlrpc_register_rqbd()) LNetMDAttach failed: -12;
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3097:0:(service.c:2551:ptlrpc_main()) Failed to post rqbd for ldlm_cbd on CPT 0: -1
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(service.c:2917:ptlrpc_start_threads()) cannot start ldlm_cb thread #0_0: rc -1
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(service.c:837:ptlrpc_register_service()) Failed to start threads for service ldlm_cbd: -1
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(ldlm_lockd.c:3077:ldlm_setup()) failed to start service
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(ldlm_lib.c:462:client_obd_setup()) ldlm_get_ref failed: -1
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(obd_config.c:559:class_setup()) setup MGC10.0.1.70@tcp failed (-1)
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(obd_mount.c:202:lustre_start_simple()) MGC10.0.1.70@tcp setup error -1
Oct 29 14:30:20 gpu-node8 kernel: LustreError: 3091:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-1)
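
The -12 in the LNetMDAttach line is a negated errno value, and 12 is ENOMEM, which matches the SLUB allocation failure logged just before it. A small sketch using only the Python standard library shows the lookup; describe_kernel_rc is a hypothetical helper name, and the sketch does not try to interpret the later -1 return codes.

```python
import errno
import os


def describe_kernel_rc(rc):
    """Map a negative kernel return code such as -12 to its errno name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


print(describe_kernel_rc(-12))
```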


I've been scratching my head over this one. It could just be a kernel bug, but 
we have 3 other identical servers running the exact same versions of CentOS 7 
and the Lustre client, and I have had no problems with them.

Some more info:

[root@gpu-node8 ~]# uname -r
3.10.0-1160.el7.x86_64

[root@gpu-node8 ~]# lctl --version
lctl 2.12.7

[root@gpu-node8 ~]# vmstat -m |grep dm_rq_target_io
dm_rq_target_io              60     60    136     30

[root@gpu-node8 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        1.4G         29G         10M        117M         29G
Swap:           15G          0B         15G


I've been playing with the sysctl parameters but I don't really know what I'm 
doing and got no result anyway:

sysctl vm.overcommit_memory=1

sysctl vm.min_free_kbytes=90112

sysctl vm.overcommit_kbytes=90112


Any help would be greatly appreciated.

Thanks!

--
Julien REY

Plate-forme RPBS
Modélisation Computationnelle des Interactions Protéines-Ligand (CMPLI)
Université de Paris
tel : 01 57 27 83 95

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Andreas Dilger
Lustre Principal Architect
Whamcloud






