> That would be interesting. About a dozen copies of
>   cat /proc/$PID/stack
> taken in quick succession would be best, where $PID is the pid of
> the shell process which wrote to drop_caches.
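For reference, a rough sketch of how I plan to capture those (it
backgrounds the write so the shell can see the writer's pid; needs root):

  # start the slow write in the background and note its pid
  sh -c 'echo 3 > /proc/sys/vm/drop_caches' &
  PID=$!
  # sample the writer's kernel stack about a dozen times
  for i in $(seq 1 12); do
      cat /proc/$PID/stack
      echo ----
      sleep 0.5
  done
  wait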
Will do later today. I have found a candidate node with the problem, just
need to wait for the current task to finish.

> signal_cache should have one entry for each process (or thread-group).

That is what I thought as well; looking at the kernel source, allocations
from signal_cache happen only during fork.

> It holds the signal_struct structure that is shared among the threads
> in a group.
> So 3.7 million signal_structs suggests there are 3.7 million processes
> on the system. I don't think Linux supports more than 4 million, so
> that is one very busy system.

Not quite. top shows:

Tasks: 3048 total, 273 running, 2775 sleeping, 0 stopped, 0 zombie

slabinfo (note that this is a different node than in my original email):

slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfs_direct_cache 0 0 352 46 4 : tunables 0 0 0 : slabdata 0 0 0
nfs_commit_data 46 46 704 46 8 : tunables 0 0 0 : slabdata 1 1 0
nfs_inode_cache 25110 25110 1048 31 8 : tunables 0 0 0 : slabdata 810 810 0
fscache_cookie_jar 552 552 88 46 1 : tunables 0 0 0 : slabdata 12 12 0
iser_descriptors 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
t10_alua_lu_gp_cache 40 40 200 40 2 : tunables 0 0 0 : slabdata 1 1 0
t10_pr_reg_cache 0 0 696 47 8 : tunables 0 0 0 : slabdata 0 0 0
se_sess_cache 10728 10728 896 36 8 : tunables 0 0 0 : slabdata 298 298 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 136 60 2 : tunables 0 0 0 : slabdata 0 0 0
nfs4_layout_stateid 0 0 296 55 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 240 68 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_files 0 0 288 56 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_lockowners 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_openowners 0 0 440 74 8 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 1122 1122 640 51 8 : tunables 0 0 0 : slabdata 22 22 0
vvp_object_kmem 5805496 5819230 176 46 2 : tunables 0 0 0 : slabdata 126505 126505 0
ll_thread_kmem 28341 28341 344 47 4 : tunables 0 0 0 : slabdata 603 603 0
lov_session_kmem 28636 29370 592 55 8 : tunables 0 0 0 : slabdata 534 534 0
osc_extent_kmem 6410367 6423408 168 48 2 : tunables 0 0 0 : slabdata 133821 133821 0
osc_thread_kmem 13409 13453 2832 11 8 : tunables 0 0 0 : slabdata 1223 1223 0
osc_object_kmem 6401946 6417982 304 53 4 : tunables 0 0 0 : slabdata 121094 121094 0
ldlm_locks 120640 120960 512 64 8 : tunables 0 0 0 : slabdata 1890 1890 0
ptlrpc_cache 86142 86142 768 42 8 : tunables 0 0 0 : slabdata 2051 2051 0
ll_import_cache 0 0 1480 22 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 21216 21216 208 78 4 : tunables 0 0 0 : slabdata 272 272 0
ll_obd_dev_cache 72 72 3960 8 8 : tunables 0 0 0 : slabdata 9 9 0
ext4_groupinfo_4k 240 240 136 60 2 : tunables 0 0 0 : slabdata 4 4 0
ext4_inode_cache 74776 78275 1032 31 8 : tunables 0 0 0 : slabdata 2525 2525 0
ext4_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_data 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_allocation_context 17408 17408 128 64 2 : tunables 0 0 0 : slabdata 272 272 0
ext4_io_end 15232 15232 72 56 1 : tunables 0 0 0 : slabdata 272 272 0
ext4_extent_status 254554 256938 40 102 1 : tunables 0 0 0 : slabdata 2519 2519 0
jbd2_journal_handle 0 0 48 85 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_journal_head 0 0 112 73 2 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_table_s 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
jbd2_revoke_record_s 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
ip6_dst_cache 2701 2701 448 73 8 : tunables 0 0 0 : slabdata 37 37 0
RAWv6 286 286 1216 26 8 : tunables 0 0 0 : slabdata 11 11 0
UDPLITEv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 4550 4550 1216 26 8 : tunables 0 0 0 : slabdata 175 175 0
tw_sock_TCPv6 64 64 256 64 4 : tunables 0 0 0 : slabdata 1 1 0
TCPv6 4050 4050 2176 15 8 : tunables 0 0 0 : slabdata 270 270 0
cfq_io_cq 0 0 120 68 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 70 4 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
hugetlbfs_inode_cache 71992 79288 608 53 8 : tunables 0 0 0 : slabdata 1496 1496 0
dquot 0 0 256 64 4 : tunables 0 0 0 : slabdata 0 0 0
userfaultfd_ctx_cache 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
fanotify_event_info 7957 7957 56 73 1 : tunables 0 0 0 : slabdata 109 109 0
pid_namespace 0 0 2200 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 17952 17952 248 66 4 : tunables 0 0 0 : slabdata 272 272 0
UDP-Lite 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
flow_cache 33488 33488 144 56 2 : tunables 0 0 0 : slabdata 598 598 0
xfrm_dst_cache 29624 29624 576 56 8 : tunables 0 0 0 : slabdata 529 529 0
UDP 8190 8190 1088 30 8 : tunables 0 0 0 : slabdata 273 273 0
tw_sock_TCP 14656 14656 256 64 4 : tunables 0 0 0 : slabdata 229 229 0
TCP 4478 4544 1984 16 8 : tunables 0 0 0 : slabdata 284 284 0
inotify_inode_mark 7176 7176 88 46 1 : tunables 0 0 0 : slabdata 156 156 0
scsi_data_buffer 0 0 24 170 1 : tunables 0 0 0 : slabdata 0 0 0
blkdev_queue 14 14 2256 14 8 : tunables 0 0 0 : slabdata 1 1 0
blkdev_ioc 21216 21216 104 78 2 : tunables 0 0 0 : slabdata 272 272 0
user_namespace 0 0 480 68 8 : tunables 0 0 0 : slabdata 0 0 0
dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
sock_inode_cache 15708 15708 640 51 8 : tunables 0 0 0 : slabdata 308 308 0
net_namespace 0 0 5184 6 8 : tunables 0 0 0 : slabdata 0 0 0
Acpi-ParseExt 26600 26600 72 56 1 : tunables 0 0 0 : slabdata 475 475 0
Acpi-State 510 510 80 51 1 : tunables 0 0 0 : slabdata 10 10 0

> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.

The node from my original mail had been idle for days before I did the
test described.

> However, if the task_struct slab were particularly big, I suspect you
> would have included it in the list of large slabs - but you didn't.
> If signal_cache has more active entries than task_struct, then something
> has gone seriously wrong somewhere.

Indeed this is the case. The number of tasks and task_structs is far
smaller than the number of signal_cache entries.

> I doubt this problem is related to lustre.

Hmm. Interesting. It looks like __put_task_struct will call into
put_signal_struct, which will not free a signal_struct that is still
referenced by something else. I wonder if this could be related to the
log entries we see:

_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_1772/job_33959278/step_batch: Device or resource busy

And we are running with nohz_full, so this is going to be an interesting
problem to diagnose... But this seems to be going off on a tangent.
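For the record, a quick sketch of how the two counts can be compared
directly (field 2 of /proc/slabinfo is <active_objs>; reading it needs
root, and SLUB cache merging can hide individually named caches):

  # one signal_struct per thread-group, one task_struct per thread
  awk '$1 == "signal_cache" || $1 == "task_struct" { print $1, $2 }' /proc/slabinfo
  ps -e  --no-headers | wc -l   # processes, to compare with signal_cache
  ps -eL --no-headers | wc -l   # threads, to compare with task_struct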
Still, thank you for the useful hints and analysis.

Jacek Tomaka

On Tue, Apr 16, 2019 at 7:17 AM NeilBrown <[email protected]> wrote:
> On Mon, Apr 15 2019, Jacek Tomaka wrote:
>
> > Thanks Patrick for getting the ball rolling!
> >
> >> 1/ w.r.t drop_caches, "2" is *not* "inode and dentry". The '2' bit
> >>    causes all registered shrinkers to be run, until they report there is
> >>    nothing left that can be discarded. If this is taking 10 minutes,
> >>    then it seems likely that some shrinker is either very inefficient, or
> >>    is reporting that there is more work to be done, when really there
> >>    isn't.
> >
> > This is a pretty common problem on this hardware. KNL's CPU is running
> > at ~1.3GHz, so anything that is not multi-threaded can take a few times
> > more than on a "normal" XEON. While it would be nice to improve this
> > (by running it in multiple threads), this is not the problem here.
> > However I can provide you with a kernel call stack next time I see it,
> > if you are interested.
>
> That would be interesting. About a dozen copies of
>   cat /proc/$PID/stack
> taken in quick succession would be best, where $PID is the pid of
> the shell process which wrote to drop_caches.
>
> >> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
> >>     reclaims anything that can be reclaimed immediately.
> >
> > Awesome. I would just like to know how much easily available memory
> > there is on the system without actually reclaiming it and seeing,
> > ideally using normal kernel mechanisms, but if lustre provides a procfs
> > entry where I can get it, it will solve my immediate problem.
> >
> >> 4/ Patrick is right that accounting is best-effort. But we do want it
> >>    to improve.
> >
> > Accounting looks better when Lustre is not involved ;) Seriously, how
> > can I help? Should I raise a bug? Try to provide a patch?
> >
> >> Just last week there was a report
> >>   https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
> >> about making slab-allocated objects movable. If/when that gets off
> >> the ground, it should help the fragmentation problem, so more of the
> >> pages listed as reclaimable should actually be so.
> >
> > This is a very interesting article. While memory fragmentation makes it
> > more difficult to use huge pages, it is not directly related to the
> > problem of lustre kernel memory allocation accounting. It will be good
> > to see movable slabs, though.
> >
> > Also I am not sure how the high signal_cache can be explained, and if
> > anything can be done on the Lustre level?
>
> signal_cache should have one entry for each process (or thread-group).
> It holds the signal_struct structure that is shared among the threads
> in a group.
> So 3.7 million signal_structs suggests there are 3.7 million processes
> on the system. I don't think Linux supports more than 4 million, so
> that is one very busy system.
> Unless... the final "put" of a task_struct happens via call_rcu - so it
> can be delayed a while, normally 10s of milliseconds, but it can take
> seconds to clear a large backlog.
> So if you have lots of processes being created and destroyed very
> quickly, then you might get a backlog of task_struct, and the associated
> signal_struct, waiting to be destroyed.
> However, if the task_struct slab were particularly big, I suspect you
> would have included it in the list of large slabs - but you didn't.
> If signal_cache has more active entries than task_struct, then something
> has gone seriously wrong somewhere.
>
> I doubt this problem is related to lustre.
>
> NeilBrown

--
Jacek Tomaka
Geophysical Software Developer
DownUnder GeoSolutions
76 Kings Park Road
West Perth 6005 WA, Australia
tel +61 8 9287 4143
[email protected]
www.dug.com
