+1 to Warren's advice on checking for memory fragmentation. Are you seeing kmem allocation failures in dmesg on these hosts?
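For reference, these are the quick checks I'd run on one of the affected hosts (a sketch only; the exact log wording varies by kernel version, and the 8GB figure below is just Warren's number converted to kB):

    # Any high-order allocation failures in the kernel log?
    dmesg | grep -iE "page allocation failure|order:[0-9]"

    # Fragmentation state: one row per zone; the columns are counts of
    # free blocks of order 0..10 (4K, 8K, ..., 4M on x86_64). Numbers in
    # only the first column or two means nothing contiguous is left.
    cat /proc/buddyinfo

    # Current values of the VM tunables discussed downthread:
    sysctl vm.min_free_kbytes vm.swappiness vm.vfs_cache_pressure

    # Warren's 8GB reserve, expressed in kB:
    # sysctl -w vm.min_free_kbytes=8388608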
On 24 January 2018 at 10:44, Warren Wang <warren.w...@walmart.com> wrote:
> Check /proc/buddyinfo for memory fragmentation. We have some pretty severe
> memory fragmentation issues with Ceph, to the point where we keep an
> excessively large min_free_kbytes configured (8GB) and are starting to
> order more memory than we actually need. If you have a lot of objects, you
> may find that you need to increase vfs_cache_pressure as well, to something
> like the default of 100.
>
> In your buddyinfo, the columns show how many free blocks of each size are
> available. So if you only see numbers in the first two columns, you only
> have 4K and 8K blocks available, and any allocation larger than that will
> fail. The problem is so severe for us that we have stopped using jumbo
> frames, because packets were being dropped when the kernel could not
> DMA-map pages large enough to fit 9K frames.
>
> In short, you might have enough memory, just not contiguous memory. It's
> even worse on RGW nodes.
>
> Warren Wang
>
> On 1/23/18, 2:56 PM, "ceph-users on behalf of Samuel Taylor Liston"
> <ceph-users-boun...@lists.ceph.com on behalf of sam.lis...@utah.edu> wrote:
>
>     We have a 9-node cluster (16 x 8TB OSDs per node) running Jewel on
>     CentOS 7.4. The OSDs are configured with encryption. The cluster is
>     accessed via two RGWs, and there are three mon servers. The data pool
>     uses 6+3 erasure coding.
>
>     About two weeks ago I found two of the nine servers wedged and had to
>     hard power cycle them to get them back. In this hard reboot, 22 OSDs
>     came back with corrupted encryption or data partitions. These OSDs
>     were removed and recreated, and the resultant rebalance moved along
>     just fine for about a week. At the end of that week, two different
>     nodes became unresponsive, complaining of page allocation failures.
>     This is when I realized the nodes were heavily into swap. These nodes
>     were configured with 64GB of RAM as a cost-saving measure, going
>     against the 1GB-per-1TB recommendation. We have since doubled the RAM
>     in each node, giving each of them more than the 1GB-per-1TB ratio.
>
>     The issue I am running into is that these nodes are still swapping,
>     heavily, and over time they become unresponsive or throw page
>     allocation failures. As an example, "free" will show 15GB of RAM in
>     use (out of 128GB) and 32GB of swap. I have set vm.swappiness to 0 and
>     also raised vm.min_free_kbytes to 4GB to try to keep the kernel happy,
>     yet I am still filling up swap. It only occurs when the OSD partitions
>     are mounted and the ceph-osd daemons are active.
>
>     Anyone have an idea where this swap usage might be coming from?
>     Thanks for any insight,
>
>     Sam Liston (sam.lis...@utah.edu)
>     ====================================
>     Center for High Performance Computing
>     155 S. 1452 E. Rm 405
>     Salt Lake City, Utah 84112 (801)232-6932
>     ====================================

--
Cheers,
~Blairo

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com