[ceph-users] Workaround for XFS lockup resulting in down OSDs

2017-02-07 Thread Thorvald Natvig
Hi,

We've encountered a small "kernel feature" in XFS using Filestore. We
have a workaround, and would like to share in case others have the
same problem.

Under high load, on slow storage, with lots of dirty buffers and low
memory, there's a design choice with unfortunate side-effects when you
have multiple XFS filesystems mounted, as is often the case when you
have a JBOD full of drives. This results in network traffic
stalling, leading to OSDs failing heartbeats.

In short, when the kernel needs to allocate memory for anything, it
first figures out how many pages it needs, then goes to each
filesystem and says "release N pages". In XFS, that's implemented as
follows:

- For each AG (allocation group; 8 in our case):
  - Try to lock AG
  - Release unused buffers, up to N
- If this point is reached, and we didn't manage to release at least N
pages, try again, but this time wait for the lock.

That last part is the problem; if the lock is currently held by, say,
another kernel thread that is busy flushing dirty buffers, then the
memory allocation stalls. However, we have 30 other XFS
filesystems that could release memory, and the kernel also has a lot
of non-filesystem memory that could be released.

This manifests as OSDs going offline during high load, with other OSDs
claiming that the OSD stopped responding to health checks. This is
especially prevalent during cache tier flushing and large backfills,
which can put very heavy load on the write buffers, thus increasing
the probability of one of these events.

In reality, the OSD is stuck in the kernel, trying to allocate buffers
to build a TCP packet to answer the network message. As soon as the
buffers are flushed (which can take a while), the OSD recovers, but
now has to deal with being marked down in the monitor maps.
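
A rough way to tell after the fact that an OSD was hit by this, rather than
actually crashing, is to look for the OSD's own complaint about having been
marked down; something along these lines (log path and exact wording may
differ on your setup):

grep "wrongly marked me down" /var/log/ceph/ceph-osd.*.log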

The following SystemTap script changes the kernel behavior to skip the lock-waiting:

probe module("xfs").function("xfs_reclaim_inodes_ag").call {
  # Mask the sync flags down to the trylock bit, so reclaim never falls
  # back to the blocking, wait-for-the-AG-lock pass.
  $flags = $flags & 2
}

Save it to a file, and run it with 'stap -v -g -d kernel
--suppress-time-limits <file.stp>'. We've been running this for a
few weeks, and the issue is completely gone.
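
For reference, what those stap options do (the .stp filename above is just a
placeholder):

# -v                      verbose compile/load output
# -g                      guru mode, needed to modify target variables
#                         such as $flags
# -d kernel               pull in kernel symbol information for the probe
# --suppress-time-limits  don't abort probe handlers for running too long
stap -v -g -d kernel --suppress-time-limits <file.stp>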

There was a writeup on the XFS mailing list a while ago about the same
issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ), but
unfortunately it didn't result in consensus on a patch. This problem
won't exist in BlueStore, so we consider the systemtap approach a
workaround until we're ready to deploy BlueStore.

- Thorvald
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Workaround for XFS lockup resulting in down OSDs

2017-02-08 Thread Thorvald Natvig
Hi,

We've encountered this on both 4.4 and 4.8. It might have been there
earlier, but we have no data for that anymore.

While we haven't established causation, there's a high correlation with
kernel page allocation failures. If you see "[timestamp] : page
allocation failure: order:5,
mode:0x2082120(GFP_ATOMIC|__GFP_COLD|__GFP_MEMALLOC)" or similar in
dmesg, that indicates you're in a low-memory situation with high
memory pressure. Our first workaround was to keep an eye on
/proc/buddyinfo and to do

sync
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory

whenever the bucket for order 5 dropped below 10.
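
A minimal sketch of that loop, assuming it's the Normal zone that matters
and using an arbitrary 30-second interval (the threshold of 10 is the one
from above):

#!/bin/bash
# Watch the order-5 bucket in /proc/buddyinfo and flush/compact when it
# gets low. Zone choice, threshold and interval are assumptions.
while sleep 30; do
    # Per-order counts start at field 5 (order 0), so order 5 is field 10;
    # sum across all nodes' Normal zones.
    order5=$(awk '$4 == "Normal" { sum += $10 } END { print sum + 0 }' /proc/buddyinfo)
    if [ "$order5" -lt 10 ]; then
        sync
        echo 1 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
    fi
done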

A pretty dead giveaway is that e.g. SSH will start freezing up, and
your shell becomes laggy with pauses of up to a second. This happens when
you run a command, which allocates memory and hence runs into the same
situation: the kernel waits on another kernel thread that is flushing
buffers, despite having (in our case) about 100 GB of other memory that
could be released in an instant.

If you see timeouts happening while /proc/buddyinfo indicates you have
plenty of high-order memory available, then it's not this issue.

- Thorvald


On Wed, Feb 8, 2017 at 3:48 AM, Nick Fisk  wrote:
> Hi,
>
> I would also be interested in whether there is a way to determine if this is
> happening. I'm not sure if it's related, but when I updated a number of OSD
> nodes to kernel 4.7 from 4.4, I started seeing lots of random alerts from
> OSDs saying that other OSDs were not responding. The load wasn't particularly
> high on the cluster, but I couldn't determine why the OSDs were wrongly
> getting marked out; it was almost like they just hung for a while, which is
> remarkably similar to what you describe. The only difference in my case was
> that I wouldn't have said my cluster was particularly busy.
>
> I read through that XFS thread you posted and it seemed to suggest the
> problem appeared somewhere between the 3.x and 4.x kernels.
>
> As soon as I reverted to Kernel 4.4 (from Ubuntu) the problem stopped 
> immediately.
>
> Nick
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
>> van der Ster
>> Sent: 08 February 2017 11:08
>> To: Thorvald Natvig 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Workaround for XFS lockup resulting in down OSDs
>>
>> Hi,
>>
>> This is interesting. Do you have a bit more info about how to identify a
>> server which is suffering from this problem? Is there some process (xfs* or
>> kswapd?) we'll see as busy in top or iotop?
>>
>> Also, which kernel are you using?
>>
>> Cheers, Dan
>>
>>
>> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig  
>> wrote:
>> > [snip]
>

Re: [ceph-users] Workaround for XFS lockup resulting in down OSDs

2017-02-08 Thread Thorvald Natvig
Hi,

High-concurrency backfilling or flushing a cache tier triggers it
fairly reliably.

Setting backfills to >16 and switching from hammer to jewel tunables
(which moves most of the data) will trigger this, as will going in the
opposite direction.
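
For example, something along these lines (a sketch; this remaps and moves
most of the data, so only do it on a cluster you can afford to stress):

# Raise the per-OSD backfill limit well above the default
ceph tell osd.\* injectargs '--osd-max-backfills 16'
# Switching CRUSH tunables remaps (and therefore moves) most of the data
ceph osd crush tunables jewel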

The nodes where we observed this most commonly are on our very-cold
tier, which has 1x E5-2667v4, 256GB memory, 31x 8TB HDD, 2 x NVMe
journals, 40GbE networking, running 10.2.5 on Ubuntu 16.04 with the
HWE kernel (4.8).

We managed to force the issue to appear on hot tier nodes (SSD-only
nodes) by using SSD-on-SSD tiering, filling up the cache with dirty
data and then setting the target_bytes to 1/10th of the current usage.
That said, the pure-SSD OSDs didn't lock up for more than a few hundred
ms at a time, since the XFS flush happens fairly quickly on SSDs,
which releases the mutex and allows the kernel to continue. So the
hang wasn't long enough to cause lots of OSDs to drop out.
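
Roughly like this (pool name and byte count are placeholders; the knob is
the cache pool's target_max_bytes):

# Note the cache pool's current usage, then set the flush target to ~1/10th of it
ceph df detail
ceph osd pool set hot-cache target_max_bytes 100000000000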

- Thorvald


On Wed, Feb 8, 2017 at 11:03 AM, Shinobu Kinjo  wrote:
> On Wed, Feb 8, 2017 at 8:07 PM, Dan van der Ster  wrote:
>> Hi,
>>
>> This is interesting. Do you have a bit more info about how to identify
>> a server which is suffering from this problem? Is there some process
>> (xfs* or kswapd?) we'll see as busy in top or iotop.
>
> That's my question as well. If you would be able to reproduce the
> issue intentionally, it would be very helpful.
>
> And it would also be helpful if you could tell us a bit more about your
> cluster environment.
>
>>
>> Also, which kernel are you using?
>>
>> Cheers, Dan
>>
>>
>> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig  
>> wrote:
>>> [snip]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com