[Bug 1900438] Re: Bcache bypasse writeback on caching device with fragmentation

Nivedita Singhvi Fri, 26 Mar 2021 01:16:33 -0700

** Description changed:

  SRU Justification:
  
  [Impact]
- This bug in bcache [insert correct area] affects I/O performance on all 
versions of the kernel [correct versions affected]. It is particularly negative 
on ceph if used with bcache.
+ This bug in bcache affects I/O performance on all versions of the kernel 
[correct versions affected]. It is particularly negative on ceph if used with 
bcache.
  
  Write I/O latency would suddenly go to around 1 second from around 10 ms
  when hitting this issue and would easily be stuck there for hours or
  even days, especially bad for ceph on bcache architecture. This would
  make ceph extremely slow and make the entire cloud almost unusable.
  
  The root cause is that the dirty buckets had reached the 70 percent
  threshold, thus causing all writes to go directly to the backing HDD
  device. That might be fine if the cache actually held a lot of dirty
  data, but this happens when dirty data has not even reached 10 percent,
  due to high memory fragmentation. What makes it worse is that the
  writeback rate might still be at its minimum value (8) because
  writeback_percent has not been reached, so it takes ages for bcache to
  reclaim enough dirty buckets to get itself out of this situation.
  
  [Fix]
  
  * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the
  fragmentation when update the writeback rate”
  
- The current way to calculate the writeback rate only considered the dirty 
sectors. 
+ The current way to calculate the writeback rate only considered the dirty 
sectors.
  This usually works fine when memory fragmentation is not high, but it gives
  an unreasonably low writeback rate when a few dirty sectors have consumed
  a lot of dirty buckets. In some cases, the dirty buckets reached
  CUTOFF_WRITEBACK_SYNC (i.e., stopped writeback) while the dirty data
  (sectors) had not even reached the writeback_percent threshold (i.e.,
  started writeback). In that situation, the writeback rate stays at the
  minimum value (8*512 = 4KB/s), so all writes get stuck in non-writeback
  mode because of the slow writeback.
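
To make the arithmetic above concrete, here is a minimal Python sketch. Only
the 70 percent cutoff and the minimum rate of 8 sectors per second come from
the description; the cache geometry (1000 buckets of 1024 sectors) and the
amount of dirty data per bucket are hypothetical values picked for
illustration, and the function names are not taken from the kernel.

```python
SECTOR_BYTES = 512
MIN_RATE_SECTORS = 8          # minimum writeback rate, sectors per second
CUTOFF_WRITEBACK_SYNC = 70    # percent of buckets, from the description


def dirty_percentages(total_buckets, bucket_sectors, dirty_buckets, dirty_sectors):
    """Return (percent of buckets that are dirty, percent of capacity that is dirty)."""
    buckets_pct = 100 * dirty_buckets / total_buckets
    data_pct = 100 * dirty_sectors / (total_buckets * bucket_sectors)
    return buckets_pct, data_pct


def seconds_to_drain(dirty_sectors, rate_sectors_per_sec=MIN_RATE_SECTORS):
    """Time needed to write back all dirty sectors at a constant rate."""
    return dirty_sectors / rate_sectors_per_sec


# Hypothetical cache: 700 of 1000 buckets are dirty, but each dirty bucket
# holds only 128 dirty sectors out of 1024.
dirty_sectors = 700 * 128
buckets_pct, data_pct = dirty_percentages(1000, 1024, 700, dirty_sectors)
hit_cutoff = buckets_pct >= CUTOFF_WRITEBACK_SYNC
min_rate_bytes = MIN_RATE_SECTORS * SECTOR_BYTES
drain_seconds = seconds_to_drain(dirty_sectors)
```

In this made-up case the bucket cutoff is hit while less than 9 percent of
the cache capacity is dirty, and draining that data at the minimum rate
would take roughly three hours even for such a small cache.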
  
  We accelerate the rate in 3 stages with different aggressiveness:
- the first stage starts when dirty buckets percent reach above 
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), 
+ the first stage starts when dirty buckets percent reach above 
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
  the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
- the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). 
+ the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
  
  By default the first stage tries to writeback the amount of dirty data
  in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
  the second stage tries to writeback the amount of dirty data in one bucket
  in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, the third
  stage tries to writeback the amount of dirty data in one bucket in
  (1 / (dirty_buckets_percent - 64)) milliseconds.
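
The three stages can be sketched as a small Python function. This is an
illustration of the formulas in the paragraph above, not the kernel code:
the function names are made up, the scale factors 1, 10 and 1000 are read
off the units (seconds, 100 milliseconds, milliseconds), and the handling
of a percentage that sits exactly on a threshold is an assumption.

```python
THRESHOLD_LOW = 50    # BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW
THRESHOLD_MID = 57    # BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID
THRESHOLD_HIGH = 64   # BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH

# Scale factor per stage, inferred from the units in the description.
STAGE_SCALE = {"low": 1, "mid": 10, "high": 1000}


def stage_for(dirty_buckets_percent):
    """Return (stage name, stage threshold), or None below the first stage."""
    if dirty_buckets_percent <= THRESHOLD_LOW:
        return None
    if dirty_buckets_percent <= THRESHOLD_MID:   # boundary choice is assumed
        return "low", THRESHOLD_LOW
    if dirty_buckets_percent <= THRESHOLD_HIGH:  # boundary choice is assumed
        return "mid", THRESHOLD_MID
    return "high", THRESHOLD_HIGH


def seconds_per_bucket(dirty_buckets_percent):
    """Target time to write back one bucket's worth of dirty data, in seconds."""
    stage = stage_for(dirty_buckets_percent)
    if stage is None:
        return None  # fragmentation-based acceleration is not active
    name, threshold = stage
    return 1.0 / (STAGE_SCALE[name] * (dirty_buckets_percent - threshold))
```

For example, seconds_per_bucket(51) is one second, seconds_per_bucket(58)
is 100 milliseconds, and seconds_per_bucket(65) is one millisecond, so each
stage starts well above the pace of the previous one.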
  
  The initial rate at each stage can be controlled by 3 configurable
- parameters: 
+ parameters:
  
  writeback_rate_fp_term_{low|mid|high}
  
  They are by default 1, 10, 1000, chosen based on testing and production
  data, detailed below.
  
  A. When it comes to the low stage, it is still far from the 70%
-    threshold, so we only want to give it a little bit push by setting the
-    term to 1, it means the initial rate will be 170 if the fragment is 6,
-    it is calculated by bucket_size/fragment, this rate is very small,
-    but still much more reasonable than the minimum 8.
-    For a production bcache with non-heavy workload, if the cache device
-    is bigger than 1 TB, it may take hours to consume 1% buckets,
-    so it is very possible to reclaim enough dirty buckets in this stage,
-    thus to avoid entering the next stage.
+    threshold, so we only want to give it a little bit push by setting the
+    term to 1, it means the initial rate will be 170 if the fragment is 6,
+    it is calculated by bucket_size/fragment, this rate is very small,
+    but still much more reasonable than the minimum 8.
+    For a production bcache with non-heavy workload, if the cache device
+    is bigger than 1 TB, it may take hours to consume 1% buckets,
+    so it is very possible to reclaim enough dirty buckets in this stage,
+    thus to avoid entering the next stage.
  
  B. If the dirty buckets ratio didn’t turn around during the first stage,
-    it comes to the mid stage, then it is necessary for mid stage
-    to be more aggressive than low stage, so the initial rate is chosen
-    to be 10 times more than the low stage, which means 1700 as the initial
-    rate if the fragment is 6. This is a normal rate
-    we usually see for a normal workload when writeback happens
-    because of writeback_percent.
+    it comes to the mid stage, then it is necessary for mid stage
+    to be more aggressive than low stage, so the initial rate is chosen
+    to be 10 times more than the low stage, which means 1700 as the initial
+    rate if the fragment is 6. This is a normal rate
+    we usually see for a normal workload when writeback happens
+    because of writeback_percent.
  
  C. If the dirty buckets ratio didn't turn around during the low and mid
-    stages, it comes to the third stage, and it is the last chance that
-    we can turn around to avoid the horrible cutoff writeback sync issue,
-    then we choose 100 times more aggressive than the mid stage, that
-    means 170000 as the initial rate if the fragment is 6. This is also
-    inferred from a production bcache, I've got one week's writeback rate
-    data from a production bcache which has quite heavy workloads,
-    again, the writeback is triggered by the writeback percent,
-    the highest rate area is around 100000 to 240000, so I believe this
-    kind aggressiveness at this stage is reasonable for production.
-    And it should be mostly enough because the hint is trying to reclaim
-    1000 bucket per second, and from that heavy production env,
-    it is consuming 50 buckets per second on average in one week's data.
+    stages, it comes to the third stage, and it is the last chance that
+    we can turn around to avoid the horrible cutoff writeback sync issue,
+    then we choose 100 times more aggressive than the mid stage, that
+    means 170000 as the initial rate if the fragment is 6. This is also
+    inferred from a production bcache, I've got one week's writeback rate
+    data from a production bcache which has quite heavy workloads,
+    again, the writeback is triggered by the writeback percent,
+    the highest rate area is around 100000 to 240000, so I believe this
+    kind aggressiveness at this stage is reasonable for production.
+    And it should be mostly enough because the hint is trying to reclaim
+    1000 bucket per second, and from that heavy production env,
+    it is consuming 50 buckets per second on average in one week's data.
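
The initial rates quoted in A, B and C can be reproduced with a short Python
sketch. It is a simplified model, not the kernel implementation: the
function names are hypothetical, the bucket size of 1024 sectors is an
assumption chosen because 1024 / 6 rounds down to the quoted 170, and the
boundary handling between stages is assumed as well.

```python
THRESHOLD_LOW, THRESHOLD_MID, THRESHOLD_HIGH = 50, 57, 64

# Defaults of writeback_rate_fp_term_{low|mid|high} from the description.
FP_TERMS = {"low": 1, "mid": 10, "high": 1000}


def fragment_rate(bucket_sectors, fragment, dirty_buckets_percent, terms=FP_TERMS):
    """Fragmentation-based writeback rate in sectors per second (0 if inactive)."""
    if dirty_buckets_percent <= THRESHOLD_LOW:
        return 0
    if dirty_buckets_percent <= THRESHOLD_MID:
        term, threshold = terms["low"], THRESHOLD_LOW
    elif dirty_buckets_percent <= THRESHOLD_HIGH:
        term, threshold = terms["mid"], THRESHOLD_MID
    else:
        term, threshold = terms["high"], THRESHOLD_HIGH
    # Average dirty sectors per dirty bucket: bucket_size / fragment.
    avg_dirty_per_bucket = bucket_sectors // fragment
    return avg_dirty_per_bucket * term * (dirty_buckets_percent - threshold)


def buckets_reclaimed_per_second(bucket_sectors, fragment, dirty_buckets_percent):
    """How many average dirty buckets the rate above cleans per second."""
    rate = fragment_rate(bucket_sectors, fragment, dirty_buckets_percent)
    return rate // (bucket_sectors // fragment)


BUCKET_SECTORS = 1024  # assumed bucket size (512 KiB)
low_rate = fragment_rate(BUCKET_SECTORS, 6, 51)
mid_rate = fragment_rate(BUCKET_SECTORS, 6, 58)
high_rate = fragment_rate(BUCKET_SECTORS, 6, 65)
high_buckets = buckets_reclaimed_per_second(BUCKET_SECTORS, 6, 65)
```

With these assumptions the three initial rates come out as 170, 1700 and
170000 sectors per second, and the high stage starts at about 1000 buckets
reclaimed per second, compared with the 50 buckets per second consumed in
the heavy production environment mentioned in C.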
  
  The writeback_consider_fragment option controls whether this feature
  is on or off; it is on by default.
- 
  
  [Test Case]
  
  I’ve put all my testing results in the Google document below; the testing
  clearly shows a significant performance improvement.
  
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
  
  As a further test, we built a test kernel based on bionic
  4.15.0-99.100 plus the patch and put it in a production environment,
  an OpenStack environment with Ceph on bcache as the storage. It has
  run for more than one month without showing any issue.
  
  [Regression Potential]
  
  The patch only updates the writeback rate, so it won’t have any impact
  on data safety. The only potential regression I can think of is that
  the backing device might be a bit busier after the dirty buckets reach
  BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50% by default), since the
  writeback rate is accelerated in this highly fragmented situation, but
  that is the cost of keeping writes from hitting the writeback cutoff
  sync threshold.


-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1900438

Title:
  Bcache bypasse writeback on caching device with fragmentation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1900438/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
