On 03.03.2017 06:14, NeilBrown wrote:
> 
> [ Hi Jens,
>   you might have seen assorted email threads recently about
>   deadlocks, particularly in dm-snap or md/raid1/10.  Also about
>   the excess of rescuer threads.
>   I think a big part of the problem is my ancient improvement
>   to generic_make_request to queue bios and handle them in
>   a strict FIFO order.  As described below, that can cause
>   problems which individual block devices cannot fix themselves
>   without punting to various threads.
>   This patch does not fix everything, but provides a basis that
>   drivers can build on to create deadlock-free solutions without
>   excess threads.
>   If you accept this, I will look into improving at least md
>   and bio_alloc_set() to be less dependent on rescuer threads.
>   Thanks,
>   NeilBrown
>  ]
> 
> 
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling.  They will be handled when the
> make_request_fn for the current bio completes.
> 
> If any bios are submitted by a make_request_fn, these will ultimately
> be handled sequentially.  If the handling of one of those generates
> further requests, they will be added to the end of the queue.
> 
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete.  This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device.  Both md and dm have examples where this happens.
> 
> These deadlocks can be eradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order.  That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent.  That way, when generic_make_request() calls make_request_fn
> for some particular device, we can be certain that all previously
> submitted requests for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
> 
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue.  However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn.  After it
> completes, any bios for a different device are placed on the front of
> the main queue, followed by any bios for the same device, followed by
> all bios that were already on the queue before the make_request_fn was
> called.  This provides the depth-first approach without reordering
> bios on the same level.
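
The reordering described above is easy to simulate outside the kernel.
The following is a user-space sketch, not the kernel code: struct bio,
struct bio_list and the helpers are minimal stand-ins modelled on the
kernel's, with the request_queue reduced to an integer id.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct bio: just remembers its target queue. */
struct bio { int q; struct bio *next; };

/* Singly-linked FIFO modelled on the kernel's struct bio_list. */
struct bio_list { struct bio *head, *tail; };

static void bio_list_init(struct bio_list *bl)
{
	bl->head = bl->tail = NULL;
}

static void bio_list_add(struct bio_list *bl, struct bio *b)
{
	b->next = NULL;
	if (bl->tail)
		bl->tail->next = b;
	else
		bl->head = b;
	bl->tail = b;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *b = bl->head;

	if (b) {
		bl->head = b->next;
		if (!bl->head)
			bl->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Append all of src to the tail of dst, leaving src empty. */
static void bio_list_merge(struct bio_list *dst, struct bio_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	src->head = src->tail = NULL;
}

/*
 * After one make_request_fn call: bios for lower-level queues go
 * first, then bios resubmitted to the same queue, then everything
 * that was already pending before the call ("hold").
 */
static void reorder(struct bio_list *on_stack, struct bio_list *hold, int cur_q)
{
	struct bio_list lower, same;
	struct bio *b;

	bio_list_init(&lower);
	bio_list_init(&same);
	while ((b = bio_list_pop(on_stack)) != NULL)
		bio_list_add(b->q == cur_q ? &same : &lower, b);
	bio_list_merge(on_stack, &lower);
	bio_list_merge(on_stack, &same);
	bio_list_merge(on_stack, hold);
}
```

With one bio on hold and two freshly generated bios (one for the same
queue, one for a lower-level one), the lower-level bio pops first, then
the same-level one, then the previously pending one.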
> 
> This, by itself, is not enough to remove the deadlocks.  It just makes
> it possible for drivers to take the extra step required themselves.
> 
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request.  This includes never
> allocating from a mempool twice in one call to a make_request_fn.
> 
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return.  The new code in generic_make_request will ensure the
> that bios submitted to underlying devices are processed first, then
> the second bio that was split off.  If it splits again, the same
> process happens.  In each case one bio will be completely handled
> before the next one is attempted.
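
The split-once pattern can likewise be sketched in user space.
Everything below is hypothetical (the sim_bio type, the CHUNK limit,
the bare FIFO standing in for generic_make_request's list); it only
illustrates that re-queueing the remainder instead of looping still
processes every piece, one complete piece at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK 8 /* hypothetical per-request limit, in sectors */

struct sim_bio { int sectors; struct sim_bio *next; };

/* FIFO standing in for the generic_make_request() bio list. */
static struct sim_bio *qhead, *qtail;

static void submit(struct sim_bio *b)
{
	b->next = NULL;
	if (qtail)
		qtail->next = b;
	else
		qhead = b;
	qtail = b;
}

static struct sim_bio *pop(void)
{
	struct sim_bio *b = qhead;

	if (b) {
		qhead = b->next;
		if (!qhead)
			qtail = NULL;
	}
	return b;
}

/* Record of how many sectors each handled piece carried. */
static int handled[16];
static int nhandled;

/* Recommended pattern: split once, re-queue the remainder, return. */
static void make_request(struct sim_bio *b)
{
	if (b->sectors > CHUNK) {
		struct sim_bio *rest = malloc(sizeof(*rest));

		rest->sectors = b->sectors - CHUNK;
		b->sectors = CHUNK;
		submit(rest);	/* handled only after this piece is done */
	}
	handled[nhandled++] = b->sectors;
}

/* Drive the simulation: drain the queue one bio at a time. */
static void run(struct sim_bio *b)
{
	submit(b);
	while ((b = pop()) != NULL)
		make_request(b);	/* pieces are leaked; fine for a demo */
}
```

A 20-sector bio is handled as three complete pieces of 8, 8 and 4
sectors, with no driver-side loop.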
> 
> With this in place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
> 
> Tested-by: Jinpu Wang <jinpu.w...@profitbricks.com>
> Inspired-by: Lars Ellenberg <lars.ellenb...@linbit.com>
> Signed-off-by: NeilBrown <ne...@suse.com>
> ---
>  block/blk-core.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>               struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>               if (likely(blk_queue_enter(q, false) == 0)) {
> +                     struct bio_list hold;
> +                     struct bio_list lower, same;
> +
> +                     /* Create a fresh bio_list for all subordinate requests */
> +                     bio_list_init(&hold);
> +                     bio_list_merge(&hold, &bio_list_on_stack);
> +                     bio_list_init(&bio_list_on_stack);
>                       ret = q->make_request_fn(q, bio);
>  
>                       blk_queue_exit(q);
>  
> +                     /* sort new bios into those for a lower level
> +                      * and those for the same level
> +                      */
> +                     bio_list_init(&lower);
> +                     bio_list_init(&same);
> +                     while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +                             if (q == bdev_get_queue(bio->bi_bdev))
> +                                     bio_list_add(&same, bio);
> +                             else
> +                                     bio_list_add(&lower, bio);
> +                     /* now assemble so we handle the lowest level first */
> +                     bio_list_merge(&bio_list_on_stack, &lower);
> +                     bio_list_merge(&bio_list_on_stack, &same);
> +                     bio_list_merge(&bio_list_on_stack, &hold);
> +
>                       bio = bio_list_pop(current->bio_list);
>               } else {
>                       struct bio *bio_next = bio_list_pop(current->bio_list);
> 

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign the bio_list struct hold directly instead of init-and-merge
- remove duplicated code (the bio_list_pop of current->bio_list in both branches)

I think it is better to squash this into your fix.
---
 block/blk-core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
                        struct bio_list lower, same, hold;
 
                        /* Create a fresh bio_list for all subordinate requests */
-                       bio_list_init(&hold);
-                       bio_list_merge(&hold, &bio_list_on_stack);
+                       hold = bio_list_on_stack;
                        bio_list_init(&bio_list_on_stack);
 
                        ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
                        bio_list_merge(&bio_list_on_stack, &lower);
                        bio_list_merge(&bio_list_on_stack, &same);
                        bio_list_merge(&bio_list_on_stack, &hold);
-
-                       bio = bio_list_pop(current->bio_list);
                } else {
-                       struct bio *bio_next = bio_list_pop(current->bio_list);
-
                        bio_io_error(bio);
-                       bio = bio_next;
                }
+               bio = bio_list_pop(current->bio_list);
        } while (bio);
        current->bio_list = NULL; /* deactivate */
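
The `hold = bio_list_on_stack;` simplification is safe because
struct bio_list is just a pair of head/tail pointers, so a plain struct
assignment (followed by re-initialising the source) is equivalent to
the init-and-merge pair it replaces.  A small user-space check with
stand-in types, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>

/* Two raw pointers, like struct bio_list: struct assignment copies both. */
struct item { struct item *next; };
struct list { struct item *head, *tail; };

static void list_init(struct list *l)
{
	l->head = l->tail = NULL;
}

/* Append src to dst and empty src, like bio_list_merge(). */
static void list_merge(struct list *dst, struct list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	src->head = src->tail = NULL;
}
```

Both sequences leave "hold" owning the same chain and the source list
empty, which is why the two hunks above are behaviour-preserving.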
 
-- 

Regards,
Jack 
