On Wed, 15 Aug 2012, Kent Overstreet wrote:

> > Both md and dm use __GFP_WAIT allocations from mempools in 
> > generic_make_request.
> > 
> > I think you found an interesting bug here. Suppose that we have three 
> > stacked devices: d1 depends on d2 and d2 depends on d3.
> > 
> > Now, a bio b1 comes to d1. d1 splits it to two bios: b2.1 and b2.2 and 
> > sends them to the device d2 - these bios end up in current->bio_list. The 
> > driver for d2 receives bio b2.1 and sends bio b3.1 to d3. Now, 
> > current->bio_list contains bios b2.2, b3.1. Now, bio b2.2 is popped off 
> > the bio list and the driver for d2 is called with b2.2 - suppose that for 
> > some reason mempool in d2 is exhausted and the driver needs to wait until 
> > b2.1 finishes. b2.1 never finishes, because b2.1 depends on b3.1 and b3.1 
> > is still in current->bio_list. So it deadlocks.
> > 
> > Turning off __GFP_WAIT fixes nothing - it just turns one bug (a possible 
> > deadlock) into another bug (a possible bio failure with -ENOMEM).
> > 
> > Increasing mempool sizes doesn't fix it either, the bio would just have to 
> > be split to more pieces in the above example to make it deadlock.
> > 
> > I think the above possible deadlock scenario could be fixed by reversing 
> > current->bio_list processing - i.e. when some device's make_request_fn 
> > adds some bios to current->bio_list, these bios are processed before other 
> > bios that were on the list before. This way, bio list would contain "b3.1, 
> > b2.2" instead of "b2.2, b3.1" in the above example and the deadlock would 
> > not happen.
> 
> Your patch isn't sufficient in the case where a bio may be split
> multiple times (I'm not sure if it's sufficient in the case where bios
> are only split once; trying to work it all out makes my head hurt).
> 
> You don't need multiple stacked drivers to see this; just the case where
> a single driver is running that splits multiple times is sufficient, if
> you have enough threads submitting at the same time.

That is true. dm splits one bio to multiple bios, so it could still 
deadlock.

Mikulas

> Bcache works around this with the trick I mentioned previously, where it
> masks out _GFP_WAIT if current->bio_list != NULL, and punts to workqueue
> if the allocation fails.
> 
> This works but it'd have to be done in each stacking driver... it's not
> a generic solution, and it's a pain in the ass.
> 
> I came up with another idea the other day. Conceptually, it inverts my
> previous workaround - the punting to workqueue is done in the allocation
> code when necessary, for the bios that would be blocked.
> 
> It's lightly tested, gonna rig up some kind of fault injection and test
> it more thoroughly later.
> 
> commit d61bbb074cc8f2e089eb57e2bee8e84500f390a8
> Author: Kent Overstreet <koverstr...@google.com>
> Date:   Mon Aug 13 18:11:01 2012 -0700
> 
>     block: Avoid deadlocks with bio allocation by stacking drivers
>     
>     Previously, if we ever try to allocate more than once from the same bio
>     set while running under generic_make_request(), we risk deadlock.
>     
>     This would happen if e.g. a bio ever needed to be split more than once,
>     and it's difficult to handle correctly in the drivers - so in practice
>     it's not.
>     
>     This patch fixes this issue by allocating a rescuer workqueue for each
>     bio_set, and punting queued bios to said rescuer when necessary:
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index bc4265a..7b4f655 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -281,6 +281,23 @@ void bio_reset(struct bio *bio)
>  }
>  EXPORT_SYMBOL(bio_reset);
>  
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> +     struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> +     struct bio *bio;
> +
> +     while (1) {
> +             spin_lock(&bs->rescue_lock);
> +             bio = bio_list_pop(&bs->rescue_list);
> +             spin_unlock(&bs->rescue_lock);
> +
> +             if (!bio)
> +                     break;
> +
> +             generic_make_request(bio);
> +     }
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -294,6 +311,7 @@ EXPORT_SYMBOL(bio_reset);
>   **/
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set 
> *bs)
>  {
> +     gfp_t saved_gfp = gfp_mask;
>       unsigned front_pad;
>       unsigned inline_vecs;
>       unsigned long idx = BIO_POOL_NONE;
> @@ -308,16 +326,39 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int 
> nr_iovecs, struct bio_set *bs)
>               p = kmalloc(sizeof(struct bio) +
>                           nr_iovecs * sizeof(struct bio_vec),
>                           gfp_mask);
> +
>               front_pad = 0;
>               inline_vecs = nr_iovecs;
>       } else {
> -             p = mempool_alloc(bs->bio_pool, gfp_mask);
> +             /*
> +              * If we're running under generic_make_request()
> +              * (current->bio_list != NULL), we risk deadlock if we sleep on
> +              * allocation and there's already bios on current->bio_list that
> +              * were allocated from the same bio_set; they won't be submitted
> +              * (and thus freed) as long as we're blocked here.
> +              *
> +              * To deal with this, we first try the allocation without using
> +              * the mempool; if that fails, we punt all the bios on
> +              * current->bio_list to a different thread and then retry the
> +              * allocation with the original gfp mask.
> +              */
> +
> +             if (current->bio_list &&
> +                 !bio_list_empty(current->bio_list) &&
> +                 (gfp_mask & __GFP_WAIT))
> +                     gfp_mask &= GFP_ATOMIC;
> +retry:
> +             if (gfp_mask & __GFP_WAIT)
> +                     p = mempool_alloc(bs->bio_pool, gfp_mask);
> +             else
> +                     p = kmem_cache_alloc(bs->bio_slab, gfp_mask);
> +
>               front_pad = bs->front_pad;
>               inline_vecs = BIO_INLINE_VECS;
>       }
>  
>       if (unlikely(!p))
> -             return NULL;
> +             goto err;
>  
>       bio = p + front_pad;
>       bio_init(bio);
> @@ -338,6 +379,19 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int 
> nr_iovecs, struct bio_set *bs)
>  
>  err_free:
>       mempool_free(p, bs->bio_pool);
> +err:
> +     if (gfp_mask != saved_gfp) {
> +             gfp_mask = saved_gfp;
> +
> +             spin_lock(&bs->rescue_lock);
> +             bio_list_merge(&bs->rescue_list, current->bio_list);
> +             bio_list_init(current->bio_list);
> +             spin_unlock(&bs->rescue_lock);
> +
> +             queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +             goto retry;
> +     }
> +
>       return NULL;
>  }
>  EXPORT_SYMBOL(bio_alloc_bioset);
> @@ -1546,6 +1600,9 @@ static void biovec_free_pools(struct bio_set *bs)
>  
>  void bioset_free(struct bio_set *bs)
>  {
> +     if (bs->rescue_workqueue)
> +             destroy_workqueue(bs->rescue_workqueue);
> +
>       if (bs->bio_pool)
>               mempool_destroy(bs->bio_pool);
>  
> @@ -1581,6 +1638,10 @@ struct bio_set *bioset_create(unsigned int pool_size, 
> unsigned int front_pad)
>  
>       bs->front_pad = front_pad;
>  
> +     spin_lock_init(&bs->rescue_lock);
> +     bio_list_init(&bs->rescue_list);
> +     INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> +
>       bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
>       if (!bs->bio_slab) {
>               kfree(bs);
> @@ -1591,9 +1652,14 @@ struct bio_set *bioset_create(unsigned int pool_size, 
> unsigned int front_pad)
>       if (!bs->bio_pool)
>               goto bad;
>  
> -     if (!biovec_create_pools(bs, pool_size))
> -             return bs;
> +     if (biovec_create_pools(bs, pool_size))
> +             goto bad;
> +
> +     bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> +     if (!bs->rescue_workqueue)
> +             goto bad;
>  
> +     return bs;
>  bad:
>       bioset_free(bs);
>       return NULL;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b22c22b..ba5b52e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -290,39 +290,6 @@ static inline int bio_associate_current(struct bio *bio) 
> { return -ENOENT; }
>  static inline void bio_disassociate_task(struct bio *bio) { }
>  #endif       /* CONFIG_BLK_CGROUP */
>  
> -/*
> - * bio_set is used to allow other portions of the IO system to
> - * allocate their own private memory pools for bio and iovec structures.
> - * These memory pools in turn all allocate from the bio_slab
> - * and the bvec_slabs[].
> - */
> -#define BIO_POOL_SIZE 2
> -#define BIOVEC_NR_POOLS 6
> -#define BIOVEC_MAX_IDX       (BIOVEC_NR_POOLS - 1)
> -
> -struct bio_set {
> -     struct kmem_cache *bio_slab;
> -     unsigned int front_pad;
> -
> -     mempool_t *bio_pool;
> -#if defined(CONFIG_BLK_DEV_INTEGRITY)
> -     mempool_t *bio_integrity_pool;
> -#endif
> -     mempool_t *bvec_pool;
> -};
> -
> -struct biovec_slab {
> -     int nr_vecs;
> -     char *name;
> -     struct kmem_cache *slab;
> -};
> -
> -/*
> - * a small number of entries is fine, not going to be performance critical.
> - * basically we just need to survive
> - */
> -#define BIO_SPLIT_ENTRIES 2
> -
>  #ifdef CONFIG_HIGHMEM
>  /*
>   * remember never ever reenable interrupts between a bvec_kmap_irq and
> @@ -497,6 +464,48 @@ static inline struct bio *bio_list_get(struct bio_list 
> *bl)
>       return bio;
>  }
>  
> +/*
> + * bio_set is used to allow other portions of the IO system to
> + * allocate their own private memory pools for bio and iovec structures.
> + * These memory pools in turn all allocate from the bio_slab
> + * and the bvec_slabs[].
> + */
> +#define BIO_POOL_SIZE 2
> +#define BIOVEC_NR_POOLS 6
> +#define BIOVEC_MAX_IDX       (BIOVEC_NR_POOLS - 1)
> +
> +struct bio_set {
> +     struct kmem_cache *bio_slab;
> +     unsigned int front_pad;
> +
> +     mempool_t *bio_pool;
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> +     mempool_t *bio_integrity_pool;
> +#endif
> +     mempool_t *bvec_pool;
> +
> +     /*
> +      * Deadlock avoidance for stacking block drivers: see comments in
> +      * bio_alloc_bioset() for details
> +      */
> +     spinlock_t              rescue_lock;
> +     struct bio_list         rescue_list;
> +     struct work_struct      rescue_work;
> +     struct workqueue_struct *rescue_workqueue;
> +};
> +
> +struct biovec_slab {
> +     int nr_vecs;
> +     char *name;
> +     struct kmem_cache *slab;
> +};
> +
> +/*
> + * a small number of entries is fine, not going to be performance critical.
> + * basically we just need to survive
> + */
> +#define BIO_SPLIT_ENTRIES 2
> +
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
>  
>  #define bip_vec_idx(bip, idx)        (&(bip->bip_vec[(idx)]))
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to