On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.hu...@intel.com>
> 
> When a swapin is performed, after getting the swap entry information
> from the page table, the system will swap in the swap entry without any
> lock held to prevent the swap device from being swapoff.  This may
> cause a race like the one below,
> 
> CPU 1                         CPU 2
> -----                         -----
>                               do_swap_page
>                                 swapin_readahead
>                                   __read_swap_cache_async
> swapoff                                     swapcache_prepare
>   p->swap_map = NULL                  __swap_duplicate
>                                         p->swap_map[?] /* !!! NULL pointer access */
> 
> Because swapoff is usually done only at system shutdown, the race
> may not hit many people in practice.  But it is still a race that
> needs to be fixed.
> 
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid by preventing the swap device from being
> swapoff, until put_swap_device() is called.
> 
> Because swapoff() is a very rare code path, to make the normal path run
> as fast as possible, RCU instead of a reference count is used to
> implement get/put_swap_device().  From get_swap_device() to
> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> swapoff() will wait until put_swap_device() is called.
> 
> In addition to swap_map, cluster_info, and the other data structures in
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache lookup
> and swapoff too.
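
The get/put scheme described above can be approximated in userspace as a
rough sketch. Everything named model_* below is hypothetical and is not
part of the patch. The sketch also assumes a plain atomic reader counter
as a stand-in for the RCU read-side critical section, which is exactly
the kind of refcount the patch avoids in the kernel, so it illustrates
only the ordering (clear the valid flag, wait for readers, then free),
not the cost model.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct swap_info_struct. */
struct model_swap_info {
	atomic_bool valid;        /* plays the role of the SWP_VALID flag */
	atomic_int readers;       /* stand-in for RCU read-side sections */
	unsigned long max;        /* number of entries in swap_map */
	unsigned char *swap_map;  /* released by model_swapoff_finish() */
};

/* Set up the data first, mark the device valid last. */
static bool model_swapon(struct model_swap_info *si, unsigned long max)
{
	si->swap_map = calloc(max, 1);
	if (!si->swap_map)
		return false;
	si->max = max;
	atomic_init(&si->readers, 0);
	atomic_init(&si->valid, true);
	return true;
}

/*
 * Shaped like get_swap_device(): returns NULL unless the device is
 * valid and the offset is in range; on success the caller must call
 * model_put() when done.
 */
static struct model_swap_info *model_get(struct model_swap_info *si,
					 unsigned long offset)
{
	atomic_fetch_add(&si->readers, 1);          /* "rcu_read_lock()" */
	if (!atomic_load(&si->valid) || offset >= si->max) {
		atomic_fetch_sub(&si->readers, 1);  /* "rcu_read_unlock()" */
		return NULL;
	}
	return si;
}

/* Shaped like put_swap_device(): ends the read-side section. */
static void model_put(struct model_swap_info *si)
{
	atomic_fetch_sub(&si->readers, 1);
}

/* First half of swapoff: stop new readers from getting the device. */
static void model_swapoff_begin(struct model_swap_info *si)
{
	atomic_store(&si->valid, false);
}

/*
 * Second half of swapoff: free only once no reader is left.  Returns
 * false if a reader is still active; the kernel would block in
 * synchronize_rcu() here instead of returning.
 */
static bool model_swapoff_finish(struct model_swap_info *si)
{
	if (atomic_load(&si->readers) != 0)
		return false;
	free(si->swap_map);
	si->swap_map = NULL;
	si->max = 0;
	return true;
}
```

In this model a caller that obtained the device before
model_swapoff_begin() can keep using swap_map until it calls
model_put(), and a caller that arrives afterwards gets NULL, which
mirrors the behaviour the description attributes to
get/put_swap_device().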
> 
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Tim Chen <tim.c.c...@linux.intel.com>
> Cc: Shaohua Li <s...@fb.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: "Jérôme Glisse" <jgli...@redhat.com>
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Andrea Arcangeli <aarca...@redhat.com>
> Cc: David Rientjes <rient...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Dave Jiang <dave.ji...@intel.com>
> Cc: Aaron Lu <aaron...@intel.com>
> Signed-off-by: "Huang, Ying" <ying.hu...@intel.com>
> 
> Changelog:
> 
> v4:
> 
> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>   normal paths further.

Hi Huang,

This version is much better than the old ones. To me, that is not because
of the RCU, SRCU, or refcount choice, but because it puts the swap device
dependency (i.e., get/put) into the swap-related functions themselves, so
users who are not interested in swap don't need to care about it. Good.

The problem is caused by freeing swap-related data structures
*dynamically*, while the old swap logic was based on static data
structures (i.e., never freed, with staleness verified afterwards).
So I reviewed some places that use PageSwapCache and swp_entry_t
and could therefore access swap-related data structures.

One example is __isolate_lru_page.

It calls page_mapping() to get an address_space.
What happens if the page is in the swap cache and races with swapoff?
The mapping it got could disappear because of the race. Right?
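
To make the question concrete, here is a small userspace sketch of the
suspected pattern. All model_* names are hypothetical stand-ins, not
kernel types or functions, and "freed" is only a flag so that the demo
stays well defined. Whether this interleaving can really happen in the
kernel depends on details the sketch does not model, such as what
swapoff has already done to the swap cache pages by that point.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the swap cache address_space. */
struct model_mapping {
	bool freed;  /* set instead of calling free(), to keep the demo safe */
};

/* Hypothetical stand-in for a swap device. */
struct model_swap_dev {
	bool valid;                  /* plays the role of SWP_VALID */
	struct model_mapping space;  /* per-device swap cache mapping */
};

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool in_swap_cache;          /* plays the role of PageSwapCache() */
	struct model_swap_dev *dev;
};

/* Unguarded lookup, shaped like page_mapping() on a swap cache page. */
static struct model_mapping *model_page_mapping(struct model_page *page)
{
	return page->in_swap_cache ? &page->dev->space : NULL;
}

/* Guarded lookup: refuses to hand out the mapping of an invalid device. */
static struct model_mapping *model_page_mapping_checked(struct model_page *page)
{
	if (!page->in_swap_cache || !page->dev->valid)
		return NULL;
	return &page->dev->space;
}

/* Model of swapoff tearing down the per-device swap cache. */
static void model_swapoff(struct model_swap_dev *dev)
{
	dev->valid = false;
	dev->space.freed = true;
}
```

In the model, a pointer obtained from model_page_mapping() before
model_swapoff() refers to a torn-down mapping afterwards, while the
checked variant returns NULL. A real fix would additionally have to keep
the read-side section open for as long as the mapping is in use.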

Thanks.

> 
> v3:
> 
> - Re-implemented with RCU to reduce the overhead of normal paths
> 
> v2:
> 
> - Re-implemented with SRCU to reduce the overhead of normal paths.
> 
> - Avoid checking whether the swap device has been swapoff in
>   get_swap_device(), because we can check the origin of the swap
>   entry to make sure the swap device hasn't been swapoff.
> ---
>  include/linux/swap.h |  11 ++++-
>  mm/memory.c          |   2 +-
>  mm/swap_state.c      |  16 +++++--
>  mm/swapfile.c        | 123 ++++++++++++++++++++++++++++++++++++++-------------
>  4 files changed, 116 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..f7e8f26cf07f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -172,8 +172,9 @@ enum {
>       SWP_PAGE_DISCARD = (1 << 9),    /* freed swap page-cluster discards */
>       SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
>       SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
> +     SWP_VALID       = (1 << 12),    /* swap is valid to be operated on? */
>                                       /* add others here before... */
> -     SWP_SCANNING    = (1 << 12),    /* refcount in scan_swap_map */
> +     SWP_SCANNING    = (1 << 13),    /* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
>  extern void exit_swap_address_space(unsigned int type);
> +extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +     rcu_read_unlock();
> +}
>  
>  #else /* CONFIG_SWAP */
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 1a969992f76b..77a7d6191218 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2909,7 +2909,7 @@ int do_swap_page(struct vm_fault *vmf)
>               struct swap_info_struct *si = swp_swap_info(entry);
>  
>               if (si->flags & SWP_SYNCHRONOUS_IO &&
> -                             __swap_count(si, entry) == 1) {
> +                 __swap_count(entry) == 1) {
>                       /* skip swapcache */
>                       page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>                                                       vmf->address);
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0b8ae361981f..8dde719e973c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -337,8 +337,13 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
>                              unsigned long addr)
>  {
>       struct page *page;
> +     struct swap_info_struct *si;
>  
> +     si = get_swap_device(entry);
> +     if (!si)
> +             return NULL;
>       page = find_get_page(swap_address_space(entry), swp_offset(entry));
> +     put_swap_device(si);
>  
>       INC_CACHE_INFO(find_total);
>       if (page) {
> @@ -376,8 +381,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                       struct vm_area_struct *vma, unsigned long addr,
>                       bool *new_page_allocated)
>  {
> -     struct page *found_page, *new_page = NULL;
> -     struct address_space *swapper_space = swap_address_space(entry);
> +     struct page *found_page = NULL, *new_page = NULL;
> +     struct swap_info_struct *si;
>       int err;
>       *new_page_allocated = false;
>  
> @@ -387,7 +392,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>                * called after lookup_swap_cache() failed, re-calling
>                * that would confuse statistics.
>                */
> -             found_page = find_get_page(swapper_space, swp_offset(entry));
> +             si = get_swap_device(entry);
> +             if (!si)
> +                     break;
> +             found_page = find_get_page(swap_address_space(entry),
> +                                        swp_offset(entry));
> +             put_swap_device(si);
>               if (found_page)
>                       break;
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 42fe5653814a..881515a59f95 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1107,6 +1107,41 @@ static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
>       return p;
>  }
>  
> +/*
> + * Check whether swap entry is valid in the swap device.  If so,
> + * return pointer to swap_info_struct, and keep the swap entry valid
> + * via preventing the swap device from being swapoff, until
> + * put_swap_device() is called.  Otherwise return NULL.
> + */
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +     struct swap_info_struct *si;
> +     unsigned long type, offset;
> +
> +     if (!entry.val)
> +             goto out;
> +     type = swp_type(entry);
> +     if (type >= nr_swapfiles)
> +             goto bad_nofile;
> +     si = swap_info[type];
> +
> +     rcu_read_lock();
> +     if (!(si->flags & SWP_VALID))
> +             goto unlock_out;
> +     offset = swp_offset(entry);
> +     if (offset >= si->max)
> +             goto unlock_out;
> +
> +     return si;
> +bad_nofile:
> +     pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> +out:
> +     return NULL;
> +unlock_out:
> +     rcu_read_unlock();
> +     return NULL;
> +}
> +
>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
>                                      swp_entry_t entry, unsigned char usage)
>  {
> @@ -1328,11 +1363,18 @@ int page_swapcount(struct page *page)
>       return count;
>  }
>  
> -int __swap_count(struct swap_info_struct *si, swp_entry_t entry)
> +int __swap_count(swp_entry_t entry)
>  {
> +     struct swap_info_struct *si;
>       pgoff_t offset = swp_offset(entry);
> +     int count = 0;
>  
> -     return swap_count(si->swap_map[offset]);
> +     si = get_swap_device(entry);
> +     if (si) {
> +             count = swap_count(si->swap_map[offset]);
> +             put_swap_device(si);
> +     }
> +     return count;
>  }
>  
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> @@ -1357,9 +1399,11 @@ int __swp_swapcount(swp_entry_t entry)
>       int count = 0;
>       struct swap_info_struct *si;
>  
> -     si = __swap_info_get(entry);
> -     if (si)
> +     si = get_swap_device(entry);
> +     if (si) {
>               count = swap_swapcount(si, entry);
> +             put_swap_device(si);
> +     }
>       return count;
>  }
>  
> @@ -2451,9 +2495,9 @@ static int swap_node(struct swap_info_struct *p)
>       return bdev ? bdev->bd_disk->node_id : NUMA_NO_NODE;
>  }
>  
> -static void _enable_swap_info(struct swap_info_struct *p, int prio,
> -                             unsigned char *swap_map,
> -                             struct swap_cluster_info *cluster_info)
> +static void setup_swap_info(struct swap_info_struct *p, int prio,
> +                         unsigned char *swap_map,
> +                         struct swap_cluster_info *cluster_info)
>  {
>       int i;
>  
> @@ -2478,7 +2522,11 @@ static void _enable_swap_info(struct swap_info_struct *p, int prio,
>       }
>       p->swap_map = swap_map;
>       p->cluster_info = cluster_info;
> -     p->flags |= SWP_WRITEOK;
> +}
> +
> +static void _enable_swap_info(struct swap_info_struct *p)
> +{
> +     p->flags |= SWP_WRITEOK | SWP_VALID;
>       atomic_long_add(p->pages, &nr_swap_pages);
>       total_swap_pages += p->pages;
>  
> @@ -2505,7 +2553,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>       frontswap_init(p->type, frontswap_map);
>       spin_lock(&swap_lock);
>       spin_lock(&p->lock);
> -      _enable_swap_info(p, prio, swap_map, cluster_info);
> +     setup_swap_info(p, prio, swap_map, cluster_info);
> +     spin_unlock(&p->lock);
> +     spin_unlock(&swap_lock);
> +     /*
> +      * Guarantee swap_map, cluster_info, etc. fields are used
> +      * between get/put_swap_device() only if SWP_VALID bit is set
> +      */
> +     synchronize_rcu();
> +     spin_lock(&swap_lock);
> +     spin_lock(&p->lock);
> +     _enable_swap_info(p);
>       spin_unlock(&p->lock);
>       spin_unlock(&swap_lock);
>  }
> @@ -2514,7 +2572,8 @@ static void reinsert_swap_info(struct swap_info_struct *p)
>  {
>       spin_lock(&swap_lock);
>       spin_lock(&p->lock);
> -     _enable_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +     setup_swap_info(p, p->prio, p->swap_map, p->cluster_info);
> +     _enable_swap_info(p);
>       spin_unlock(&p->lock);
>       spin_unlock(&swap_lock);
>  }
> @@ -2617,6 +2676,17 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  
>       reenable_swap_slots_cache_unlock();
>  
> +     spin_lock(&swap_lock);
> +     spin_lock(&p->lock);
> +     p->flags &= ~SWP_VALID;         /* mark swap device as invalid */
> +     spin_unlock(&p->lock);
> +     spin_unlock(&swap_lock);
> +     /*
> +      * wait for swap operations protected by get/put_swap_device()
> +      * to complete
> +      */
> +     synchronize_rcu();
> +
>       flush_work(&p->discard_work);
>  
>       destroy_swap_extents(p);
> @@ -3356,22 +3426,16 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  {
>       struct swap_info_struct *p;
>       struct swap_cluster_info *ci;
> -     unsigned long offset, type;
> +     unsigned long offset;
>       unsigned char count;
>       unsigned char has_cache;
>       int err = -EINVAL;
>  
> -     if (non_swap_entry(entry))
> +     p = get_swap_device(entry);
> +     if (!p)
>               goto out;
>  
> -     type = swp_type(entry);
> -     if (type >= nr_swapfiles)
> -             goto bad_file;
> -     p = swap_info[type];
>       offset = swp_offset(entry);
> -     if (unlikely(offset >= p->max))
> -             goto out;
> -
>       ci = lock_cluster_or_swap_info(p, offset);
>  
>       count = p->swap_map[offset];
> @@ -3417,11 +3481,9 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
>  unlock_out:
>       unlock_cluster_or_swap_info(p, ci);
>  out:
> +     if (p)
> +             put_swap_device(p);
>       return err;
> -
> -bad_file:
> -     pr_err("swap_dup: %s%08lx\n", Bad_file, entry.val);
> -     goto out;
>  }
>  
>  /*
> @@ -3513,6 +3575,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>       struct page *list_page;
>       pgoff_t offset;
>       unsigned char count;
> +     int ret = 0;
>  
>       /*
>        * When debugging, it's easier to use __GFP_ZERO here; but it's better
> @@ -3520,15 +3583,15 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>        */
>       page = alloc_page(gfp_mask | __GFP_HIGHMEM);
>  
> -     si = swap_info_get(entry);
> +     si = get_swap_device(entry);
>       if (!si) {
>               /*
>                * An acceptable race has occurred since the failing
> -              * __swap_duplicate(): the swap entry has been freed,
> -              * perhaps even the whole swap_map cleared for swapoff.
> +              * __swap_duplicate(): the swap device may be swapoff
>                */
>               goto outer;
>       }
> +     spin_lock(&si->lock);
>  
>       offset = swp_offset(entry);
>  
> @@ -3546,9 +3609,8 @@ int add_swap_count_continuation(swp_entry_t entry, 
> gfp_t gfp_mask)
>       }
>  
>       if (!page) {
> -             unlock_cluster(ci);
> -             spin_unlock(&si->lock);
> -             return -ENOMEM;
> +             ret = -ENOMEM;
> +             goto out;
>       }
>  
>       /*
> @@ -3600,10 +3662,11 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>  out:
>       unlock_cluster(ci);
>       spin_unlock(&si->lock);
> +     put_swap_device(si);
>  outer:
>       if (page)
>               __free_page(page);
> -     return 0;
> +     return ret;
>  }
>  
>  /*
> -- 
> 2.15.0
> 