Re: [PATCH 0/4] percpu: add basic stats and tracepoints to percpu allocator
On 6/20/17, 1:45 PM, "Tejun Heo" wrote:
> Applied to percpu/for-4.13. I had to update 0002 because of the
> recent __ro_after_init changes. Can you please see whether I made any
> mistakes while updating it?

There is a tagging mismatch in 0002. Can you please change or remove the
__read_mostly annotation in mm/percpu-internal.h?

Thanks,
Dennis
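For context, the kind of declaration/definition mismatch being described looks roughly like the sketch below. The symbol name is hypothetical and only illustrative; the actual variable touched by patch 0002 is not shown in this message.

    #include <linux/cache.h>
    #include <linux/list.h>

    /* mm/percpu.c: definition updated by the recent __ro_after_init changes */
    struct list_head *pcpu_example_var __ro_after_init;

    /* mm/percpu-internal.h: stale declaration still carrying __read_mostly */
    extern struct list_head *pcpu_example_var __read_mostly;

The fix is simply to make the header declaration agree with (or stop annotating) the definition.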
[PATCH 1/1] percpu: resolve err may not be initialized in pcpu_alloc
From 4a42ecc735cff0015cc73c3d87edede631f4b885 Mon Sep 17 00:00:00 2001
From: Dennis Zhou
Date: Wed, 21 Jun 2017 08:07:15 -0700

Add error message to out of space failure for atomic allocations in the
percpu allocation path to fix -Wmaybe-uninitialized.

Signed-off-by: Dennis Zhou
Reported-by: Stephen Rothwell
---
 mm/percpu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index a5bc363..bd4130a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -956,8 +956,10 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
	 * tasks to create chunks simultaneously.  Serialize and create iff
	 * there's still no empty chunk after grabbing the mutex.
	 */
-	if (is_atomic)
+	if (is_atomic) {
+		err = "atomic alloc failed, no space left";
 		goto fail;
+	}

 	if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
 		chunk = pcpu_create_chunk();
--
2.9.3
[PATCH 1/1] percpu: fix early calls for spinlock in pcpu_stats
From 2c06e795162cb306c9707ec51d3e1deadb37f573 Mon Sep 17 00:00:00 2001
From: Dennis Zhou
Date: Wed, 21 Jun 2017 10:17:09 -0700

Commit 30a5b5367ef9 ("percpu: expose statistics about percpu memory via
debugfs") introduces percpu memory statistics. pcpu_stats_chunk_alloc
takes the spin lock and disables/enables irqs on creation of a chunk.
Irqs are not enabled when the first chunk is initialized and thus
kernels are failing to boot with kernel debugging enabled. Fix this by
changing _irq to _irqsave and _irqrestore.

Fixes: 30a5b5367ef9 ("percpu: expose statistics about percpu memory via debugfs")
Signed-off-by: Dennis Zhou
Reported-by: Alexander Levin
---
Hi Sasha,

The root cause was from 0003 of that series where I prematurely enabled
irqs; the problem is addressed here. I am able to boot with debug
options enabled.

Thanks,
Dennis

 mm/percpu-internal.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index d030fce..cd2442e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -116,13 +116,14 @@ static inline void pcpu_stats_area_dealloc(struct pcpu_chunk *chunk)
  */
 static inline void pcpu_stats_chunk_alloc(void)
 {
-	spin_lock_irq(&pcpu_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&pcpu_lock, flags);

 	pcpu_stats.nr_chunks++;
 	pcpu_stats.nr_max_chunks =
 		max(pcpu_stats.nr_max_chunks, pcpu_stats.nr_chunks);

-	spin_unlock_irq(&pcpu_lock);
+	spin_unlock_irqrestore(&pcpu_lock, flags);
 }

 /*
@@ -130,11 +131,12 @@ static inline void pcpu_stats_chunk_alloc(void)
  */
 static inline void pcpu_stats_chunk_dealloc(void)
 {
-	spin_lock_irq(&pcpu_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&pcpu_lock, flags);

 	pcpu_stats.nr_chunks--;

-	spin_unlock_irq(&pcpu_lock);
+	spin_unlock_irqrestore(&pcpu_lock, flags);
 }

 #else
--
2.9.3
Re: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat
Hi Daniel and Tejun,

On Wed, Oct 18, 2017 at 06:25:26AM -0700, Tejun Heo wrote:
> > Daniel Borkmann (3):
> >   mm, percpu: add support for __GFP_NOWARN flag
>
> This looks fine.
>

Looks good to me too.

> >   bpf: fix splat for illegal devmap percpu allocation
> >   bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations
>
> These look okay too but if it helps percpu allocator can expose the
> maximum size / alignment supported to take out the guessing game too.
>

I can add this once we've addressed the below if we want to.

> Also, the reason why PCPU_MIN_UNIT_SIZE is what it is is because
> nobody needed anything bigger. Increasing the size doesn't really
> cost much at least on 64bit archs. Is that something we want to be
> considering?
>

I'm not sure I see the reason we can't match the minimum allocation size
with the unit size? It seems weird to arbitrate the maximum allocation
size given a lower bound on the unit size.

Thanks,
Dennis
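For readers following along, the caller-side pattern the series enables looks roughly like the sketch below. This is a hypothetical helper, not the actual BPF devmap code: with __GFP_NOWARN plumbed through the percpu allocator, an oversized request simply returns NULL instead of splatting, so callers no longer need their own PCPU_MIN_UNIT_SIZE pre-check.

    #include <linux/percpu.h>
    #include <linux/gfp.h>

    static void __percpu *example_pcpu_alloc(size_t size, size_t align)
    {
    	/* fails quietly if size exceeds what a percpu unit can service */
    	return __alloc_percpu_gfp(size, align, GFP_KERNEL | __GFP_NOWARN);
    }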
[PATCH] proc: add percpu populated pages count to meminfo
From: "Dennis Zhou (Facebook)" Currently, percpu memory only exposes allocation and utilization information via debugfs. This more or less is only really useful for understanding the fragmentation and allocation information at a per-chunk level with a few global counters. This is also gated behind a config. BPF and cgroup, for example, have seen an increase use causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the PercpuPopulated stat to meminfo to more easily look up how much percpu memory is in use. This new number includes the cost for all backing pages and not just insight at the a unit, per chunk level. This stat includes only pages used to back the chunks themselves excluding metadata. I think excluding metadata is fair because the backing memory scales with the number of cpus and can quickly outweigh the metadata. It also makes this calculation light. Signed-off-by: Dennis Zhou --- fs/proc/meminfo.c | 2 ++ include/linux/percpu.h | 2 ++ mm/percpu.c| 29 + 3 files changed, 33 insertions(+) diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 2fb04846ed11..ddd5249692e9 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -121,6 +122,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) (unsigned long)VMALLOC_TOTAL >> 10); show_val_kb(m, "VmallocUsed:", 0ul); show_val_kb(m, "VmallocChunk: ", 0ul); + show_val_kb(m, "PercpuPopulated:", pcpu_nr_populated_pages()); #ifdef CONFIG_MEMORY_FAILURE seq_printf(m, "HardwareCorrupted: %5lu kB\n", diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 296bbe49d5d1..1c80be42822c 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -149,4 +149,6 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr); (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \ __alignof__(type)) +extern int pcpu_nr_populated_pages(void); + #endif /* __LINUX_PERCPU_H */ diff --git a/mm/percpu.c b/mm/percpu.c index 0b6480979ac7..08a4341f30c5 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -169,6 +169,14 @@ static LIST_HEAD(pcpu_map_extend_chunks); */ int pcpu_nr_empty_pop_pages; +/* + * The number of populated pages in use by the allocator, protected by + * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets + * allocated/deallocated, it is allocated/deallocated in all units of a chunk + * and increments/decrements this count by 1). + */ +static int pcpu_nr_populated; + /* * Balance work is used to populate or destroy chunks asynchronously. 
We * try to keep the number of populated free pages between @@ -1232,6 +1240,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start, bitmap_set(chunk->populated, page_start, nr); chunk->nr_populated += nr; + pcpu_nr_populated += nr; if (!for_alloc) { chunk->nr_empty_pop_pages += nr; @@ -1260,6 +1269,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, chunk->nr_populated -= nr; chunk->nr_empty_pop_pages -= nr; pcpu_nr_empty_pop_pages -= nr; + pcpu_nr_populated -= nr; } /* @@ -2176,6 +2186,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages; pcpu_chunk_relocate(pcpu_first_chunk, -1); + /* include all regions of the first chunk */ + pcpu_nr_populated += PFN_DOWN(size_sum); + pcpu_stats_chunk_alloc(); trace_percpu_create_chunk(base_addr); @@ -2745,6 +2758,22 @@ void __init setup_per_cpu_areas(void) #endif /* CONFIG_SMP */ +/* + * pcpu_nr_populated_pages - calculate total number of populated backing pages + * + * This reflects the number of pages populated to back the chunks. + * Metadata is excluded in the number exposed in meminfo as the number of + * backing pages scales with the number of cpus and can quickly outweigh the + * memory used for metadata. It also keeps this calculation nice and simple. + * + * RETURNS: + * Total number of populated backing pages in use by the allocator. + */ +int pcpu_nr_populated_pages(void) +{ + return pcpu_nr_populated * pcpu_nr_units; +} + /* * Percpu allocator is initialized early during boot when neither slab or * workqueue is available. Plug async management until everything is up -- 2.17.1
Re: [PATCH] proc: add percpu populated pages count to meminfo
Hi Vlastimil,

On Tue, Aug 07, 2018 at 03:18:31PM +0200, Vlastimil Babka wrote:
>
> Documentation/filesystems/proc.txt should be updated as well
>

Will do.

> >
> > +/*
> > + * The number of populated pages in use by the allocator, protected by
> > + * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
> > + * allocated/deallocated, it is allocated/deallocated in all units of a chunk
> > + * and increments/decrements this count by 1).
> > + */
> > +static int pcpu_nr_populated;
>
> It better be unsigned long, to match others.
>

Yeah that makes sense. I've changed this for v2.

> > +/*
> > + * pcpu_nr_populated_pages - calculate total number of populated backing pages
> > + *
> > + * This reflects the number of pages populated to back the chunks.
> > + * Metadata is excluded in the number exposed in meminfo as the number of
> > + * backing pages scales with the number of cpus and can quickly outweigh the
> > + * memory used for metadata. It also keeps this calculation nice and simple.
> > + *
> > + * RETURNS:
> > + * Total number of populated backing pages in use by the allocator.
> > + */
> > +int pcpu_nr_populated_pages(void)
>
> Also unsigned long please.
>

Changed for v2.

Thanks,
Dennis
Re: [PATCH] proc: add percpu populated pages count to meminfo
Hi Christopher,

On Tue, Aug 07, 2018 at 02:12:06PM +0000, Christopher Lameter wrote:
> On Mon, 6 Aug 2018, Dennis Zhou wrote:
> >  	show_val_kb(m, "VmallocUsed:    ", 0ul);
> >  	show_val_kb(m, "VmallocChunk:   ", 0ul);
> > +	show_val_kb(m, "PercpuPopulated:", pcpu_nr_populated_pages());
>
> Populated? Can we avoid this for simplicities sake: "Percpu"?

Yeah, I've dropped populated.

> We do not count pages that are not present elsewhere either and those
> counters do not have "populated" in them.

I see, that makes sense. I think I was trying to keep an external
distinction between what we reserve and what we actually have populated
that really isn't useful outside of playing with the allocator itself.

> >  	int pcpu_nr_empty_pop_pages;
> >
> > +/*
> > + * The number of populated pages in use by the allocator, protected by
> > + * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
> > + * allocated/deallocated, it is allocated/deallocated in all units of a chunk
> > + * and increments/decrements this count by 1).
> > + */
> > +static int pcpu_nr_populated;
>
> pcpu_nr_pages?
>

I'd rather keep it as pcpu_nr_populated because internally in pcpu_chunk
we maintain nr_pages and nr_populated. That way we keep the same meaning
at the chunk and global level.

Thanks,
Dennis
Re: [PATCH] proc: add percpu populated pages count to meminfo
Hi Tejun,

On Tue, Aug 07, 2018 at 08:11:46AM -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 07, 2018 at 02:12:06PM +0000, Christopher Lameter wrote:
> > > @@ -121,6 +122,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > >  		   (unsigned long)VMALLOC_TOTAL >> 10);
> > >  	show_val_kb(m, "VmallocUsed:    ", 0ul);
> > >  	show_val_kb(m, "VmallocChunk:   ", 0ul);
> > > +	show_val_kb(m, "PercpuPopulated:", pcpu_nr_populated_pages());
> >
> > Populated? Can we avoid this for simplicities sake: "Percpu"?
> >
> > We do not count pages that are not present elsewhere either and those
> > counters do not have "populated" in them.
>
> Yeah, let's do "Percpu".
>

Sounds good, I've dropped populated.

Thanks,
Dennis
[PATCH v2] proc: add percpu populated pages count to meminfo
From: "Dennis Zhou (Facebook)" Currently, percpu memory only exposes allocation and utilization information via debugfs. This more or less is only really useful for understanding the fragmentation and allocation information at a per-chunk level with a few global counters. This is also gated behind a config. BPF and cgroup, for example, have seen an increase use causing increased use of percpu memory. Let's make it easier for someone to identify how much memory is being used. This patch adds the "Percpu" stat to meminfo to more easily look up how much percpu memory is in use. This number includes the cost for all allocated backing pages and not just isnight at the a unit, per chunk level. Metadata is excluded. I think excluding metadata is fair because the backing memory scales with the numbere of cpus and can quickly outweigh the metadata. It also makes this calculation light. Signed-off-by: Dennis Zhou --- Documentation/filesystems/proc.txt | 3 +++ fs/proc/meminfo.c | 2 ++ include/linux/percpu.h | 2 ++ mm/percpu.c| 29 + 4 files changed, 36 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 520f6a84cf50..490f803be1ec 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -870,6 +870,7 @@ Committed_AS: 100056 kB VmallocTotal: 112216 kB VmallocUsed: 428 kB VmallocChunk: 111088 kB +Percpu: 62080 kB AnonHugePages: 49152 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB @@ -959,6 +960,8 @@ Committed_AS: The amount of memory presently allocated on the system. VmallocTotal: total size of vmalloc memory area VmallocUsed: amount of vmalloc area which is used VmallocChunk: largest contiguous block of vmalloc area which is free + Percpu: Memory allocated to the percpu allocator used to back percpu + allocations. This stat excludes the cost of metadata. .. diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 2fb04846ed11..edda898714eb 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -121,6 +122,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) (unsigned long)VMALLOC_TOTAL >> 10); show_val_kb(m, "VmallocUsed:", 0ul); show_val_kb(m, "VmallocChunk: ", 0ul); + show_val_kb(m, "Percpu: ", pcpu_nr_pages()); #ifdef CONFIG_MEMORY_FAILURE seq_printf(m, "HardwareCorrupted: %5lu kB\n", diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 296bbe49d5d1..70b7123f38c7 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -149,4 +149,6 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr); (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \ __alignof__(type)) +extern unsigned long pcpu_nr_pages(void); + #endif /* __LINUX_PERCPU_H */ diff --git a/mm/percpu.c b/mm/percpu.c index 0b6480979ac7..a749d4d96e3e 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -169,6 +169,14 @@ static LIST_HEAD(pcpu_map_extend_chunks); */ int pcpu_nr_empty_pop_pages; +/* + * The number of populated pages in use by the allocator, protected by + * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets + * allocated/deallocated, it is allocated/deallocated in all units of a chunk + * and increments/decrements this count by 1). + */ +static unsigned long pcpu_nr_populated; + /* * Balance work is used to populate or destroy chunks asynchronously. 
We * try to keep the number of populated free pages between @@ -1232,6 +1240,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start, bitmap_set(chunk->populated, page_start, nr); chunk->nr_populated += nr; + pcpu_nr_populated += nr; if (!for_alloc) { chunk->nr_empty_pop_pages += nr; @@ -1260,6 +1269,7 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk, chunk->nr_populated -= nr; chunk->nr_empty_pop_pages -= nr; pcpu_nr_empty_pop_pages -= nr; + pcpu_nr_populated -= nr; } /* @@ -2176,6 +2186,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages; pcpu_chunk_relocate(pcpu_first_chunk, -1); + /* include all regions of the first chunk */ + pcpu_nr_populated += PFN_DOWN(size_sum); + pcpu_stats_chunk_alloc(); trace_percpu_create_chunk(base_addr); @@ -2745,6 +2758,22 @@ void __init setup_per_cpu_areas(void) #endif /* CONFIG_SMP */ +/* + * pcpu_nr_pages - ca
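As a quick illustration of how the new field can be consumed (a userspace sketch, not part of the patch), a monitoring tool could parse the value like this:

    #include <stdio.h>

    /* Sketch: read the "Percpu:" line this patch adds to /proc/meminfo. */
    int main(void)
    {
    	char line[256];
    	unsigned long percpu_kb = 0;
    	FILE *f = fopen("/proc/meminfo", "r");

    	if (!f)
    		return 1;

    	while (fgets(line, sizeof(line), f)) {
    		if (sscanf(line, "Percpu: %lu kB", &percpu_kb) == 1)
    			break;
    	}
    	fclose(f);

    	printf("percpu backing pages: %lu kB\n", percpu_kb);
    	return 0;
    }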
LPC Traffic Shaping w/ BPF Talk - percpu followup
Hi Eddie, Vlad, and Willem,

A few people mentioned to me that you guys were experiencing issues with
the percpu memory allocator. I saw the talk slides mention the following
two bullets:

1) allocation pattern makes the per cpu allocator reach a highly
   fragmented state
2) sometimes takes a long time (up to 12s) to create the PERCPU_HASH
   maps at startup

Could you guys elaborate a little more about the above? Some things that
would help: kernel version, cpu info, and a reproducer if possible?

Also, I did some work last summer to make percpu allocation more
efficient, which went into the 4.14 kernel. Just to be sure, is that a
part of the kernel you guys are running?

Thanks,
Dennis
[PATCH 12/14] blkcg: remove bio_disassociate_task()
Now that a bio only holds a blkg reference, cleanup is simply putting
back that reference. Remove bio_disassociate_task() as it just calls
bio_disassociate_blkg() and call the latter directly.

Signed-off-by: Dennis Zhou
Acked-by: Tejun Heo
Reviewed-by: Josef Bacik
---
 block/bio.c         | 11 +----------
 include/linux/bio.h |  2 --
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ce1e512dca5a..7ec5316e6ecc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -244,7 +244,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,

 void bio_uninit(struct bio *bio)
 {
-	bio_disassociate_task(bio);
+	bio_disassociate_blkg(bio);
 }
 EXPORT_SYMBOL(bio_uninit);

@@ -2073,15 +2073,6 @@ void bio_associate_blkg(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(bio_associate_blkg);

-/**
- * bio_disassociate_task - undo bio_associate_current()
- * @bio: target bio
- */
-void bio_disassociate_task(struct bio *bio)
-{
-	bio_disassociate_blkg(bio);
-}
-
 /**
  * bio_clone_blkg_association - clone blkg association from src to dst bio
  * @dst: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 84e1c4dc703a..7380b094dcca 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -516,7 +516,6 @@ void bio_disassociate_blkg(struct bio *bio);
 void bio_associate_blkg(struct bio *bio);
 void bio_associate_blkg_from_css(struct bio *bio,
 				 struct cgroup_subsys_state *css);
-void bio_disassociate_task(struct bio *bio);
 void bio_clone_blkg_association(struct bio *dst, struct bio *src);
 #else	/* CONFIG_BLK_CGROUP */
 static inline void bio_disassociate_blkg(struct bio *bio) { }
@@ -524,7 +523,6 @@ static inline void bio_associate_blkg(struct bio *bio) { }
 static inline void bio_associate_blkg_from_css(struct bio *bio,
 					       struct cgroup_subsys_state *css)
 { }
-static inline void bio_disassociate_task(struct bio *bio) { }
 static inline void bio_clone_blkg_association(struct bio *dst,
 					      struct bio *src) { }
 #endif	/* CONFIG_BLK_CGROUP */
--
2.17.1
[PATCH] blkcg: put back rcu lock in blkcg_bio_issue_check()
I was a little overzealous in removing the rcu_read_lock() call from
blkcg_bio_issue_check() and it broke blk-throttle. Put it back.

Fixes: e35403a034bf ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou
---
 include/linux/blk-cgroup.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 752de1becb5c..bf13ecb0fe4f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -764,6 +764,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	struct blkcg_gq *blkg;
 	bool throtl = false;

+	rcu_read_lock();
+
 	if (!bio->bi_blkg) {
 		char b[BDEVNAME_SIZE];

@@ -791,6 +793,7 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,

 	blkcg_bio_issue_init(bio);

+	rcu_read_unlock();
 	return !throtl;
 }
--
2.17.1
Re: [PATCH] blkcg: handle dying request_queue when associating a blkg
Hi Bart,

On Tue, Dec 11, 2018 at 03:16:13PM -0800, Bart Van Assche wrote:
> On Tue, 2018-12-11 at 18:03 -0500, Dennis Zhou wrote:
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 6bd0619a7d6e..c30661ddc873 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -202,6 +202,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> >  	WARN_ON_ONCE(!rcu_read_lock_held());
> >  	lockdep_assert_held(&q->queue_lock);
> >
> > +	/* request_queue is dying, do not create/recreate a blkg */
> > +	if (blk_queue_dying(q)) {
> > +		ret = -ENODEV;
> > +		goto err_free_blkg;
> > +	}
> > +
> >  	/* blkg holds a reference to blkcg */
> >  	if (!css_tryget_online(&blkcg->css)) {
> >  		ret = -ENODEV;
>
> What prevents that the queue state changes after blk_queue_dying() has returned
> and before blkg_create() returns? Are you sure you don't need to protect this
> code with a blk_queue_enter() / blk_queue_exit() pair?
>

Hmmm. So I think the idea is that we rely on normal shutdown as I don't
think there is anything wrong with creating a blkg on a dying
request_queue. When we are doing association, the request_queue should
be pinned by the open call. What we are racing against is when the
request_queue is shutting down, it goes around and destroys the blkgs.
For clarity, QUEUE_FLAG_DYING is set in blk_cleanup_queue() before
calling blk_exit_queue() which eventually calls blkcg_exit_queue().

The use of blk_queue_dying() is to determine whether blkg shutdown has
already started as if we create one after it has started, we may
incorrectly orphan a blkg and leak it. Both blkg creation and
destruction require holding the queue_lock, so if the QUEUE_FLAG_DYING
flag is set after we've checked it, it means blkg destruction hasn't
started because it has to wait on the queue_lock. If QUEUE_FLAG_DYING is
set, then we have no guarantee of knowing what phase blkg destruction is
in, leading to a potential leak.

Thanks,
Dennis
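To make the serialization argument above concrete, here is a rough timeline of the two paths. This is an illustrative C comment only, not actual kernel code:

    /*
     * blkg_create() caller                blk_cleanup_queue() path
     * --------------------                ------------------------
     * spin_lock_irq(&q->queue_lock);
     * if (blk_queue_dying(q))             set QUEUE_FLAG_DYING
     *         fail with -ENODEV;          ...
     * ... create and insert blkg ...      blkcg_exit_queue(q):
     * spin_unlock_irq(&q->queue_lock);            spin_lock_irq(&q->queue_lock);
     *                                             blkg_destroy_all(q);  <- must wait
     *                                             spin_unlock_irq(&q->queue_lock);
     *
     * Because blkg destruction also needs q->queue_lock, any blkg created
     * while the lock is held is guaranteed to be seen and torn down even
     * if QUEUE_FLAG_DYING is set concurrently.
     */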
Re: [PATCH] blkcg: handle dying request_queue when associating a blkg
On Wed, Dec 12, 2018 at 03:54:52PM -0800, Bart Van Assche wrote:
> On Tue, 2018-12-11 at 23:06 -0500, Dennis Zhou wrote:
> > Hi Bart,
> >
> > On Tue, Dec 11, 2018 at 03:16:13PM -0800, Bart Van Assche wrote:
> > > On Tue, 2018-12-11 at 18:03 -0500, Dennis Zhou wrote:
> > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > > index 6bd0619a7d6e..c30661ddc873 100644
> > > > --- a/block/blk-cgroup.c
> > > > +++ b/block/blk-cgroup.c
> > > > @@ -202,6 +202,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> > > >  	WARN_ON_ONCE(!rcu_read_lock_held());
> > > >  	lockdep_assert_held(&q->queue_lock);
> > > >
> > > > +	/* request_queue is dying, do not create/recreate a blkg */
> > > > +	if (blk_queue_dying(q)) {
> > > > +		ret = -ENODEV;
> > > > +		goto err_free_blkg;
> > > > +	}
> > > > +
> > > >  	/* blkg holds a reference to blkcg */
> > > >  	if (!css_tryget_online(&blkcg->css)) {
> > > >  		ret = -ENODEV;
> > >
> > > What prevents that the queue state changes after blk_queue_dying() has returned
> > > and before blkg_create() returns? Are you sure you don't need to protect this
> > > code with a blk_queue_enter() / blk_queue_exit() pair?
> > >
> >
> > Hmmm. So I think the idea is that we rely on normal shutdown as I don't
> > think there is anything wrong with creating a blkg on a dying
> > request_queue. When we are doing association, the request_queue should
> > be pinned by the open call. What we are racing against is when the
> > request_queue is shutting down, it goes around and destroys the blkgs.
> > For clarity, QUEUE_FLAG_DYING is set in blk_cleanup_queue() before
> > calling blk_exit_queue() which eventually calls blkcg_exit_queue().
> >
> > The use of blk_queue_dying() is to determine whether blkg shutdown has
> > already started as if we create one after it has started, we may
> > incorrectly orphan a blkg and leak it. Both blkg creation and
> > destruction require holding the queue_lock, so if the QUEUE_FLAG_DYING
> > flag is set after we've checked it, it means blkg destruction hasn't
> > started because it has to wait on the queue_lock. If QUEUE_FLAG_DYING is
> > set, then we have no guarantee of knowing what phase blkg destruction is
> > in leading to a potential leak.
>
> Hi Dennis,
>
> To answer my own question: since all queue flag manipulations are protected
> by the queue lock and since blkg_create() is called with the queue lock held
> the above code does not need any further protection. Hence feel free to add
> the following:
>
> Reviewed-by: Bart Van Assche
>

It seems that Christoph in 57d74df90783 ("block: use atomic bitops for
->queue_flags") changed it so that flag manipulations no longer are
protected by the queue_lock in for-4.21/block. But I think my
explanation above suffices that we will always be able to clean up a
blkg created as long as the QUEUE_FLAG_DYING is not set.

Thanks for reviewing this!
Dennis
Re: [PATCH v3] block: fix blk-iolatency accounting underflow
On Mon, Dec 17, 2018 at 11:42:28AM -0800, Liu Bo wrote:
> On Mon, Dec 17, 2018 at 8:04 AM Dennis Zhou wrote:
> >
> > The blk-iolatency controller measures the time from rq_qos_throttle() to
> > rq_qos_done_bio() and attributes this time to the first bio that needs
> > to create the request. This means if a bio is plug-mergeable or
> > bio-mergeable, it gets to bypass the blk-iolatency controller.
> >
>
> Hi,
>
> I have a question about merging in plug list, since plug merges are
> done before rq_qos_throttle(), why would plug-mergeable bios bypass
> the controller?
>
> thanks,
> liubo
>

Hi Liubo,

BIO_TRACKED is tagging the bio that is responsible for allocating a new
request, so that rq_qos controllers can decide whether or not they want
to process the bio any part of the way. I should have phrased that a
little better in the commit message. It's not that the bio itself is
bypassing the blk-iolatency controller, but the blk-iolatency controller
deciding to not do anything based on the BIO_TRACKED flag. This doesn't
change any of the function calls made on a bio/request.

Thanks,
Dennis
[PATCH] blkcg: clean up blkg_tryget_closest()
The implementation of blkg_tryget_closest() wasn't super obvious and
became a point of suspicion when debugging [1]. So let's clean it up so
it's obviously not the problem.

[1] https://lore.kernel.org/linux-block/a7e97e4b-0dd8-3a54-23b7-a0f27b17f...@kernel.dk/

Signed-off-by: Dennis Zhou
---
 include/linux/blk-cgroup.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index f025fd1e22e6..76c61318fda5 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -499,22 +499,33 @@ static inline void blkg_get(struct blkcg_gq *blkg)
  */
 static inline bool blkg_tryget(struct blkcg_gq *blkg)
 {
-	return percpu_ref_tryget(&blkg->refcnt);
+	return blkg && percpu_ref_tryget(&blkg->refcnt);
 }

 /**
  * blkg_tryget_closest - try and get a blkg ref on the closet blkg
  * @blkg: blkg to get
  *
- * This walks up the blkg tree to find the closest non-dying blkg and returns
- * the blkg that it did association with as it may not be the passed in blkg.
+ * This needs to be called rcu protected.  As the failure mode here is to walk
+ * up the blkg tree, this ensures that the blkg->parent pointers are always
+ * valid.  This returns the blkg that it ended up taking a reference on or %NULL
+ * if no reference was taken.
  */
 static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg)
 {
-	while (blkg && !percpu_ref_tryget(&blkg->refcnt))
+	struct blkcg_gq *ret_blkg = NULL;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	while (blkg) {
+		if (blkg_tryget(blkg)) {
+			ret_blkg = blkg;
+			break;
+		}
 		blkg = blkg->parent;
+	}

-	return blkg;
+	return ret_blkg;
 }

 /**
--
2.17.1
[PATCH] blkcg: remove unused __blkg_release_rcu()
An earlier commit 7fcf2b033b84 ("blkcg: change blkg reference counting
to use percpu_ref") moved around the release call from blkg_put() to be
a part of the percpu_ref cleanup. Remove the additional unused code
which should have been removed earlier.

Signed-off-by: Dennis Zhou
---
 block/blk-cgroup.c | 23 -----------------------
 1 file changed, 23 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c8cc1cbb6370..2bed5725aa03 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -438,29 +438,6 @@ static void blkg_destroy_all(struct request_queue *q)
 	spin_unlock_irq(&q->queue_lock);
 }

-/*
- * A group is RCU protected, but having an rcu lock does not mean that one
- * can access all the fields of blkg and assume these are valid.  For
- * example, don't try to follow throtl_data and request queue links.
- *
- * Having a reference to blkg under an rcu allows accesses to only values
- * local to groups like group stats and group rate limits.
- */
-void __blkg_release_rcu(struct rcu_head *rcu_head)
-{
-	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
-
-	/* release the blkcg and parent blkg refs this blkg has been holding */
-	css_put(&blkg->blkcg->css);
-	if (blkg->parent)
-		blkg_put(blkg->parent);
-
-	wb_congested_put(blkg->wb_congested);
-
-	blkg_free(blkg);
-}
-EXPORT_SYMBOL_GPL(__blkg_release_rcu);
-
 static int blkcg_reset_stats(struct cgroup_subsys_state *css,
 			     struct cftype *cftype, u64 val)
 {
--
2.17.1
[PATCH v3] block: fix blk-iolatency accounting underflow
The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series [1] to tag all bios w/ blkgs undermined how iolatency
was determining which bios it was charging and should process in
rq_qos_done_bio(). Because all bios are being tagged, this caused the
atomic_t for the struct rq_wait inflight count to underflow and result
in a stall.

This patch adds a new flag BIO_TRACKED to let controllers know that a
bio is going through the rq_qos path. blk-iolatency now checks if this
flag is set to see if it should process the bio in rq_qos_done_bio().

Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
BIO_THROTTLED was another candidate, but the flag is set for all bios
that have gone through blk-throttle code. Overloading a flag comes with
the burden of making sure that when either implementation changes, a
change in setting rules for one doesn't cause a bug in the other. So
here, we unfortunately opt for adding a new flag.

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-den...@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou
Cc: Josef Bacik
---
 block/blk-iolatency.c     | 2 +-
 block/blk-rq-qos.h        | 5 +++++
 include/linux/blk_types.h | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index bee092727cad..fc714ef402a6 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -593,7 +593,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
 	bool enabled = false;

 	blkg = bio->bi_blkg;
-	if (!blkg)
+	if (!blkg || !bio_flagged(bio, BIO_TRACKED))
 		return;

 	iolat = blkg_to_lat(bio->bi_blkg);
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index fd8a0c5debd3..58f62483b537 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -170,6 +170,11 @@ static inline void rq_qos_done_bio(struct request_queue *q, struct bio *bio)

 static inline void rq_qos_throttle(struct request_queue *q, struct bio *bio)
 {
+	/*
+	 * BIO_TRACKED lets controllers know that a bio went through the
+	 * normal rq_qos path.
+	 */
+	bio_set_flag(bio, BIO_TRACKED);
 	if (q->rq_qos)
 		__rq_qos_throttle(q->rq_qos, bio);
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46c005d601ac..fc99474ac968 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -228,6 +228,7 @@ struct bio {
 #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
				 * of this bio. */
 #define BIO_QUEUE_ENTERED 11	/* can use blk_queue_enter_live() */
+#define BIO_TRACKED 12		/* set if bio goes through the rq_qos path */

 /* See BVEC_POOL_OFFSET below before adding new flags */
--
2.17.1
[PATCH] block: fix iolat timestamp and restore accounting semantics
The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series to tag all bios w/ blkgs in [1] changed the timing
incorrectly as well. First, the iolatency controller was tagging bios
and using that information if it should process it in rq_qos_done_bio().
However, now that all bios are tagged, this caused the atomic_t for the
struct rq_wait inflight count to underflow resulting in a stall. Second,
now the timing was using the duration a bio takes from
generic_make_request() rather than the timing mentioned above.

This patch fixes the errors by accounting time separately in a bio,
adding the field bi_start. If this field is set, the bio should be
processed by blk-iolatency in rq_qos_done_bio().

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-den...@kernel.org/

Signed-off-by: Dennis Zhou
Cc: Josef Bacik
---
 block/blk-iolatency.c     | 17 ++++++-----------
 include/linux/blk_types.h | 12 ++++++++++++
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index bee092727cad..52d5d7cc387c 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -463,6 +463,8 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio)
 	if (!blk_iolatency_enabled(blkiolat))
 		return;

+	bio->bi_start = ktime_get_ns();
+
 	while (blkg && blkg->parent) {
 		struct iolatency_grp *iolat = blkg_to_lat(blkg);
 		if (!iolat) {
@@ -480,18 +482,12 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio)
 }

 static void iolatency_record_time(struct iolatency_grp *iolat,
-				  struct bio_issue *issue, u64 now,
+				  struct bio *bio, u64 now,
 				  bool issue_as_root)
 {
-	u64 start = bio_issue_time(issue);
+	u64 start = bio->bi_start;
 	u64 req_time;

-	/*
-	 * Have to do this so we are truncated to the correct time that our
-	 * issue is truncated to.
-	 */
-	now = __bio_issue_time(now);
-
 	if (now <= start)
 		return;

@@ -593,7 +589,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
 	bool enabled = false;

 	blkg = bio->bi_blkg;
-	if (!blkg)
+	if (!blkg || !bio->bi_start)
 		return;

 	iolat = blkg_to_lat(bio->bi_blkg);
@@ -612,8 +608,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
 	atomic_dec(&rqw->inflight);
 	if (!enabled || iolat->min_lat_nsec == 0)
 		goto next;
-	iolatency_record_time(iolat, &bio->bi_issue, now,
-			      issue_as_root);
+	iolatency_record_time(iolat, bio, now, issue_as_root);
 	window_start = atomic64_read(&iolat->window_start);
 	if (now > window_start &&
	    (now - window_start) >= iolat->cur_win_nsec) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46c005d601ac..c2c02ec08d7c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -181,6 +181,18 @@ struct bio {
	 */
 	struct blkcg_gq		*bi_blkg;
 	struct bio_issue	bi_issue;
+#ifdef CONFIG_BLK_CGROUP_IOLATENCY
+	/*
+	 * blk-iolatency measure the time a bio takes between rq_qos_throttle()
+	 * and rq_qos_done_bio(). It attributes the time to the bio that gets
+	 * the request allowing any bios that can tag along via plug merging or
+	 * bio merging to be free (from blk-iolatency's perspective). This is
+	 * different from the time a bio takes from generic_make_request() to
+	 * the end of its life. So, this also serves as a marker for which bios
+	 * should be processed by blk-iolatency.
+	 */
+	u64			bi_start;
+#endif /* CONFIG_BLK_CGROUP_IOLATENCY */
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
--
2.17.1
Re: [PATCH] block: fix iolat timestamp and restore accounting semantics
Hi Josef,

On Mon, Dec 10, 2018 at 01:25:08PM -0500, Josef Bacik wrote:
> On Mon, Dec 10, 2018 at 11:35:10AM -0500, Dennis Zhou wrote:
> > The blk-iolatency controller measures the time from rq_qos_throttle() to
> > rq_qos_done_bio() and attributes this time to the first bio that needs
> > to create the request. This means if a bio is plug-mergeable or
> > bio-mergeable, it gets to bypass the blk-iolatency controller.
> >
> > The recent series, to tag all bios w/ blkgs in [1] changed the timing
> > incorrectly as well. First, the iolatency controller was tagging bios
> > and using that information if it should process it in rq_qos_done_bio().
> > However, now that all bios are tagged, this caused the atomic_t for the
> > struct rq_wait inflight count to underflow resulting in a stall. Second,
> > now the timing was using the duration a bio from generic_make_request()
> > rather than the timing mentioned above.
> >
> > This patch fixes the errors by accounting time separately in a bio
> > adding the field bi_start. If this field is set, the bio should be
> > processed by blk-iolatency in rq_qos_done_bio().
> >
> > [1] https://lore.kernel.org/lkml/20181205171039.73066-1-den...@kernel.org/
> >
> > Signed-off-by: Dennis Zhou
> > Cc: Josef Bacik
> > ---
> >  block/blk-iolatency.c     | 17 ++++++-----------
> >  include/linux/blk_types.h | 12 ++++++++++++
> >  2 files changed, 18 insertions(+), 11 deletions(-)
> >
> > diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
> > index bee092727cad..52d5d7cc387c 100644
> > --- a/block/blk-iolatency.c
> > +++ b/block/blk-iolatency.c
> > @@ -463,6 +463,8 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio)
> >  	if (!blk_iolatency_enabled(blkiolat))
> >  		return;
> >
> > +	bio->bi_start = ktime_get_ns();
> > +
> >  	while (blkg && blkg->parent) {
> >  		struct iolatency_grp *iolat = blkg_to_lat(blkg);
> >  		if (!iolat) {
> > @@ -480,18 +482,12 @@ static void blkcg_iolatency_throttle(struct rq_qos *rqos, struct bio *bio)
> >  }
> >
> >  static void iolatency_record_time(struct iolatency_grp *iolat,
> > -				  struct bio_issue *issue, u64 now,
> > +				  struct bio *bio, u64 now,
> >  				  bool issue_as_root)
> >  {
> > -	u64 start = bio_issue_time(issue);
> > +	u64 start = bio->bi_start;
> >  	u64 req_time;
> >
> > -	/*
> > -	 * Have to do this so we are truncated to the correct time that our
> > -	 * issue is truncated to.
> > -	 */
> > -	now = __bio_issue_time(now);
> > -
> >  	if (now <= start)
> >  		return;
> >
> > @@ -593,7 +589,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
> >  	bool enabled = false;
> >
> >  	blkg = bio->bi_blkg;
> > -	if (!blkg)
> > +	if (!blkg || !bio->bi_start)
> >  		return;
> >
> >  	iolat = blkg_to_lat(bio->bi_blkg);
> > @@ -612,8 +608,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
> >  	atomic_dec(&rqw->inflight);
> >  	if (!enabled || iolat->min_lat_nsec == 0)
> >  		goto next;
> > -	iolatency_record_time(iolat, &bio->bi_issue, now,
> > -			      issue_as_root);
> > +	iolatency_record_time(iolat, bio, now, issue_as_root);
> >  	window_start = atomic64_read(&iolat->window_start);
> >  	if (now > window_start &&
> >  	    (now - window_start) >= iolat->cur_win_nsec) {
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 46c005d601ac..c2c02ec08d7c 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -181,6 +181,18 @@ struct bio {
> >  	 */
> >  	struct blkcg_gq		*bi_blkg;
> >  	struct bio_issue	bi_issue;
> > +#ifdef CONFIG_BLK_CGROUP_IOLATENCY
> > +	/*
> > +	 * blk-iolatency measure the time a bio takes between rq_qos_throttle()
> > +	 * and rq_qos_done_bio(). It attributes the time to the bio that gets
> > +	 * the request allowing any bios that can tag along via plug merging or
> > +	 * bio merging to be free (from blk-iolatency's perspective). This is
> > +	 * different from the time a bio takes from
[PATCH v2] block: fix iolat timestamp and restore accounting semantics
The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series to tag all bios w/ blkgs in [1] changed the timing
incorrectly as well. First, the iolatency controller was tagging bios
and using that information if it should process it in rq_qos_done_bio().
However, now that all bios are tagged, this caused the atomic_t for the
struct rq_wait inflight count to underflow resulting in a stall. Second,
now the timing was using the duration a bio takes from
generic_make_request() rather than the timing mentioned above.

This patch fixes these issues by reusing the BIO_QUEUE_ENTERED flag to
determine if a bio has entered the request layer and is responsible for
starting a request. Stacked drivers don't recurse through
blk_mq_make_request(), so the overhead of using time between
generic_make_request() and the blk_mq_get_request() should be minimal.
blk-iolatency now checks if this flag is set to determine if it should
process the bio in rq_qos_done_bio().

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-den...@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou
Cc: Josef Bacik
---
v2:
- Switched to reusing BIO_QUEUE_ENTERED rather than adding a second
  timestamp to the bio struct.

 block/blk-iolatency.c |  2 +-
 block/blk-mq.c        | 12 ++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index bee092727cad..e408282bdc4c 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -593,7 +593,7 @@ static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
 	bool enabled = false;

 	blkg = bio->bi_blkg;
-	if (!blkg)
+	if (!blkg || !bio_flagged(bio, BIO_QUEUE_ENTERED))
 		return;

 	iolat = blkg_to_lat(bio->bi_blkg);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9690f4f8de7e..05ac940e6671 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1920,6 +1920,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;

+	/*
+	 * The flag BIO_QUEUE_ENTERED is used for two purposes. First, it
+	 * determines if a bio is being split and has already entered the queue.
+	 * This happens in blk_queue_split() where we can recursively call
+	 * generic_make_request(). The second use is to mark bios that will
+	 * call rq_qos_throttle() and subsequently blk_mq_get_request(). These
+	 * are the bios that fail plug-merging and bio-merging with the primary
+	 * use case for this being the blk-iolatency controller.
+	 */
+	bio_clear_flag(bio, BIO_QUEUE_ENTERED);
+
 	blk_queue_bounce(q, &bio);

 	blk_queue_split(q, &bio);
@@ -1934,6 +1945,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	if (blk_mq_sched_bio_merge(q, bio))
 		return BLK_QC_T_NONE;

+	bio_set_flag(bio, BIO_QUEUE_ENTERED);
 	rq_qos_throttle(q, bio);

 	rq = blk_mq_get_request(q, bio, &data);
--
2.17.1
[PATCH] blkcg: handle dying request_queue when associating a blkg
Between v3 [1] and v4 [2] of the blkg association series, the
association point moved from generic_make_request_checks(), which is
called after the request enters the queue, to bio_set_dev(), which is
when the bio is formed before submit_bio(). When the request_queue goes
away, the blkgs supporting the request_queue are destroyed and then the
q->root_blkg is set to %NULL.

This patch adds a %NULL check to blkg_tryget_closest() to prevent the
NPE caused by the above. It also adds a guard to see if the
request_queue is dying when creating a blkg to prevent creating a blkg
for a dead request_queue.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennissz...@gmail.com/
[2] https://lore.kernel.org/lkml/20181126211946.77067-1-den...@kernel.org/

Fixes: 5cdf2e3fea5e ("blkcg: associate blkg when associating a device")
Reported-and-tested-by: Ming Lei
Signed-off-by: Dennis Zhou
---
 block/blk-cgroup.c         | 6 ++++++
 include/linux/blk-cgroup.h | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6bd0619a7d6e..c30661ddc873 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -202,6 +202,12 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(&q->queue_lock);

+	/* request_queue is dying, do not create/recreate a blkg */
+	if (blk_queue_dying(q)) {
+		ret = -ENODEV;
+		goto err_free_blkg;
+	}
+
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget_online(&blkcg->css)) {
 		ret = -ENODEV;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index bf13ecb0fe4f..f025fd1e22e6 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -511,7 +511,7 @@ static inline bool blkg_tryget(struct blkcg_gq *blkg)
  */
 static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg)
 {
-	while (!percpu_ref_tryget(&blkg->refcnt))
+	while (blkg && !percpu_ref_tryget(&blkg->refcnt))
 		blkg = blkg->parent;

 	return blkg;
 }
--
2.17.1
[PATCH] blkcg: add rcu lock to bio_clone_blkg_association()
I cleaned up blkg_tryget_closest() to require rcu_read_lock() earlier.
However, this was a subtle case too which clearly was too subtle for me.
The idea was the src bio should be holding a ref to the blkg so rcu
wasn't technically needed. If it doesn't hold a ref, it should be %NULL
and the blkg->parent pointers are unused.

This adds the appropriate read lock in bio_clone_blkg_association().

Fixes: 80fd3c272c1a ("blkcg: clean up blkg_tryget_closest()")
Reported-by: syzbot+a36a3ba92bea3b315...@syzkaller.appspotmail.com
Signed-off-by: Dennis Zhou
---
 block/bio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index c288b9057042..9194d8ad3d5e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2096,8 +2096,12 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg);
  */
 void bio_clone_blkg_association(struct bio *dst, struct bio *src)
 {
+	rcu_read_lock();
+
 	if (src->bi_blkg)
 		__bio_associate_blkg(dst, src->bi_blkg);
+
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(bio_clone_blkg_association);
 #endif /* CONFIG_BLK_CGROUP */
--
2.17.1
Re: [PATCH] blkcg: add rcu lock to bio_clone_blkg_association()
On Fri, Dec 21, 2018 at 08:26:02AM -0700, Jens Axboe wrote:
> On 12/21/18 7:54 AM, Dennis Zhou wrote:
> > I cleaned up blkg_tryget_closest() to require rcu_read_lock() earlier.
> > However, this was a subtle case too which clearly was too subtle for me.
> > The idea was the src bio should be holding a ref to the blkg so rcu
> > wasn't technically needed. If it doesn't hold a ref, it should be %NULL
> > and the blkg->parent pointers are unused.
> >
> > This adds the appropriate read lock in bio_clone_blkg_association().
>
> Shall I just fold this with the previous? I staged it in a
> later-in-merge-cycle branch, so that's not an issue to amend.
>

Yeah that would be great!

Thanks,
Dennis
Re: [GIT PULL] percpu changes for v4.21-rc1
On Thu, Dec 27, 2018 at 11:04:32AM -0600, Dennis Zhou wrote:
> Hi Linus,
>
> Michael Cree noted generic UP Alpha has been broken since v3.18. This is
> a small fix for locking in UP percpu code that fixes the issue.
>
> Thanks,
> Dennis
>
> The following changes since commit 7566ec393f4161572ba6f11ad5171fd5d59b0fbd:
>
>   Linux 4.20-rc7 (2018-12-16 15:46:55 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-4.21
>
> for you to fetch changes up to 6ab7d47bcbf0144a8cb81536c2cead4cde18acfe:
>
>   percpu: convert spin_lock_irq to spin_lock_irqsave. (2018-12-18 09:04:08 -0800)
>
> ----------------------------------------------------------------
> Dennis Zhou (1):
>       percpu: convert spin_lock_irq to spin_lock_irqsave.
>
>  mm/percpu-km.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/percpu-km.c b/mm/percpu-km.c
> index 38de70ab1a0d..0f643dc2dc65 100644
> --- a/mm/percpu-km.c
> +++ b/mm/percpu-km.c
> @@ -50,6 +50,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
>  	struct pcpu_chunk *chunk;
>  	struct page *pages;
> +	unsigned long flags;
>  	int i;
>
>  	chunk = pcpu_alloc_chunk(gfp);
> @@ -68,9 +69,9 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  	chunk->data = pages;
>  	chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
>
> -	spin_lock_irq(&pcpu_lock);
> +	spin_lock_irqsave(&pcpu_lock, flags);
>  	pcpu_chunk_populated(chunk, 0, nr_pages, false);
> -	spin_unlock_irq(&pcpu_lock);
> +	spin_unlock_irqrestore(&pcpu_lock, flags);
>
>  	pcpu_stats_chunk_alloc();
>  	trace_percpu_create_chunk(chunk->base_addr);
>

Sigh, I missed +linux-kernel..

Thanks,
Dennis
Re: [PATCH] percpu: plumb gfp flag to pcpu_get_pages
Hi Shakeel,

On Fri, Dec 28, 2018 at 05:31:47PM -0800, Shakeel Butt wrote:
> __alloc_percpu_gfp() can be called from atomic context, so, make
> pcpu_get_pages use the gfp provided to the higher layer.
>
> Signed-off-by: Shakeel Butt
> ---
>  mm/percpu-vm.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index d8078de912de..4f42c4c5c902 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
>
>  /**
>   * pcpu_get_pages - get temp pages array
> + * @gfp: allocation flags passed to the underlying allocator
>   *
>   * Returns pointer to array of pointers to struct page which can be indexed
>   * with pcpu_page_idx().  Note that there is only one array and accesses
> @@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
>   * RETURNS:
>   * Pointer to temp pages array on success.
>   */
> -static struct page **pcpu_get_pages(void)
> +static struct page **pcpu_get_pages(gfp_t gfp)
>  {
>  	static struct page **pages;
>  	size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
> @@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
>  	lockdep_assert_held(&pcpu_alloc_mutex);
>
>  	if (!pages)
> -		pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
> +		pages = pcpu_mem_zalloc(pages_size, gfp);
>  	return pages;
>  }
>
> @@ -278,7 +279,7 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
>  {
>  	struct page **pages;
>
> -	pages = pcpu_get_pages();
> +	pages = pcpu_get_pages(gfp);
>  	if (!pages)
>  		return -ENOMEM;
>
> @@ -316,7 +317,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	 * successful population attempt so the temp pages array must
>  	 * be available now.
>  	 */
> -	pages = pcpu_get_pages();
> +	pages = pcpu_get_pages(GFP_KERNEL);
>  	BUG_ON(!pages);
>
>  	/* unmap and free */
> --
> 2.20.1.415.g653613c723-goog
>

Sorry, I'm travelling today and was hoping to respond to this later
tonight.

So percpu memory is a little different as it's an intermediary. When you
call __alloc_percpu_gfp() and it does not contain GFP_KERNEL, it is
considered atomic. So, we only service requests out of already populated
memory. pcpu_get_pages() is only called when we need to
populate/depopulate a chunk and will not be called if we need an atomic
allocation. Also, in all but the first case, it won't make an allocation
as pages is a static variable. Furthermore, percpu only plumbs through
certain gfp as not all make sense [1].

[1] https://lore.kernel.org/lkml/cover.1518668149.git.dennissz...@gmail.com/

Thanks,
Dennis
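A minimal sketch of the two call patterns described above, for illustration only (the helper name is made up): a GFP_KERNEL allocation may sleep and populate new backing pages, while a non-GFP_KERNEL ("atomic") allocation is served only from pages that are already populated and therefore never reaches pcpu_get_pages().

    #include <linux/percpu.h>
    #include <linux/gfp.h>
    #include <linux/types.h>

    static void example_percpu_gfp(void)
    {
    	void __percpu *sleeping, *atomic;

    	/* may sleep and populate new chunk pages */
    	sleeping = __alloc_percpu_gfp(sizeof(u64), __alignof__(u64),
    				      GFP_KERNEL);

    	/* served from already-populated pages; fails instead of populating */
    	atomic = __alloc_percpu_gfp(sizeof(u64), __alignof__(u64),
    				    GFP_NOWAIT | __GFP_NOWARN);

    	free_percpu(sleeping);
    	free_percpu(atomic);
    }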
Re: [PATCH] percpu: plumb gfp flag to pcpu_get_pages
Hi Andrew,

On Sat, Dec 29, 2018 at 01:03:52PM -0800, Andrew Morton wrote:
> On Fri, 28 Dec 2018 17:31:47 -0800 Shakeel Butt wrote:
>
> > __alloc_percpu_gfp() can be called from atomic context, so, make
> > pcpu_get_pages use the gfp provided to the higher layer.
>
> Does this fix any user-visible issues?

Sorry for not getting to this earlier. I'm currently traveling. I
responded on the patch itself. Do you mind unqueuing it?

I explain in more detail on the patch, but __alloc_percpu_gfp() will
never call pcpu_get_pages() when the gfp does not include GFP_KERNEL.

Thanks,
Dennis
[GIT PULL] percpu fixes for-4.19-rc8
Hi Greg,

The new percpu allocator introduced in 4.14 had a missing free for the
percpu metadata. This caused a memory leak when percpu memory is being
churned, resulting in the allocation and deallocation of percpu memory
chunks.

Thanks,
Dennis

The following changes since commit 0238df646e6224016a45505d2c111a24669ebe21:

  Linux 4.19-rc7 (2018-10-07 17:26:02 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-4.19-fixes

for you to fetch changes up to 6685b357363bfe295e3ae73665014db4aed62c58:

  percpu: stop leaking bitmap metadata blocks (2018-10-07 14:50:12 -0700)

Mike Rapoport (1):
      percpu: stop leaking bitmap metadata blocks

 mm/percpu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index a749d4d96e3e..4b90682623e9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1212,6 +1212,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
 	if (!chunk)
 		return;
+	pcpu_mem_free(chunk->md_blocks);
 	pcpu_mem_free(chunk->bound_map);
 	pcpu_mem_free(chunk->alloc_map);
 	pcpu_mem_free(chunk);
Re: [PATCH] percpu: stop leaking bitmap metadata blocks
Hi Mike,

On Sun, Oct 07, 2018 at 11:31:51AM +0300, Mike Rapoport wrote:
> The commit ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
> introduced bitmap metadata blocks. These metadata blocks are allocated
> whenever a new chunk is created, but they are never freed. Fix it.
>
> Fixes: ca460b3c9627 ("percpu: introduce bitmap metadata blocks")
>
> Signed-off-by: Mike Rapoport
> Cc: sta...@vger.kernel.org
> ---
>  mm/percpu.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index d21cb13..25104cd 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1212,6 +1212,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
>  {
>  	if (!chunk)
>  		return;
> +	pcpu_mem_free(chunk->md_blocks);
>  	pcpu_mem_free(chunk->bound_map);
>  	pcpu_mem_free(chunk->alloc_map);
>  	pcpu_mem_free(chunk);

Ah a bit of a boneheaded miss on my part.. Thanks for catching this!
I've applied it to for-4.19-fixes.

Thanks,
Dennis
[GIT PULL] percpu changes for v4.20-rc1
Hi Linus,

Two small things for v4.20. The first fixes a clang uninitialized
variable warning for arm64 where the default switch case calls
BUILD_BUG(). The second removes an unnecessary unlikely() in a WARN_ON()
use.

Thanks,
Dennis

The following changes since commit 11da3a7f84f19c26da6f86af878298694ede0804:

  Linux 4.19-rc3 (2018-09-09 17:26:43 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-4.20

for you to fetch changes up to b5bb425871186303e6936fa2581521bdd1964a58:

  arm64: percpu: Initialize ret in the default case (2018-09-25 13:26:48 -0700)

Igor Stoppa (1):
      mm: percpu: remove unnecessary unlikely()

Nathan Chancellor (1):
      arm64: percpu: Initialize ret in the default case

 arch/arm64/include/asm/percpu.h | 3 +++
 mm/percpu.c                     | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)
Re: linux-next: changes in the maintainership of the percpu tree
Hi Stephen,

On Tue, Sep 18, 2018 at 12:00:03AM +1000, Stephen Rothwell wrote:
> Hi Dennis,
>
> On Mon, 17 Sep 2018 07:53:29 -0400 Tejun Heo wrote:
> >
> > Ah, yeah, please switch the tree to Dennis's for-next branch instead of
> > mine.
> >
> > Thanks.
> >
> > On Mon, Sep 17, 2018 at 7:03 AM Stephen Rothwell wrote:
> >
> > > I noticed commit
> > >
> > >   1194c4154662 ("MAINTAINERS: Make Dennis the percpu tree maintainer")
> > >
> > > Just wondering if this means any changes to the tree/branch I should
> > > fetch or the contacts I should use for the percpu tree in linux-next.
>
> I have switched the percpu tree to be the for-next branch of
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git
>
> and added you to the contacts for problems with that tree.
>
> Thanks for adding your subsystem tree as a participant of linux-next. As
> you may know, this is not a judgement of your code. The purpose of
> linux-next is for integration testing and to lower the impact of
> conflicts between subsystems in the next merge window.
>
> You will need to ensure that the patches/commits in your tree/series have
> been:
>      * submitted under GPL v2 (or later) and include the Contributor's
>        Signed-off-by,
>      * posted to the relevant mailing list,
>      * reviewed by you (or another maintainer of your subsystem tree),
>      * successfully unit tested, and
>      * destined for the current or next Linux merge window.
>
> Basically, this should be just what you would send to Linus (or ask him
> to fetch). It is allowed to be rebased if you deem it necessary.
>

Thanks for switching it over and for the rundown!

Dennis
Re: [PATCH] percpu: improve percpu_alloc_percpu_fail event trace
Hello, On Mon, Jan 22, 2024 at 08:55:39PM -0500, Steven Rostedt wrote: > On Tue, 23 Jan 2024 09:44:43 +0800 > George Guo wrote: > > > There are two reasons of percpu_alloc failed without warnings: > > > > 1. do_warn is false > > 2. do_warn is true and warn_limit is reached the limit. > > Yes I know the reasons. > > > > > Showing do_warn and warn_limit makes things simple, maybe dont need > > kprobe again. > > It's up to the maintainers of that code to decide if it's worth it or not, > but honestly, my opinion it is not. > I agree, I don't think this is a worthwhile change. If we do change this, I'd like it to be more actionable in some way and as a result something we can fix or tune accordingly. George is this a common problem you're seeing? > The trace event in question is to trace that percpu_alloc failed and why. > It's not there to determine why it did not produce a printk message. > > -- Steve Thanks, Dennis
Re: [PATCH v4 0/4] percpu: partial chunk depopulation
On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote: > > On 20/04/21 4:27 am, Dennis Zhou wrote: > > On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote: > > > Hello, > > > > > > This series is a continuation of Roman's series in [1]. It aims to solve > > > chunks holding onto free pages by adding a reclaim process to the percpu > > > balance work item. > > > > > > The main difference is that the nr_empty_pop_pages is now managed at > > > time of isolation instead of intermixed. This helps with deciding which > > > chunks to free instead of having to interleave returning chunks to > > > active duty. > > > > > > The allocation priority is as follows: > > >1) appropriate chunk slot increasing until fit > > >2) sidelined chunks > > >3) full free chunks > > > > > > The last slot for to_depopulate is never used for allocations. > > > > > > A big thanks to Roman for initiating the work and being available for > > > iterating on these ideas. > > > > > > This patchset contains the following 4 patches: > > >0001-percpu-factor-out-pcpu_check_block_hint.patch > > >0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch > > >0003-percpu-implement-partial-chunk-depopulation.patch > > >0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch > > > > > > 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation > > > initially from Roman. 0004 adds a reclaim threshold so we do not need to > > > schedule for every page freed. > > > > > > This series is on top of percpu$for-5.14 67c2669d69fb. > > > > > > diffstats below: > > > > > > Dennis Zhou (2): > > >percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 > > >percpu: use reclaim threshold instead of running for every page > > > > > > Roman Gushchin (2): > > >percpu: factor out pcpu_check_block_hint() > > >percpu: implement partial chunk depopulation > > > > > > mm/percpu-internal.h | 5 + > > > mm/percpu-km.c | 5 + > > > mm/percpu-stats.c| 20 ++-- > > > mm/percpu-vm.c | 30 ++ > > > mm/percpu.c | 252 ++- > > > 5 files changed, 278 insertions(+), 34 deletions(-) > > > > > > Thanks, > > > Dennis > > Hello Pratik, > > > > Do you mind testing this series again on POWER9? The base is available > > here: > > https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14 > > > > Thanks, > > Dennis > > Hello Dennis, I have tested this patchset on POWER9. > > I have tried variations of the percpu_test in the top level and nested cgroups > creation as the test with 1000:10 didn't show any benefits. This is most likely because the 1 in every 11 still pins every page while 1 in 50 does not. Can you try the patch below on top? I think it may show slightly better perf as well. If it doesn't I'll just drop it. > > The following example shows more consistent benefits with the de-allocation > strategy. > Outer: 1000 > Inner: 50 > # ./percpu_test.sh > Percpu: 6912 kB > Percpu: 532736 kB > Percpu: 278784 kB > > I believe it could be a result of bulk freeing within > "free_unref_page_commit", > where pages are only free'd if pcp->count >= pcp->high. As POWER has a larger > page size it would end up creating lesser number of pages but with the > effects of fragmentation. This is unrelated to per cpu pages in slab/slub. Percpu is a separate memory allocator. > > Having said that, the patchset and its behavior does look good to me. Thanks, can I throw the following on the appropriate patches? In the future it's good to be explicit about this because some prefer to credit different emails. 
Tested-by: Pratik Sampat Thanks, Dennis The following may do a little better on power9: --- >From a1464c4d5900cca68fd95b935178d72bb74837d5 Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Tue, 20 Apr 2021 14:25:20 + Subject: [PATCH] percpu: convert free page float to bytes The percpu memory allocator keeps around a minimum number of free pages to ensure we can satisfy atomic allocations. However, we've always kept this number in terms of pages. On certain architectures like arm and powerpc, the default page size could be 64k instead of 4k. So, start with a target number of free bytes and then co
[PATCH] percpu: fix clang modpost section mismatch
pcpu_build_alloc_info() is an __init function that makes a call to cpumask_clear_cpu(). With CONFIG_GCOV_PROFILE_ALL enabled, the inline heuristics are modified and such cpumask_clear_cpu() which is marked inline doesn't get inlined. Because it works on mask in __initdata, modpost throws a section mismatch error. Arnd sent a patch with the flatten attribute as an alternative [2]. I've added it to compiler_attributes.h. modpost complaint: WARNING: modpost: vmlinux.o(.text+0x735425): Section mismatch in reference from the function cpumask_clear_cpu() to the variable .init.data:pcpu_build_alloc_info.mask The function cpumask_clear_cpu() references the variable __initdata pcpu_build_alloc_info.mask. This is often because cpumask_clear_cpu lacks a __initdata annotation or the annotation of pcpu_build_alloc_info.mask is wrong. clang output: mm/percpu.c:2724:5: remark: cpumask_clear_cpu not inlined into pcpu_build_alloc_info because too costly to inline (cost=725, threshold=325) [-Rpass-missed=inline] [1] https://lore.kernel.org/linux-mm/202012220454.9f6bkz9q-...@intel.com/ [2] https://lore.kernel.org/lkml/cak8p3a2zwfnexksm8k_suhhwkor17jfo3xaplxjzfpqx0eu...@mail.gmail.com/ Reported-by: kernel test robot Cc: Arnd Bergmann Cc: Nick Desaulniers Signed-off-by: Dennis Zhou --- include/linux/compiler_attributes.h | 6 ++ mm/percpu.c | 2 +- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h index ea5e04e75845..c043b8d2b17b 100644 --- a/include/linux/compiler_attributes.h +++ b/include/linux/compiler_attributes.h @@ -210,6 +210,12 @@ # define fallthroughdo {} while (0) /* fallthrough */ #endif +/* + * gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes + * clang: https://clang.llvm.org/docs/AttributeReference.html#flatten + */ +# define __flatten __attribute__((flatten)) + /* * Note the missing underscores. * diff --git a/mm/percpu.c b/mm/percpu.c index 80f8f885a990..6596a0a4286e 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2663,7 +2663,7 @@ early_param("percpu_alloc", percpu_alloc_setup); * On success, pointer to the new allocation_info is returned. On * failure, ERR_PTR value is returned. */ -static struct pcpu_alloc_info * __init pcpu_build_alloc_info( +static struct pcpu_alloc_info * __init __flatten pcpu_build_alloc_info( size_t reserved_size, size_t dyn_size, size_t atom_size, pcpu_fc_cpu_distance_fn_t cpu_distance_fn) -- 2.30.0.478.g8a0d178c01-goog
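As a side note, a minimal standalone example of what the flatten attribute does (a sketch with made-up function names, unrelated to the kernel code above): calls made directly in a flatten-marked function are inlined into its body where possible, which is how the __initdata reference ends up attributed to the __init caller.

#include <stdio.h>

static int add_one(int x)
{
	return x + 1;
}

/* every call made directly in this function body is inlined if possible */
__attribute__((flatten))
static int do_work(int x)
{
	return add_one(x) * 2;	/* add_one() is folded into do_work() */
}

int main(void)
{
	printf("%d\n", do_work(20));	/* prints 42 */
	return 0;
}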
Re: [PATCH v2 5/5] percpu: implement partial chunk depopulation
Hello, On Wed, Apr 07, 2021 at 11:26:18AM -0700, Roman Gushchin wrote: > This patch implements partial depopulation of percpu chunks. > > As now, a chunk can be depopulated only as a part of the final > destruction, if there are no more outstanding allocations. However > to minimize a memory waste it might be useful to depopulate a > partially filed chunk, if a small number of outstanding allocations > prevents the chunk from being fully reclaimed. > > This patch implements the following depopulation process: it scans > over the chunk pages, looks for a range of empty and populated pages > and performs the depopulation. To avoid races with new allocations, > the chunk is previously isolated. After the depopulation the chunk is > sidelined to a special list or freed. New allocations can't be served > using a sidelined chunk. The chunk can be moved back to a corresponding > slot if there are not enough chunks with empty populated pages. > > The depopulation is scheduled on the free path. Is the chunk: > 1) has more than 1/4 of total pages free and populated > 2) the system has enough free percpu pages aside of this chunk > 3) isn't the reserved chunk > 4) isn't the first chunk > 5) isn't entirely free > it's a good target for depopulation. If it's already depopulated > but got free populated pages, it's a good target too. > The chunk is moved to a special pcpu_depopulate_list, chunk->isolate > flag is set and the async balancing is scheduled. > > The async balancing moves pcpu_depopulate_list to a local list > (because pcpu_depopulate_list can be changed when pcpu_lock is > releases), and then tries to depopulate each chunk. The depopulation > is performed in the reverse direction to keep populated pages close to > the beginning, if the global number of empty pages is reached. > Depopulated chunks are sidelined to prevent further allocations. > Skipped and fully empty chunks are returned to the corresponding slot. > > On the allocation path, if there are no suitable chunks found, > the list of sidelined chunks in scanned prior to creating a new chunk. > If there is a good sidelined chunk, it's placed back to the slot > and the scanning is restarted. > > Many thanks to Dennis Zhou for his great ideas and a very constructive > discussion which led to many improvements in this patchset! > > Signed-off-by: Roman Gushchin > --- > mm/percpu-internal.h | 2 + > mm/percpu.c | 164 ++- > 2 files changed, 164 insertions(+), 2 deletions(-) > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index 095d7eaa0db4..8e432663c41e 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -67,6 +67,8 @@ struct pcpu_chunk { > > void*data; /* chunk data */ > boolimmutable; /* no [de]population allowed */ > + boolisolated; /* isolated from chunk slot > lists */ > + booldepopulated;/* sidelined after depopulation > */ > int start_offset; /* the overlap with the previous > region to have a page aligned > base_addr */ > diff --git a/mm/percpu.c b/mm/percpu.c > index e20119668c42..0a5a5e84e0a4 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -181,6 +181,19 @@ static LIST_HEAD(pcpu_map_extend_chunks); > */ > int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES]; > > +/* > + * List of chunks with a lot of free pages. Used to depopulate them > + * asynchronously. > + */ > +static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES]; > + > +/* > + * List of previously depopulated chunks. They are not usually used for new > + * allocations, but can be returned back to service if a need arises. 
> + */ > +static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES]; > + > + > /* > * The number of populated pages in use by the allocator, protected by > * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page > gets > @@ -542,6 +555,12 @@ static void pcpu_chunk_relocate(struct pcpu_chunk > *chunk, int oslot) > { > int nslot = pcpu_chunk_slot(chunk); > > + /* > + * Keep isolated and depopulated chunks on a sideline. > + */ > + if (chunk->isolated || chunk->depopulated) > + return; > + > if (oslot != nslot) > __pcpu_chunk_move(chunk, nslot, oslot < nslot); > } > @@ -1778,6 +1797,25 @@ static void __percpu *pcpu_alloc(size_t size, size_t > align
[PATCH 00/12] introduce percpu block scan_hint
Hi everyone,

It was reported a while ago [1] that an increase in allocation alignment requirement [2] caused the percpu memory allocator to do significantly more work. After spending quite a bit of time diving into it, it seems the crux was the following:
  1) chunk management by free_bytes caused allocations to scan over chunks that could not fit due to fragmentation
  2) per block fragmentation required scanning from an early first_free bit causing allocations to repeat work

This series introduces a scan_hint for pcpu_block_md and merges the paths used to manage the hints. The scan_hint represents the largest known free area prior to the contig_hint. There are some caveats to this. First, it may not necessarily be the largest area as we do partial updates based on freeing of regions and failed scanning in pcpu_alloc_area(). Second, if contig_hint == scan_hint, then scan_hint_start > contig_hint_start is possible. This is necessary for scan_hint discovery when refreshing the hint of a block.

A necessary change is to enforce a block to be the size of a page. This lets the management of nr_empty_pop_pages be done by breaking and making full contig_hints in the hint update paths. Prior, this was done by piggybacking off of refreshing the chunk contig_hint as it performed a full scan and counted empty full pages.

The following are the results found using the workload provided in [3].

  branch       | time
  5.0-rc7      | 69s
  [2] reverted | 44s
  scan_hint    | 39s

The times above represent the approximate average across multiple runs. I tested based on a basic 1M 16-byte allocation pattern with no alignment requirement and times did not differ between 5.0-rc7 and scan_hint.

[1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=kz3hdcwp...@mail.gmail.com/
[2] https://lore.kernel.org/netdev/20181116154329.247947-1-eduma...@google.com/
[3] https://lore.kernel.org/netdev/vbfzhrj9smb@mellanox.com/

This patchset contains the following 12 patches:
  0001-percpu-update-free-path-with-correct-new-free-region.patch
  0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch
  0003-percpu-introduce-helper-to-determine-if-two-regions-.patch
  0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch
  0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch
  0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch
  0007-percpu-add-block-level-scan_hint.patch
  0008-percpu-remember-largest-area-skipped-during-allocati.patch
  0009-percpu-use-block-scan_hint-to-only-scan-forward.patch
  0010-percpu-make-pcpu_block_md-generic.patch
  0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch
  0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch

0001 fixes an issue where the chunk contig_hint was being updated improperly with the new region's starting offset and possibly differing contig_hint. 0002 fixes possibly scanning past the end of the bitmap. 0003 introduces a helper to do region overlap comparison. 0004 switches to chunk management by contig_hint rather than free_bytes. 0005 moves chunks that fail to allocate to the empty block list to prevent excess scanning of chunks with small contig_hints and poor alignment. 0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and modifies nr_empty_pop_pages management to be a part of the hint updates. 0007-0009 introduce the percpu block scan_hint. 0010 makes pcpu_block_md generic so chunk hints can be managed as a pcpu_block_md responsible for more bits. 0011-0012 add chunk scan_hints.
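For readers unfamiliar with the hint scheme, here is a toy userspace sketch of the idea (an illustration only; the names and the byte-per-slot map are made up and are not the kernel's bitmap code): the scan_hint lets a first-fit scan start past an early free area that is already known to be too small for the request.

#include <stdio.h>

#define NBITS 32

/* return the offset of the first run of 'want' free (zero) slots at or
 * after 'start', or -1 if there is none */
static int first_fit(const int *map, int start, int want)
{
	int run = 0;

	for (int i = start; i < NBITS; i++) {
		run = map[i] ? 0 : run + 1;
		if (run == want)
			return i - want + 1;
	}
	return -1;
}

int main(void)
{
	int map[NBITS];
	int first_free = 2, scan_hint_start = 2, scan_hint = 2;
	int want = 4;
	int start;

	/* fragmented block: a 2-slot hole at 2..3 and an 8-slot hole at 20..27 */
	for (int i = 0; i < NBITS; i++)
		map[i] = 1;
	map[2] = map[3] = 0;
	for (int i = 20; i < 28; i++)
		map[i] = 0;

	/* plain first fit: walk from first_free and re-scan the small hole */
	printf("from first_free: %d\n", first_fit(map, first_free, want));

	/* with a scan_hint: the 2-slot area cannot fit 4 slots, skip past it */
	start = want > scan_hint ? scan_hint_start + scan_hint : first_free;
	printf("from scan_hint:  %d\n", first_fit(map, start, want));

	return 0;
}

Both calls find offset 20; the second one simply avoids rescanning the region the hint already rules out.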
This patchset is on top of percpu#master a3b22b9f11d9.

diffstats below:

Dennis Zhou (12):
  percpu: update free path with correct new free region
  percpu: do not search past bitmap when allocating an area
  percpu: introduce helper to determine if two regions overlap
  percpu: manage chunks based on contig_bits instead of free_bytes
  percpu: relegate chunks unusable when failing small allocations
  percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
  percpu: add block level scan_hint
  percpu: remember largest area skipped during allocation
  percpu: use block scan_hint to only scan forward
  percpu: make pcpu_block_md generic
  percpu: convert chunk hints to be based on pcpu_block_md
  percpu: use chunk scan_hint to skip some scanning

 include/linux/percpu.h | 12 +-
 mm/percpu-internal.h   | 15 +-
 mm/percpu-km.c         |  2 +-
 mm/percpu-stats.c      |  5 +-
 mm/percpu.c            | 547 +
 5 files changed, 404 insertions(+), 177 deletions(-)

Thanks,
Dennis
[PATCH 01/12] percpu: update free path with correct new free region
When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block. Signed-off-by: Dennis Zhou --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index db86282fd024..53bd79a617b1 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -871,7 +871,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, pcpu_chunk_refresh_hint(chunk); else pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start), - s_block->contig_hint); + end - start); } /** -- 2.17.1
[PATCH 06/12] percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacks off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou --- include/linux/percpu.h | 12 ++--- mm/percpu-km.c | 2 +- mm/percpu.c| 111 + 3 files changed, 49 insertions(+), 76 deletions(-) diff --git a/include/linux/percpu.h b/include/linux/percpu.h index 70b7123f38c7..9909dc0e273a 100644 --- a/include/linux/percpu.h +++ b/include/linux/percpu.h @@ -26,16 +26,10 @@ #define PCPU_MIN_ALLOC_SHIFT 2 #define PCPU_MIN_ALLOC_SIZE(1 << PCPU_MIN_ALLOC_SHIFT) -/* number of bits per page, used to trigger a scan if blocks are > PAGE_SIZE */ -#define PCPU_BITS_PER_PAGE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT) - /* - * This determines the size of each metadata block. There are several subtle - * constraints around this constant. The reserved region must be a multiple of - * PCPU_BITMAP_BLOCK_SIZE. Additionally, PCPU_BITMAP_BLOCK_SIZE must be a - * multiple of PAGE_SIZE or PAGE_SIZE must be a multiple of - * PCPU_BITMAP_BLOCK_SIZE to align with the populated page map. The unit_size - * also has to be a multiple of PCPU_BITMAP_BLOCK_SIZE to ensure full blocks. + * The PCPU_BITMAP_BLOCK_SIZE must be the same size as PAGE_SIZE as the + * updating of hints is used to manage the nr_empty_pop_pages in both + * the chunk and globally. */ #define PCPU_BITMAP_BLOCK_SIZE PAGE_SIZE #define PCPU_BITMAP_BLOCK_BITS (PCPU_BITMAP_BLOCK_SIZE >> \ diff --git a/mm/percpu-km.c b/mm/percpu-km.c index 0f643dc2dc65..c10bf7466596 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -70,7 +70,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) chunk->base_addr = page_address(pages) - pcpu_group_offsets[0]; spin_lock_irqsave(&pcpu_lock, flags); - pcpu_chunk_populated(chunk, 0, nr_pages, false); + pcpu_chunk_populated(chunk, 0, nr_pages); spin_unlock_irqrestore(&pcpu_lock, flags); pcpu_stats_chunk_alloc(); diff --git a/mm/percpu.c b/mm/percpu.c index 3d7deece9556..967c9cc3a928 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -527,37 +527,21 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) __pcpu_chunk_move(chunk, nslot, oslot < nslot); } -/** - * pcpu_cnt_pop_pages- counts populated backing pages in range +/* + * pcpu_update_empty_pages - update empty page counters * @chunk: chunk of interest - * @bit_off: start offset - * @bits: size of area to check + * @nr: nr of empty pages * - * Calculates the number of populated pages in the region - * [page_start, page_end). This keeps track of how many empty populated - * pages are available and decide if async work should be scheduled. - * - * RETURNS: - * The nr of populated pages. + * This is used to keep track of the empty pages now based on the premise + * a pcpu_block_md covers a page. The hint update functions recognize if + * a block is made full or broken to calculate deltas for keeping track of + * free pages. 
*/ -static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off, -int bits) +static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr) { - int page_start = PFN_UP(bit_off * PCPU_MIN_ALLOC_SIZE); - int page_end = PFN_DOWN((bit_off + bits) * PCPU_MIN_ALLOC_SIZE); - - if (page_start >= page_end) - return 0; - - /* -* bitmap_weight counts the number of bits set in a bitmap up to -* the specified number of bits. This is counting the populated -* pages up to page_end and then subtracting the populated pages -* up to page_start to count the populated pages in -* [page_start, page_end). -*/ - return bitmap_weight(chunk->populated, page_end) - - bitmap_weight(chunk->populated, page_start); + chunk->nr_empty_pop_pages += nr; + if (chunk != pcpu_reserved_chunk) + pcpu_nr_empty_pop_pages += nr; } /* @@ -611,36 +595,19 @@ static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits) * Updates: * chunk->contig_bits * chunk->contig_bits_start - * nr_empty_pop_pages (chunk and global) */ static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) { -
[PATCH 09/12] percpu: use block scan_hint to only scan forward
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou --- mm/percpu.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index dac18968d79f..e51c151ed692 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -765,14 +765,23 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) { struct pcpu_block_md *block = chunk->md_blocks + index; unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); - int rs, re; /* region start, region end */ + int rs, re, start; /* region start, region end */ + + /* promote scan_hint to contig_hint */ + if (block->scan_hint) { + start = block->scan_hint_start + block->scan_hint; + block->contig_hint_start = block->scan_hint_start; + block->contig_hint = block->scan_hint; + block->scan_hint = 0; + } else { + start = block->first_free; + block->contig_hint = 0; + } - /* clear hints */ - block->contig_hint = block->scan_hint = 0; - block->left_free = block->right_free = 0; + block->right_free = 0; /* iterate over free areas and update the contig hints */ - pcpu_for_each_unpop_region(alloc_map, rs, re, block->first_free, + pcpu_for_each_unpop_region(alloc_map, rs, re, start, PCPU_BITMAP_BLOCK_BITS) { pcpu_block_update(block, rs, re); } @@ -837,6 +846,8 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, s_off, s_off + bits)) { /* block contig hint is broken - scan to fix it */ + if (!s_off) + s_block->left_free = 0; pcpu_block_refresh_hint(chunk, s_index); } else { /* update left and right contig manually */ @@ -870,11 +881,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, if (e_off > e_block->scan_hint_start) e_block->scan_hint = 0; + e_block->left_free = 0; if (e_off > e_block->contig_hint_start) { /* contig hint is broken - scan to fix it */ pcpu_block_refresh_hint(chunk, e_index); } else { - e_block->left_free = 0; e_block->right_free = min_t(int, e_block->right_free, PCPU_BITMAP_BLOCK_BITS - e_off); -- 2.17.1
[PATCH 10/12] percpu: make pcpu_block_md generic
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one in the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 1 + mm/percpu.c | 20 +--- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index ec58b244545d..119bd1119aa7 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -28,6 +28,7 @@ struct pcpu_block_md { int right_free; /* size of free space along the right side of the block */ int first_free; /* block position of first free */ + int nr_bits;/* total bits responsible for */ }; struct pcpu_chunk { diff --git a/mm/percpu.c b/mm/percpu.c index e51c151ed692..7cdf14c242de 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -658,7 +658,7 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) if (start == 0) block->left_free = contig; - if (end == PCPU_BITMAP_BLOCK_BITS) + if (end == block->nr_bits) block->right_free = contig; if (contig > block->contig_hint) { @@ -1271,18 +1271,24 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) pcpu_chunk_relocate(chunk, oslot); } +static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits) +{ + block->scan_hint = 0; + block->contig_hint = nr_bits; + block->left_free = nr_bits; + block->right_free = nr_bits; + block->first_free = 0; + block->nr_bits = nr_bits; +} + static void pcpu_init_md_blocks(struct pcpu_chunk *chunk) { struct pcpu_block_md *md_block; for (md_block = chunk->md_blocks; md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk); -md_block++) { - md_block->scan_hint = 0; - md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS; - md_block->left_free = PCPU_BITMAP_BLOCK_BITS; - md_block->right_free = PCPU_BITMAP_BLOCK_BITS; - } +md_block++) + pcpu_init_md_block(md_block, PCPU_BITMAP_BLOCK_BITS); } /** -- 2.17.1
[PATCH 02/12] percpu: do not search past bitmap when allocating an area
pcpu_find_block_fit() guarantees that a fit is found within PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it compares against the block's contig_hint. This can lead to incorrectly scanning past the end of the bitmap. The behavior was okay given the check after for bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). This patch fixes this by bounding the end offset by the number of bits in a chunk. Signed-off-by: Dennis Zhou --- mm/percpu.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index 53bd79a617b1..69ca51d238b5 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits, /* * Search to find a fit. */ - end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS; + end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS, + pcpu_chunk_map_bits(chunk)); bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start, alloc_bits, align_mask); if (bit_off >= end) -- 2.17.1
[PATCH 08/12] percpu: remember largest area skipped during allocation
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes. Signed-off-by: Dennis Zhou --- mm/percpu.c | 101 ++-- 1 file changed, 99 insertions(+), 2 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index df1aacf58ac8..dac18968d79f 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -716,6 +716,43 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) } } +/* + * pcpu_block_update_scan - update a block given a free area from a scan + * @chunk: chunk of interest + * @bit_off: chunk offset + * @bits: size of free area + * + * Finding the final allocation spot first goes through pcpu_find_block_fit() + * to find a block that can hold the allocation and then pcpu_alloc_area() + * where a scan is used. When allocations require specific alignments, + * we can inadvertently create holes which will not be seen in the alloc + * or free paths. + * + * This takes a given free area hole and updates a block as it may change the + * scan_hint. We need to scan backwards to ensure we don't miss free bits + * from alignment. + */ +static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off, + int bits) +{ + int s_off = pcpu_off_to_block_off(bit_off); + int e_off = s_off + bits; + int s_index, l_bit; + struct pcpu_block_md *block; + + if (e_off > PCPU_BITMAP_BLOCK_BITS) + return; + + s_index = pcpu_off_to_block_index(bit_off); + block = chunk->md_blocks + s_index; + + /* scan backwards in case of alignment skipping free bits */ + l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), s_off); + s_off = (s_off == l_bit) ? 0 : l_bit + 1; + + pcpu_block_update(block, s_off, e_off); +} + /** * pcpu_block_refresh_hint * @chunk: chunk of interest @@ -1064,6 +1101,62 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits, return bit_off; } +/* + * pcpu_find_zero_area - modified from bitmap_find_next_zero_area_off + * @map: the address to base the search on + * @size: the bitmap size in bits + * @start: the bitnumber to start searching at + * @nr: the number of zeroed bits we're looking for + * @align_mask: alignment mask for zero area + * @largest_off: offset of the largest area skipped + * @largest_bits: size of the largest area skipped + * + * The @align_mask should be one less than a power of 2. + * + * This is a modified version of bitmap_find_next_zero_area_off() to remember + * the largest area that was skipped. This is imperfect, but in general is + * good enough. The largest remembered region is the largest failed region + * seen. This does not include anything we possibly skipped due to alignment. + * pcpu_block_update_scan() does scan backwards to try and recover what was + * lost to alignment. While this can cause scanning to miss earlier possible + * free areas, smaller allocations will eventually fill those holes. 
+ */ +static unsigned long pcpu_find_zero_area(unsigned long *map, +unsigned long size, +unsigned long start, +unsigned long nr, +unsigned long align_mask, +unsigned long *largest_off, +unsigned long *largest_bits) +{ + unsigned long index, end, i, area_off, area_bits; +again: + index = find_next_zero_bit(map, size, start); + + /* Align allocation */ + index = __ALIGN_MASK(index, align_mask); + area_off = index; + + end = index + nr; + if (end > size) + return end; + i = find_next_bit(map, end, index); + if (i < end) { + area_bits = i - area_off; + /* remember largest unused area with best alignment */ + if (area_bits > *largest_bits || + (area_bits == *largest_bits && *largest_off && +(!area_off || __ffs(area_off) > __ffs(*largest_off { + *largest_off = area_off; + *largest_bits = area_bits; + } + + start = i + 1; +
[PATCH 05/12] percpu: relegate chunks unusable when failing small allocations
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou --- mm/percpu.c | 35 ++- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index c996bcffbb2a..3d7deece9556 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -94,6 +94,8 @@ /* the slots are sorted by free bytes left, 1-31 bytes share the same slot */ #define PCPU_SLOT_BASE_SHIFT 5 +/* chunks in slots below this are subject to being sidelined on failed alloc */ +#define PCPU_SLOT_FAIL_THRESHOLD 3 #define PCPU_EMPTY_POP_PAGES_LOW 2 #define PCPU_EMPTY_POP_PAGES_HIGH 4 @@ -488,6 +490,22 @@ static void pcpu_mem_free(void *ptr) kvfree(ptr); } +static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, + bool move_front) +{ + if (chunk != pcpu_reserved_chunk) { + if (move_front) + list_move(&chunk->list, &pcpu_slot[slot]); + else + list_move_tail(&chunk->list, &pcpu_slot[slot]); + } +} + +static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot) +{ + __pcpu_chunk_move(chunk, slot, true); +} + /** * pcpu_chunk_relocate - put chunk in the appropriate chunk slot * @chunk: chunk of interest @@ -505,12 +523,8 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) { int nslot = pcpu_chunk_slot(chunk); - if (chunk != pcpu_reserved_chunk && oslot != nslot) { - if (oslot < nslot) - list_move(&chunk->list, &pcpu_slot[nslot]); - else - list_move_tail(&chunk->list, &pcpu_slot[nslot]); - } + if (oslot != nslot) + __pcpu_chunk_move(chunk, nslot, oslot < nslot); } /** @@ -1381,7 +1395,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; bool do_warn = !(gfp & __GFP_NOWARN); static int warn_limit = 10; - struct pcpu_chunk *chunk; + struct pcpu_chunk *chunk, *next; const char *err; int slot, off, cpu, ret; unsigned long flags; @@ -1443,11 +1457,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, restart: /* search through normal chunks */ for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) { - list_for_each_entry(chunk, &pcpu_slot[slot], list) { + list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) { off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic); - if (off < 0) + if (off < 0) { + if (slot < PCPU_SLOT_FAIL_THRESHOLD) + pcpu_chunk_move(chunk, 0); continue; + } off = pcpu_alloc_area(chunk, bits, bit_align, off); if (off >= 0) -- 2.17.1
[PATCH 03/12] percpu: introduce helper to determine if two regions overlap
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. Signed-off-by: Dennis Zhou --- mm/percpu.c | 31 +++ 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 69ca51d238b5..b40112b2fc59 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -546,6 +546,24 @@ static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off, bitmap_weight(chunk->populated, page_start); } +/* + * pcpu_region_overlap - determines if two regions overlap + * @a: start of first region, inclusive + * @b: end of first region, exclusive + * @x: start of second region, inclusive + * @y: end of second region, exclusive + * + * This is used to determine if the hint region [a, b) overlaps with the + * allocated region [x, y). + */ +static inline bool pcpu_region_overlap(int a, int b, int x, int y) +{ + if ((x >= a && x < b) || (y > a && y <= b) || + (x <= a && y >= b)) + return true; + return false; +} + /** * pcpu_chunk_update - updates the chunk metadata given a free area * @chunk: chunk of interest @@ -710,8 +728,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, PCPU_BITMAP_BLOCK_BITS, s_off + bits); - if (s_off >= s_block->contig_hint_start && - s_off < s_block->contig_hint_start + s_block->contig_hint) { + if (pcpu_region_overlap(s_block->contig_hint_start, + s_block->contig_hint_start + + s_block->contig_hint, + s_off, + s_off + bits)) { /* block contig hint is broken - scan to fix it */ pcpu_block_refresh_hint(chunk, s_index); } else { @@ -764,8 +785,10 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, * contig hint is broken. Otherwise, it means a smaller space * was used and therefore the chunk contig hint is still correct. */ - if (bit_off >= chunk->contig_bits_start && - bit_off < chunk->contig_bits_start + chunk->contig_bits) + if (pcpu_region_overlap(chunk->contig_bits_start, + chunk->contig_bits_start + chunk->contig_bits, + bit_off, + bit_off + bits)) pcpu_chunk_refresh_hint(chunk); } -- 2.17.1
[PATCH 11/12] percpu: convert chunk hints to be based on pcpu_block_md
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 5 +- mm/percpu-stats.c| 5 +- mm/percpu.c | 120 +++ 3 files changed, 57 insertions(+), 73 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 119bd1119aa7..0468ba500bd4 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -39,9 +39,7 @@ struct pcpu_chunk { struct list_headlist; /* linked to pcpu_slot lists */ int free_bytes; /* free bytes in the chunk */ - int contig_bits;/* max contiguous size hint */ - int contig_bits_start; /* contig_bits starting - offset */ + struct pcpu_block_mdchunk_md; void*base_addr; /* base address of this chunk */ unsigned long *alloc_map; /* allocation map */ @@ -49,7 +47,6 @@ struct pcpu_chunk { struct pcpu_block_md*md_blocks; /* metadata blocks */ void*data; /* chunk data */ - int first_bit; /* no free below this */ boolimmutable; /* no [de]population allowed */ int start_offset; /* the overlap with the previous region to have a page aligned diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index b5fdd43b60c9..ef5034a0464e 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -53,6 +53,7 @@ static int find_max_nr_alloc(void) static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, int *buffer) { + struct pcpu_block_md *chunk_md = &chunk->chunk_md; int i, last_alloc, as_len, start, end; int *alloc_sizes, *p; /* statistics */ @@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("nr_alloc", chunk->nr_alloc); P("max_alloc_size", chunk->max_alloc_size); P("empty_pop_pages", chunk->nr_empty_pop_pages); - P("first_bit", chunk->first_bit); + P("first_bit", chunk_md->first_free); P("free_bytes", chunk->free_bytes); - P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); + P("contig_bytes", chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE); P("sum_frag", sum_frag); P("max_frag", max_frag); P("cur_min_alloc", cur_min_alloc); diff --git a/mm/percpu.c b/mm/percpu.c index 7cdf14c242de..197479f2c489 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size) static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) { - if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0) + const struct pcpu_block_md *chunk_md = &chunk->chunk_md; + + if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || + chunk_md->contig_hint == 0) return 0; - return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); + return pcpu_size_to_slot(chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE); } /* set the pointer to a chunk in a page struct */ @@ -592,54 +595,6 @@ static inline bool pcpu_region_overlap(int a, int b, int x, int y) return false; } -/** - * pcpu_chunk_update - updates the chunk metadata given a free area - * @chunk: chunk of interest - * @bit_off: chunk offset - * @bits: size of free area - * - * This updates the chunk's contig hint and starting offset given a free area. - * Choose the best starting offset if the contig hint is equal. 
- */ -static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits) -{ - if (bits > chunk->contig_bits) { - chunk->contig_bits_start = bit_off; - chunk->contig_bits = bits; - } else if (bits == chunk->contig_bits && chunk->contig_bits_start && - (!bit_off || - __ffs(bit_off) > __ffs(chunk->contig_bits_start))) { - /* use the start with the best alignment */ - chunk->contig_bits_start = bit_off; - } -} - -/** - * pcpu_chunk_refresh_hint - updates metadata about a chunk - * @chunk: chunk of interest - * - * Iterates over the metadata blocks to find the largest contig area. - * It also counts the popula
[PATCH 04/12] percpu: manage chunks based on contig_bits instead of free_bytes
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou --- mm/percpu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu.c b/mm/percpu.c index b40112b2fc59..c996bcffbb2a 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0) return 0; - return pcpu_size_to_slot(chunk->free_bytes); + return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); } /* set the pointer to a chunk in a page struct */ -- 2.17.1
[PATCH 12/12] percpu: use chunk scan_hint to skip some scanning
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou --- mm/percpu.c | 36 +++- 1 file changed, 27 insertions(+), 9 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 197479f2c489..40d49d7fb286 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -711,20 +711,31 @@ static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off, /** * pcpu_chunk_refresh_hint - updates metadata about a chunk * @chunk: chunk of interest + * @full_scan: if we should scan from the beginning * * Iterates over the metadata blocks to find the largest contig area. - * It also counts the populated pages and uses the delta to update the - * global count. + * A full scan can be avoided on the allocation path as this is triggered + * if we broke the contig_hint. In doing so, the scan_hint will be before + * the contig_hint or after if the scan_hint == contig_hint. This cannot + * be prevented on freeing as we want to find the largest area possibly + * spanning blocks. */ -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan) { struct pcpu_block_md *chunk_md = &chunk->chunk_md; int bit_off, bits; - /* clear metadata */ - chunk_md->contig_hint = 0; + /* promote scan_hint to contig_hint */ + if (!full_scan && chunk_md->scan_hint) { + bit_off = chunk_md->scan_hint_start + chunk_md->scan_hint; + chunk_md->contig_hint_start = chunk_md->scan_hint_start; + chunk_md->contig_hint = chunk_md->scan_hint; + chunk_md->scan_hint = 0; + } else { + bit_off = chunk_md->first_free; + chunk_md->contig_hint = 0; + } - bit_off = chunk_md->first_free; bits = 0; pcpu_for_each_md_free_region(chunk, bit_off, bits) { pcpu_block_update(chunk_md, bit_off, bit_off + bits); @@ -884,6 +895,13 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, if (nr_empty_pages) pcpu_update_empty_pages(chunk, -1 * nr_empty_pages); + if (pcpu_region_overlap(chunk_md->scan_hint_start, + chunk_md->scan_hint_start + + chunk_md->scan_hint, + bit_off, + bit_off + bits)) + chunk_md->scan_hint = 0; + /* * The only time a full chunk scan is required is if the chunk * contig hint is broken. Otherwise, it means a smaller space @@ -894,7 +912,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, chunk_md->contig_hint, bit_off, bit_off + bits)) - pcpu_chunk_refresh_hint(chunk); + pcpu_chunk_refresh_hint(chunk, false); } /** @@ -1005,7 +1023,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, * the else condition below. */ if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index) - pcpu_chunk_refresh_hint(chunk); + pcpu_chunk_refresh_hint(chunk, true); else pcpu_block_update(&chunk->chunk_md, pcpu_block_off_to_off(s_index, start), @@ -1078,7 +1096,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits, if (bit_off + alloc_bits > chunk_md->contig_hint) return -1; - bit_off = chunk_md->first_free; + bit_off = pcpu_next_hint(chunk_md, alloc_bits); bits = 0; pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) { if (!pop_only || pcpu_is_populated(chunk, bit_off, bits, -- 2.17.1
[PATCH 07/12] percpu: add block level scan_hint
Fragmentation can cause both blocks and chunks to have an early first_firee bit available, but only able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou --- mm/percpu-internal.h | 9 mm/percpu.c | 101 --- 2 files changed, 103 insertions(+), 7 deletions(-) diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index b1739dc06b73..ec58b244545d 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -9,8 +9,17 @@ * pcpu_block_md is the metadata block struct. * Each chunk's bitmap is split into a number of full blocks. * All units are in terms of bits. + * + * The scan hint is the largest known contiguous area before the contig hint. + * It is not necessarily the actual largest contig hint though. There is an + * invariant that the scan_hint_start > contig_hint_start iff + * scan_hint == contig_hint. This is necessary because when scanning forward, + * we don't know if a new contig hint would be better than the current one. */ struct pcpu_block_md { + int scan_hint; /* scan hint for block */ + int scan_hint_start; /* block relative starting + position of the scan hint */ int contig_hint;/* contig hint for block */ int contig_hint_start; /* block relative starting position of the contig hint */ diff --git a/mm/percpu.c b/mm/percpu.c index 967c9cc3a928..df1aacf58ac8 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int index, int off) return index * PCPU_BITMAP_BLOCK_BITS + off; } +/* + * pcpu_next_hint - determine which hint to use + * @block: block of interest + * @alloc_bits: size of allocation + * + * This determines if we should scan based on the scan_hint or first_free. + * In general, we want to scan from first_free to fulfill allocations by + * first fit. However, if we know a scan_hint at position scan_hint_start + * cannot fulfill an allocation, we can begin scanning from there knowing + * the contig_hint will be our fallback. + */ +static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits) +{ + /* +* The three conditions below determine if we can skip past the +* scan_hint. First, does the scan hint exist. Second, is the +* contig_hint after the scan_hint (possibly not true iff +* contig_hint == scan_hint). Third, is the allocation request +* larger than the scan_hint. 
+*/ + if (block->scan_hint && + block->contig_hint_start > block->scan_hint_start && + alloc_bits > block->scan_hint) + return block->scan_hint_start + block->scan_hint; + + return block->first_free; +} + /** * pcpu_next_md_free_region - finds the next hint free area * @chunk: chunk of interest @@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits, if (block->contig_hint && block->contig_hint_start >= block_off && block->contig_hint >= *bits + alloc_bits) { + int start = pcpu_next_hint(block, alloc_bits); + *bits += alloc_bits + block->contig_hint_start - -block->first_free; - *bit_off = pcpu_block_off_to_off(i, block->first_free); +start; + *bit_off = pcpu_block_off_to_off(i, start); return; } /* reset to satisfy the second predicate above */ @@ -632,12 +662,57 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end) block->right_free = contig; if (contig > block->contig_hint) { + /* promote the old contig_hint to be the new scan_hint */ + if (start > block->contig_hint_start) { + if (block->contig_hint > block->scan_hint) { + block->scan_hint_start = + block->contig_hint_start; + block->scan_hint = block->contig_hint; + } else if (start < block->scan_hint_start) { + /* +* The old contig_hint == scan_hint. But, the +
Re: [PATCH] percpu/module resevation: change resevation size iff X86_VSMP is set
Hi Barret, On Fri, Mar 01, 2019 at 01:30:15PM -0500, Barret Rhoden wrote: > Hi - > > On 01/21/2019 06:47 AM, Eial Czerwacki wrote: > > > > Your main issue was that you only sent this patch to LKML, but not the > maintainers of the file. If you don't, your patch might get lost. To get > the appropriate people and lists, run: > > scripts/get_maintainer.pl YOUR_PATCH.patch. > > For this patch, you'll get this: > > Dennis Zhou (maintainer:PER-CPU MEMORY ALLOCATOR) > Tejun Heo (maintainer:PER-CPU MEMORY ALLOCATOR) > Christoph Lameter (maintainer:PER-CPU MEMORY ALLOCATOR) > linux-kernel@vger.kernel.org (open list) > > I added the three maintainers to this email. > > I have a few minor comments below. > > > [PATCH] percpu/module resevation: change resevation size iff X86_VSMP is > set > > You misspelled 'reservation'. Also, I'd just say: "percpu: increase module > reservation size if X86_VSMP is set". ('change' -> 'increase'), only says > 'reservation' once.) > > > as reported in bug #201339 > > (https://bugzilla.kernel.org/show_bug.cgi?id=201339) > > I think you can add a tag for this right above your Signed-off-by tags. > e.g.: > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201339 > > > by enabling X86_VSMP, INTERNODE_CACHE_BYTES's definition differs from the > > default one > > causing the struct size to exceed the size ok 8KB. > ^of > > Which struct are you talking about? I have one in mind, but others might > not know from reading the commit message. > > I ran into this on https://bugzilla.kernel.org/show_bug.cgi?id=202511. In > that case, it was because modules (drm and amdkfd) were using DEFINE_SRCU, > which does a DEFINE_PER_CPU on struct srcu_data, and that used > cacheline_internodealigned_in_smp. > > > > > in order to avoid such issue, increse PERCPU_MODULE_RESERVE to 64KB if > > CONFIG_X86_VSMP is set. >^increase > > > > > the value was caculated on linux 4.20.3, make allmodconfig all and the > > following oneliner: >^calculated > > > for f in `find -name *.ko`; do echo $f; readelf -S $f |grep perc; done > > |grep data..percpu -B 1 |grep ko |while read r; do echo -n "$r: "; objdump > > --syms --section=.data..percpu $r|grep data |sort -n |awk '{c++; > > d=strtonum("0x" $1) + strtonum("0x" $5); if (m < d) m = d;} END {printf("%d > > vars-> last addr 0x%x ( %d )\n", c, m, m)}' ; done |column -t |sort -k 8 -n > > | awk '{print $8}'| paste -sd+ | bc > > Not sure how useful the one-liner is, versus a description of what you're > doing. i.e. "the size of all module percpu data sections, or something." > > Also, how close was that calculated value to 64K? If more modules start > using DEFINE_SRCU, each of which uses 8K, then that 64K might run out. > > Thanks, > Barret > > > Signed-off-by: Eial Czerwacki > > Signed-off-by: Shai Fultheim > > Signed-off-by: Oren Twaig > > --- > > include/linux/percpu.h | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/include/linux/percpu.h b/include/linux/percpu.h > > index 70b7123..6b79693 100644 > > --- a/include/linux/percpu.h > > +++ b/include/linux/percpu.h > > @@ -14,7 +14,11 @@ > > /* enough to cover all DEFINE_PER_CPUs in modules */ > > #ifdef CONFIG_MODULES > > +#ifdef X86_VSMP > > +#define PERCPU_MODULE_RESERVE (1 << 16) > > +#else > > #define PERCPU_MODULE_RESERVE (8 << 10) > > +#endif > > #else > > #define PERCPU_MODULE_RESERVE 0 > > #endif > > > Thanks for sending this to me. I must say, I really do not want to expand the reserved region. 
In most cases, it can easily end up unused and thus wasted memory as it is hard allocated on boot. This is done because code gen assumes static variables are close to the program counter. This would not be true with dynamic allocations, which are at the end of the vmalloc area (summarized from Tejun's account in [1]). Another note on the reserved region: it starts at the end of the static region, which means it generally isn't page aligned. So while an 8kb allocation would fit, a 4kb alignment more than likely would fail. Something as large as 8kb should probably be dynamically allocated as well.

I read through the bugzilla report and it seems that the culprits are:
  drivers/gpu/drm/amd/amdkfd/kfd_process.c:DEFINE_SRCU(kfd_processes_srcu);
  drivers/gpu/drm/drm_drv.c:DEFINE_STATIC_SRCU(drm_unplug_srcu);

Is there a reason we cannot dynamically initialize these structs? I've cced Paul McKenney because we saw an issue with ipmi in December [1].

[1] https://lore.kernel.org/linux-mm/CAJM9R-JWO1P_qJzw2JboMH2dgPX7K1tF49nO5ojvf=iwgdd...@mail.gmail.com/

Thanks,
Dennis
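For illustration, a rough, untested sketch of what dynamic initialization could look like for the drm case (assuming drm_drv.c's existing module init/exit hooks; error handling trimmed): init_srcu_struct() allocates the per-cpu srcu_data with the dynamic percpu allocator instead of placing it in the module's static percpu section via DEFINE_STATIC_SRCU.

#include <linux/module.h>
#include <linux/srcu.h>

/* dynamically initialized instead of DEFINE_STATIC_SRCU(drm_unplug_srcu) */
static struct srcu_struct drm_unplug_srcu;

static int __init drm_core_init(void)
{
	int ret;

	/* allocates the per-cpu srcu_data via the dynamic percpu allocator */
	ret = init_srcu_struct(&drm_unplug_srcu);
	if (ret)
		return ret;

	/* ... existing initialization ... */

	return 0;
}

static void __exit drm_core_exit(void)
{
	/* ... existing teardown ... */

	cleanup_srcu_struct(&drm_unplug_srcu);
}

module_init(drm_core_init);
module_exit(drm_core_exit);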
Re: [PATCH 02/12] percpu: do not search past bitmap when allocating an area
On Sat, Mar 02, 2019 at 01:32:04PM +, Peng Fan wrote: > Hi Dennis, > > > -Original Message- > > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > > Behalf Of Dennis Zhou > > Sent: 2019年2月28日 10:18 > > To: Dennis Zhou ; Tejun Heo ; Christoph > > Lameter > > Cc: Vlad Buslov ; kernel-t...@fb.com; > > linux...@kvack.org; linux-kernel@vger.kernel.org > > Subject: [PATCH 02/12] percpu: do not search past bitmap when allocating an > > area > > > > pcpu_find_block_fit() guarantees that a fit is found within > > PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it > > compares against the block's contig_hint. This can lead to incorrectly > > scanning > > past the end of the bitmap. The behavior was okay given the check after for > > bit_off >= end and the correctness of the hints from pcpu_find_block_fit(). > > > > This patch fixes this by bounding the end offset by the number of bits in a > > chunk. > > > > Signed-off-by: Dennis Zhou > > --- > > mm/percpu.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index 53bd79a617b1..69ca51d238b5 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, > > int alloc_bits, > > /* > > * Search to find a fit. > > */ > > - end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS; > > + end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS, > > + pcpu_chunk_map_bits(chunk)); > > bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start, > > alloc_bits, align_mask); > > if (bit_off >= end) > > -- > > From pcpu_alloc_area itself, I think this is correct to avoid > bitmap_find_next_zero_area > scan past the boundaries of alloc_map, so > > Reviewed-by: Peng Fan > > There are a few points I did not understand well, > Per understanding pcpu_find_block_fit is to find the first bit off in a chunk > which could satisfy > the bits allocation, so bits might be larger than PCPU_BITMAP_BLOCK_BITS. And > if > pcpu_find_block_fit returns a good off, it means there is a area in the chunk > could satisfy > the bits allocation, then the following pcpu_alloc_area will not scan past > the boundaries of > alloc_map, right? > pcpu_find_block_fit() finds the chunk offset corresponding to the block that will be able to fit the chunk. Allocations are done by first fit, so scanning begins from the first_free of a block. Because the hints are always accurate, you never fail to find a fit in pcpu_alloc_area() if pcpu_find_block_fit() gives you an offset. This means you never scan past the end anyway. Thanks, Dennis
Re: [PATCH 03/12] percpu: introduce helper to determine if two regions overlap
On Sat, Mar 02, 2019 at 01:37:37PM +, Peng Fan wrote: > Hi Dennis, > > > -Original Message- > > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > > Behalf Of Dennis Zhou > > Sent: 2019年2月28日 10:19 > > To: Dennis Zhou ; Tejun Heo ; Christoph > > Lameter > > Cc: Vlad Buslov ; kernel-t...@fb.com; > > linux...@kvack.org; linux-kernel@vger.kernel.org > > Subject: [PATCH 03/12] percpu: introduce helper to determine if two regions > > overlap > > > > While block hints were always accurate, it's possible when spanning across > > blocks that we miss updating the chunk's contig_hint. Rather than rely on > > correctness of the boundaries of hints, do a full overlap comparison. > > > > Signed-off-by: Dennis Zhou > > --- > > mm/percpu.c | 31 +++ > > 1 file changed, 27 insertions(+), 4 deletions(-) > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index 69ca51d238b5..b40112b2fc59 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -546,6 +546,24 @@ static inline int pcpu_cnt_pop_pages(struct > > pcpu_chunk *chunk, int bit_off, > >bitmap_weight(chunk->populated, page_start); } > > > > +/* > > + * pcpu_region_overlap - determines if two regions overlap > > + * @a: start of first region, inclusive > > + * @b: end of first region, exclusive > > + * @x: start of second region, inclusive > > + * @y: end of second region, exclusive > > + * > > + * This is used to determine if the hint region [a, b) overlaps with > > +the > > + * allocated region [x, y). > > + */ > > +static inline bool pcpu_region_overlap(int a, int b, int x, int y) { > > + if ((x >= a && x < b) || (y > a && y <= b) || > > + (x <= a && y >= b)) > > I think this could be simplified: > (a < y) && (x < b) could be used to do overlap check. > I'll change it to be the negative. Thanks, Dennis > > > + return true; > > + return false; > > +} > > + > > /** > > * pcpu_chunk_update - updates the chunk metadata given a free area > > * @chunk: chunk of interest > > @@ -710,8 +728,11 @@ static void pcpu_block_update_hint_alloc(struct > > pcpu_chunk *chunk, int bit_off, > > PCPU_BITMAP_BLOCK_BITS, > > s_off + bits); > > > > - if (s_off >= s_block->contig_hint_start && > > - s_off < s_block->contig_hint_start + s_block->contig_hint) { > > + if (pcpu_region_overlap(s_block->contig_hint_start, > > + s_block->contig_hint_start + > > + s_block->contig_hint, > > + s_off, > > + s_off + bits)) { > > /* block contig hint is broken - scan to fix it */ > > pcpu_block_refresh_hint(chunk, s_index); > > } else { > > @@ -764,8 +785,10 @@ static void pcpu_block_update_hint_alloc(struct > > pcpu_chunk *chunk, int bit_off, > > * contig hint is broken. Otherwise, it means a smaller space > > * was used and therefore the chunk contig hint is still correct. > > */ > > - if (bit_off >= chunk->contig_bits_start && > > - bit_off < chunk->contig_bits_start + chunk->contig_bits) > > + if (pcpu_region_overlap(chunk->contig_bits_start, > > + chunk->contig_bits_start + chunk->contig_bits, > > + bit_off, > > + bit_off + bits)) > > pcpu_chunk_refresh_hint(chunk); > > } > > > > -- > > 2.17.1 >
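Since the thread settles on the simplified form, here is a small standalone check (illustration only, not kernel code) that (a < y) && (x < b), equivalently the negation !(y <= a || b <= x), agrees with the original three-clause test for all non-empty half-open regions:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool overlap_orig(int a, int b, int x, int y)
{
	return (x >= a && x < b) || (y > a && y <= b) || (x <= a && y >= b);
}

static bool overlap_simple(int a, int b, int x, int y)
{
	return a < y && x < b;	/* equivalently: !(y <= a || b <= x) */
}

int main(void)
{
	for (int a = 0; a < 8; a++)
		for (int b = a + 1; b <= 8; b++)
			for (int x = 0; x < 8; x++)
				for (int y = x + 1; y <= 8; y++)
					assert(overlap_orig(a, b, x, y) ==
					       overlap_simple(a, b, x, y));
	puts("simplified check matches for all non-empty regions");
	return 0;
}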
Re: [PATCH 04/12] percpu: manage chunks based on contig_bits instead of free_bytes
On Sat, Mar 02, 2019 at 01:48:20PM +, Peng Fan wrote: > > > > -Original Message- > > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > > Behalf Of Dennis Zhou > > Sent: 2019年2月28日 10:19 > > To: Dennis Zhou ; Tejun Heo ; Christoph > > Lameter > > Cc: Vlad Buslov ; kernel-t...@fb.com; > > linux...@kvack.org; linux-kernel@vger.kernel.org > > Subject: [PATCH 04/12] percpu: manage chunks based on contig_bits instead > > of free_bytes > > > > When a chunk becomes fragmented, it can end up having a large number of > > small allocation areas free. The free_bytes sorting of chunks leads to > > unnecessary checking of chunks that cannot satisfy the allocation. > > Switch to contig_bits sorting to prevent scanning chunks that may not be > > able > > to service the allocation request. > > > > Signed-off-by: Dennis Zhou > > --- > > mm/percpu.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index b40112b2fc59..c996bcffbb2a 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk > > *chunk) > > if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits > > == 0) > > return 0; > > > > - return pcpu_size_to_slot(chunk->free_bytes); > > + return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE); > > } > > > > /* set the pointer to a chunk in a page struct */ > > Reviewed-by: Peng Fan > > Not relevant to this patch, another optimization to percpu might be good > to use per chunk spin_lock, not gobal pcpu_lock. > Percpu memory itself is expensive and for the most part shouldn't be part of the critical path. Ideally, we don't have multiple chunks being allocated simultaneously because once an allocation is given out, the chunk is pinned until all allocations are freed. Thanks, Dennis
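As a toy illustration of why slotting by contig_bits behaves better for fragmented chunks, consider a chunk with plenty of free bytes but only small contiguous runs. The constants and the slot formula below are simplified stand-ins for the real pcpu_size_to_slot(), just to show the relative effect:

#include <stdio.h>

#define MIN_ALLOC_SIZE	4	/* stand-in for PCPU_MIN_ALLOC_SIZE */

static int size_to_slot(int size)
{
	int highbit = 0;

	while (size > 1) {		/* position of the highest set bit, like fls() - 1 */
		size >>= 1;
		highbit++;
	}
	return highbit > 5 ? highbit - 5 + 1 : 1;	/* simplified base-shift scaling */
}

int main(void)
{
	int free_bytes = 8192;		/* plenty of free space ... */
	int contig_bits = 16;		/* ... but the largest run is only 16 * 4 = 64 bytes */

	printf("slot by free_bytes:  %d\n", size_to_slot(free_bytes));
	printf("slot by contig_bits: %d\n",
	       size_to_slot(contig_bits * MIN_ALLOC_SIZE));
	return 0;
}

Keyed by free_bytes the fragmented chunk sits in a high slot and keeps getting scanned; keyed by contig_bits it drops to a slot that reflects what it can actually serve.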
Re: [PATCH 05/12] percpu: relegate chunks unusable when failing small allocations
On Sat, Mar 02, 2019 at 01:55:54PM +, Peng Fan wrote: > Hi Dennis, > > > -Original Message- > > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > > Behalf Of Dennis Zhou > > Sent: 2019年2月28日 10:19 > > To: Dennis Zhou ; Tejun Heo ; Christoph > > Lameter > > Cc: Vlad Buslov ; kernel-t...@fb.com; > > linux...@kvack.org; linux-kernel@vger.kernel.org > > Subject: [PATCH 05/12] percpu: relegate chunks unusable when failing small > > allocations > > > > In certain cases, requestors of percpu memory may want specific alignments. > > However, it is possible to end up in situations where the contig_hint > > matches, > > but the alignment does not. This causes excess scanning of chunks that will > > fail. > > To prevent this, if a small allocation fails (< 32B), the chunk is moved to > > the > > empty list. Once an allocation is freed from that chunk, it is placed back > > into > > rotation. > > > > Signed-off-by: Dennis Zhou > > --- > > mm/percpu.c | 35 ++- > > 1 file changed, 26 insertions(+), 9 deletions(-) > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index c996bcffbb2a..3d7deece9556 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -94,6 +94,8 @@ > > > > /* the slots are sorted by free bytes left, 1-31 bytes share the same slot > > */ > > #define PCPU_SLOT_BASE_SHIFT 5 > > +/* chunks in slots below this are subject to being sidelined on failed > > alloc */ > > +#define PCPU_SLOT_FAIL_THRESHOLD 3 > > > > #define PCPU_EMPTY_POP_PAGES_LOW 2 > > #define PCPU_EMPTY_POP_PAGES_HIGH 4 > > @@ -488,6 +490,22 @@ static void pcpu_mem_free(void *ptr) > > kvfree(ptr); > > } > > > > +static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, > > + bool move_front) > > +{ > > + if (chunk != pcpu_reserved_chunk) { > > + if (move_front) > > + list_move(&chunk->list, &pcpu_slot[slot]); > > + else > > + list_move_tail(&chunk->list, &pcpu_slot[slot]); > > + } > > +} > > + > > +static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot) { > > + __pcpu_chunk_move(chunk, slot, true); > > +} > > + > > /** > > * pcpu_chunk_relocate - put chunk in the appropriate chunk slot > > * @chunk: chunk of interest > > @@ -505,12 +523,8 @@ static void pcpu_chunk_relocate(struct pcpu_chunk > > *chunk, int oslot) { > > int nslot = pcpu_chunk_slot(chunk); > > > > - if (chunk != pcpu_reserved_chunk && oslot != nslot) { > > - if (oslot < nslot) > > - list_move(&chunk->list, &pcpu_slot[nslot]); > > - else > > - list_move_tail(&chunk->list, &pcpu_slot[nslot]); > > - } > > + if (oslot != nslot) > > + __pcpu_chunk_move(chunk, nslot, oslot < nslot); > > } > > > > /** > > @@ -1381,7 +1395,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t > > align, bool reserved, > > bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; > > bool do_warn = !(gfp & __GFP_NOWARN); > > static int warn_limit = 10; > > - struct pcpu_chunk *chunk; > > + struct pcpu_chunk *chunk, *next; > > const char *err; > > int slot, off, cpu, ret; > > unsigned long flags; > > @@ -1443,11 +1457,14 @@ static void __percpu *pcpu_alloc(size_t size, > > size_t align, bool reserved, > > restart: > > /* search through normal chunks */ > > for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) { > > - list_for_each_entry(chunk, &pcpu_slot[slot], list) { > > + list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) { > > off = pcpu_find_block_fit(chunk, bits, bit_align, > > is_atomic); > > - if (off < 0) > > + if (off < 0) { > > + if (slot < PCPU_SLOT_FAIL_THRESHOLD) > > + pcpu_chunk_move(chunk, 0); > > continue; > > + } 
> > > > off = pcpu_alloc_area(chunk, bits, bit_align, off); > > if (off >= 0) > > For the code: Reviewed-by: Peng Fan > > But I did not understand well why choose 32B? If there are > more information, better put in commit log. > There isn't; I just picked a small allocation size. Thanks, Dennis
Re: [PATCH 00/12] introduce percpu block scan_hint
On Wed, Feb 27, 2019 at 09:18:27PM -0500, Dennis Zhou wrote: > Hi everyone, > > It was reported a while [1] that an increase in allocation alignment > requirement [2] caused the percpu memory allocator to do significantly > more work. > > After spending quite a bit of time diving into it, it seems the crux was > the following: > 1) chunk management by free_bytes caused allocations to scan over > chunks that could not fit due to fragmentation > 2) per block fragmentation required scanning from an early first_free > bit causing allocations to repeat work > > This series introduces a scan_hint for pcpu_block_md and merges the > paths used to manage the hints. The scan_hint represents the largest > known free area prior to the contig_hint. There are some caveats to > this. First, it may not necessarily be the largest area as we do partial > updates based on freeing of regions and failed scanning in > pcpu_alloc_area(). Second, if contig_hint == scan_hint, then > scan_hint_start > contig_hint_start is possible. This is necessary > for scan_hint discovery when refreshing the hint of a block. > > A necessary change is to enforce a block to be the size of a page. This > let's the management of nr_empty_pop_pages to be done by breaking and > making full contig_hints in the hint update paths. Prior, this was done > by piggy backing off of refreshing the chunk contig_hint as it performed > a full scan and counting empty full pages. > > The following are the results found using the workload provided in [3]. > > branch| time > > 5.0-rc7 | 69s > [2] reverted | 44s > scan_hint | 39s > > The times above represent the approximate average across multiple runs. > I tested based on a basic 1M 16-byte allocation pattern with no > alignment requirement and times did not differ between 5.0-rc7 and > scan_hint. > > [1] > https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=kz3hdcwp...@mail.gmail.com/ > [2] > https://lore.kernel.org/netdev/20181116154329.247947-1-eduma...@google.com/ > [3] https://lore.kernel.org/netdev/vbfzhrj9smb@mellanox.com/ > > This patchset contains the following 12 patches: > 0001-percpu-update-free-path-with-correct-new-free-region.patch > 0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch > 0003-percpu-introduce-helper-to-determine-if-two-regions-.patch > 0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch > 0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch > 0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch > 0007-percpu-add-block-level-scan_hint.patch > 0008-percpu-remember-largest-area-skipped-during-allocati.patch > 0009-percpu-use-block-scan_hint-to-only-scan-forward.patch > 0010-percpu-make-pcpu_block_md-generic.patch > 0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch > 0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch > > 0001 fixes an issue where the chunk contig_hint was being updated > improperly with the new region's starting offset and possibly differing > contig_hint. 0002 fixes possibly scanning pass the end of the bitmap. > 0003 introduces a helper to do region overlap comparison. 0004 switches > to chunk management by contig_hint rather than free_bytes. 0005 moves > chunks that fail to allocate to the empty block list to prevent excess > scanning with of chunks with small contig_hints and poor alignment. > 0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and > modifies nr_empty_pop_pages management to be a part of the hint updates. 
> 0007-0009 introduces percpu block scan_hint. 0010 makes pcpu_block_md > generic so chunk hints can be managed as a pcpu_block_md responsible > for more bits. 0011-0012 add chunk scan_hints. > > This patchset is on top of percpu#master a3b22b9f11d9. > > diffstats below: > > Dennis Zhou (12): > percpu: update free path with correct new free region > percpu: do not search past bitmap when allocating an area > percpu: introduce helper to determine if two regions overlap > percpu: manage chunks based on contig_bits instead of free_bytes > percpu: relegate chunks unusable when failing small allocations > percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE > percpu: add block level scan_hint > percpu: remember largest area skipped during allocation > percpu: use block scan_hint to only scan forward > percpu: make pcpu_block_md generic > percpu: convert chunk hints to be based on pcpu_block_md > percpu: use chunk scan_hint to skip some scanning > > include/linux/percpu.h | 12 +- > mm/percpu-internal.h | 15 +- > mm/percpu-km.c | 2 +- > mm/percpu-stats.c | 5 +- > mm/percpu.c| 547 + > 5 files changed, 404 insertions(+), 177 deletions(-) > > Thanks, > Dennis Applied to percpu/for-5.2. Thanks, Dennis
[PATCH 10/12] btrfs: zstd use the passed through level instead of default
Zstd currently only supports the default level of compression. This patch switches to using the level passed in for btrfs zstd configuration. Zstd workspaces now keep track of the requested level as this can differ from the size of the workspace. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov --- fs/btrfs/zstd.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index 43f3be755b8c..a951d4fe77f7 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -21,10 +21,10 @@ #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG) #define ZSTD_BTRFS_DEFAULT_LEVEL 3 -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len) +static ZSTD_parameters zstd_get_btrfs_parameters(unsigned int level, +size_t src_len) { - ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL, - src_len, 0); + ZSTD_parameters params = ZSTD_getParams(level, src_len, 0); if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG) params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG; @@ -36,6 +36,7 @@ struct workspace { void *mem; size_t size; char *buf; + unsigned int req_level; struct list_head list; ZSTD_inBuffer in_buf; ZSTD_outBuffer out_buf; @@ -55,7 +56,12 @@ static void zstd_cleanup_workspace_manager(void) static struct list_head *zstd_get_workspace(unsigned int level) { - return btrfs_get_workspace(&wsm, level); + struct list_head *ws = btrfs_get_workspace(&wsm, level); + struct workspace *workspace = list_entry(ws, struct workspace, list); + + workspace->req_level = level; + + return ws; } static void zstd_put_workspace(struct list_head *ws) @@ -75,7 +81,7 @@ static void zstd_free_workspace(struct list_head *ws) static struct list_head *zstd_alloc_workspace(unsigned int level) { ZSTD_parameters params = - zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT); + zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT); struct workspace *workspace; workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); @@ -117,7 +123,8 @@ static int zstd_compress_pages(struct list_head *ws, unsigned long len = *total_out; const unsigned long nr_dest_pages = *out_pages; unsigned long max_out = nr_dest_pages * PAGE_SIZE; - ZSTD_parameters params = zstd_get_btrfs_parameters(len); + ZSTD_parameters params = zstd_get_btrfs_parameters(workspace->req_level, + len); *out_pages = 0; *total_out = 0; -- 2.17.1
[PATCH 06/12] btrfs: add compression interface in (get/put)_workspace()
There are two levels of workspace management. First, alloc()/free() which are responsible for actually creating and destroy workspaces. Second, at a higher level, get()/put() which is the compression code asking for a workspace from a workspace_manager. The compression code shouldn't really care how it gets a workspace, but that it got a workspace. This adds get_workspace() and put_workspace() to be the higher level interface which is responsible for indexing into the appropriate compression type. It also introduces btrfs_put_workspace() and btrfs_get_workspace() to be the generic implementations of the higher interface. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 57 +- 1 file changed, 34 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index b213d1efb586..98e1b222f78c 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -839,7 +839,7 @@ static void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) * Preallocation makes a forward progress guarantees and we do not return * errors. */ -static struct list_head *find_workspace(int type) +static struct list_head *btrfs_get_workspace(struct workspace_manager *wsman) { struct list_head *workspace; int cpus = num_online_cpus(); @@ -850,11 +850,11 @@ static struct list_head *find_workspace(int type) wait_queue_head_t *ws_wait; int *free_ws; - idle_ws = &wsm[type].idle_ws; - ws_lock = &wsm[type].ws_lock; - total_ws = &wsm[type].total_ws; - ws_wait = &wsm[type].ws_wait; - free_ws = &wsm[type].free_ws; + idle_ws = &wsman->idle_ws; + ws_lock = &wsman->ws_lock; + total_ws = &wsman->total_ws; + ws_wait = &wsman->ws_wait; + free_ws = &wsman->free_ws; again: spin_lock(ws_lock); @@ -885,7 +885,7 @@ static struct list_head *find_workspace(int type) * context of btrfs_compress_bio/btrfs_compress_pages */ nofs_flag = memalloc_nofs_save(); - workspace = wsm[type].ops->alloc_workspace(); + workspace = wsman->ops->alloc_workspace(); memalloc_nofs_restore(nofs_flag); if (IS_ERR(workspace)) { @@ -916,11 +916,17 @@ static struct list_head *find_workspace(int type) return workspace; } +static struct list_head *get_workspace(int type) +{ + return btrfs_get_workspace(&wsm[type]); +} + /* * put a workspace struct back on the list or free it if we have enough * idle ones sitting around */ -static void free_workspace(int type, struct list_head *workspace) +static void btrfs_put_workspace(struct workspace_manager *wsman, + struct list_head *ws) { struct list_head *idle_ws; spinlock_t *ws_lock; @@ -928,27 +934,32 @@ static void free_workspace(int type, struct list_head *workspace) wait_queue_head_t *ws_wait; int *free_ws; - idle_ws = &wsm[type].idle_ws; - ws_lock = &wsm[type].ws_lock; - total_ws = &wsm[type].total_ws; - ws_wait = &wsm[type].ws_wait; - free_ws = &wsm[type].free_ws; + idle_ws = &wsman->idle_ws; + ws_lock = &wsman->ws_lock; + total_ws = &wsman->total_ws; + ws_wait = &wsman->ws_wait; + free_ws = &wsman->free_ws; spin_lock(ws_lock); if (*free_ws <= num_online_cpus()) { - list_add(workspace, idle_ws); + list_add(ws, idle_ws); (*free_ws)++; spin_unlock(ws_lock); goto wake; } spin_unlock(ws_lock); - wsm[type].ops->free_workspace(workspace); + wsman->ops->free_workspace(ws); atomic_dec(total_ws); wake: cond_wake_up(ws_wait); } +static void put_workspace(int type, struct list_head *ws) +{ + return btrfs_put_workspace(&wsm[type], ws); +} + /* * Given an address space and start and length, compress the bytes into @pages * that are allocated on demand. 
@@ -982,14 +993,14 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping, struct list_head *workspace; int ret; - workspace = find_workspace(type); + workspace = get_workspace(type); btrfs_compress_op[type]->set_level(workspace, type_level); ret = btrfs_compress_op[type]->compress_pages(workspace, mapping, start, pages, out_pages, total_in, total_out); - free_workspace(type, workspace); + put_workspace(type, workspace); return ret; } @@ -1013,9 +1024,9 @@ static int btrfs_decompress_bio(struct co
[PATCH 11/12] btrfs: make zstd memory requirements monotonic
It is possible based on the level configurations that a higher level workspace uses less memory than a lower level workspace. In order to reuse workspaces, this must be made a monotonic relationship. This precomputes the required memory for each level and enforces the monotonicity between level and memory required. This is also done in upstream zstd in [1]. [1] https://github.com/facebook/zstd/commit/a68b76afefec6876f8e8a538155109a5aeac0143 Signed-off-by: Dennis Zhou Cc: Nick Terrell --- fs/btrfs/zstd.c | 38 +- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index a951d4fe77f7..2231123fedbe 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -20,6 +20,7 @@ #define ZSTD_BTRFS_MAX_WINDOWLOG 17 #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG) #define ZSTD_BTRFS_DEFAULT_LEVEL 3 +#define ZSTD_BTRFS_MAX_LEVEL 15 static ZSTD_parameters zstd_get_btrfs_parameters(unsigned int level, size_t src_len) @@ -44,8 +45,39 @@ struct workspace { static struct workspace_manager wsm; +static size_t zstd_ws_mem_sizes[ZSTD_BTRFS_MAX_LEVEL]; + +/* + * zstd_calc_ws_mem_sizes - calculate monotonic memory bounds + * + * It is possible based on the level configurations that a higher level + * workspace uses less memory than a lower level workspace. In order to + * reuse workspaces, this must be made a monotonic relationship. This + * precomputes the required memory for each level and enforces the + * monotonicity between level and memory required. + */ +static void zstd_calc_ws_mem_sizes(void) +{ + size_t max_size = 0; + unsigned int level; + + for (level = 1; level <= ZSTD_BTRFS_MAX_LEVEL; level++) { + ZSTD_parameters params = + zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT); + size_t level_size = + max_t(size_t, + ZSTD_CStreamWorkspaceBound(params.cParams), + ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)); + + max_size = max_t(size_t, max_size, level_size); + zstd_ws_mem_sizes[level - 1] = max_size; + } +} + static void zstd_init_workspace_manager(void) { + zstd_calc_ws_mem_sizes(); + btrfs_init_workspace_manager(&wsm, &btrfs_zstd_compress); } @@ -80,17 +112,13 @@ static void zstd_free_workspace(struct list_head *ws) static struct list_head *zstd_alloc_workspace(unsigned int level) { - ZSTD_parameters params = - zstd_get_btrfs_parameters(level, ZSTD_BTRFS_MAX_INPUT); struct workspace *workspace; workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); if (!workspace) return ERR_PTR(-ENOMEM); - workspace->size = max_t(size_t, - ZSTD_CStreamWorkspaceBound(params.cParams), - ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)); + workspace->size = zstd_ws_mem_sizes[level - 1]; workspace->mem = kvmalloc(workspace->size, GFP_KERNEL); workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); if (!workspace->mem || !workspace->buf) -- 2.17.1
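The running-max idea in zstd_calc_ws_mem_sizes() can be illustrated with a small standalone sketch; the sizes below are made-up numbers, not real zstd workspace bounds:

#include <stddef.h>
#include <stdio.h>

#define MAX_LEVEL 5

int main(void)
{
	/* hypothetical raw per-level requirements, not monotonic as-is */
	size_t raw[MAX_LEVEL] = { 780, 1004, 900, 1260, 1100 };
	size_t ws_mem[MAX_LEVEL];
	size_t max_size = 0;

	for (int i = 0; i < MAX_LEVEL; i++) {
		if (raw[i] > max_size)
			max_size = raw[i];
		ws_mem[i] = max_size;	/* a level-N workspace now fits any level <= N */
	}

	for (int i = 0; i < MAX_LEVEL; i++)
		printf("level %d: %zu KiB\n", i + 1, ws_mem[i]);
	return 0;
}

Enforcing the monotonic bound is what lets a higher-level workspace be reused for a lower-level request later in the series.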
[PATCH 08/12] btrfs: plumb level through the compression interface
Zlib compression supports multiple levels, but doesn't require changing in how a workspace itself is created and managed. Zstd introduces a different memory requirement such that higher levels of compression require more memory. This requires changes in how the alloc()/get() methods work for zstd. This pach plumbs compression level through the interface as a parameter in preparation for zstd compression levels. This gives the compression types opportunity to create/manage based on the compression level. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 31 --- fs/btrfs/compression.h | 7 --- fs/btrfs/lzo.c | 6 +++--- fs/btrfs/zlib.c| 7 --- fs/btrfs/zstd.c| 6 +++--- 5 files changed, 30 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 3d42ab71455d..a54b210739bc 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -742,9 +742,9 @@ static void heuristic_cleanup_workspace_manager(void) btrfs_cleanup_workspace_manager(&heuristic_wsm); } -static struct list_head *heuristic_get_workspace(void) +static struct list_head *heuristic_get_workspace(unsigned int level) { - return btrfs_get_workspace(&heuristic_wsm); + return btrfs_get_workspace(&heuristic_wsm, level); } static void heuristic_put_workspace(struct list_head *ws) @@ -764,7 +764,7 @@ static void free_heuristic_ws(struct list_head *ws) kfree(workspace); } -static struct list_head *alloc_heuristic_ws(void) +static struct list_head *alloc_heuristic_ws(unsigned int level) { struct heuristic_ws *ws; @@ -823,7 +823,7 @@ void btrfs_init_workspace_manager(struct workspace_manager *wsm, * Preallocate one workspace for each compression type so * we can guarantee forward progress in the worst case */ - workspace = wsm->ops->alloc_workspace(); + workspace = wsm->ops->alloc_workspace(0); if (IS_ERR(workspace)) { pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); } else { @@ -851,7 +851,8 @@ void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) * Preallocation makes a forward progress guarantees and we do not return * errors. 
*/ -struct list_head *btrfs_get_workspace(struct workspace_manager *wsm) +struct list_head *btrfs_get_workspace(struct workspace_manager *wsm, + unsigned int level) { struct list_head *workspace; int cpus = num_online_cpus(); @@ -897,7 +898,7 @@ struct list_head *btrfs_get_workspace(struct workspace_manager *wsm) * context of btrfs_compress_bio/btrfs_compress_pages */ nofs_flag = memalloc_nofs_save(); - workspace = wsm->ops->alloc_workspace(); + workspace = wsm->ops->alloc_workspace(level); memalloc_nofs_restore(nofs_flag); if (IS_ERR(workspace)) { @@ -928,9 +929,9 @@ struct list_head *btrfs_get_workspace(struct workspace_manager *wsm) return workspace; } -static struct list_head *get_workspace(int type) +static struct list_head *get_workspace(int type, int level) { - return btrfs_compress_op[type]->get_workspace(); + return btrfs_compress_op[type]->get_workspace(level); } /* @@ -1001,12 +1002,13 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping, unsigned long *total_out) { int type = btrfs_compress_type(type_level); + int level = btrfs_compress_level(type_level); struct list_head *workspace; int ret; - workspace = get_workspace(type); + workspace = get_workspace(type, level); - btrfs_compress_op[type]->set_level(workspace, type_level); + btrfs_compress_op[type]->set_level(workspace, level); ret = btrfs_compress_op[type]->compress_pages(workspace, mapping, start, pages, out_pages, @@ -1035,7 +1037,7 @@ static int btrfs_decompress_bio(struct compressed_bio *cb) int ret; int type = cb->compress_type; - workspace = get_workspace(type); + workspace = get_workspace(type, 0); ret = btrfs_compress_op[type]->decompress_bio(workspace, cb); put_workspace(type, workspace); @@ -1053,13 +1055,12 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page, struct list_head *workspace; int ret; - workspace = get_workspace(type); - + workspace = get_workspace(type, 0); ret = btrfs_compress_op[type]->decompress(workspace, data_in, dest_page, start_byte, srcle
[PATCH 07/12] btrfs: move to fn pointers for get/put workspaces
The previous patch added generic helpers for get_workspace() and put_workspace(). Now, we can migrate ownership of the workspace_manager to be in the compression type code as the compression code itself doesn't care beyond being able to get a workspace. The init/cleanup and get/put methods are abstracted so each compression algorithm can decide how they want to manage their workspaces. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 101 +++-- fs/btrfs/compression.h | 26 +++ fs/btrfs/lzo.c | 26 +++ fs/btrfs/zlib.c| 26 +++ fs/btrfs/zstd.c| 26 +++ 5 files changed, 160 insertions(+), 45 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 98e1b222f78c..3d42ab71455d 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -730,6 +730,28 @@ struct heuristic_ws { struct list_head list; }; +static struct workspace_manager heuristic_wsm; + +static void heuristic_init_workspace_manager(void) +{ + btrfs_init_workspace_manager(&heuristic_wsm, &btrfs_heuristic_compress); +} + +static void heuristic_cleanup_workspace_manager(void) +{ + btrfs_cleanup_workspace_manager(&heuristic_wsm); +} + +static struct list_head *heuristic_get_workspace(void) +{ + return btrfs_get_workspace(&heuristic_wsm); +} + +static void heuristic_put_workspace(struct list_head *ws) +{ + btrfs_put_workspace(&heuristic_wsm, ws); +} + static void free_heuristic_ws(struct list_head *ws) { struct heuristic_ws *workspace; @@ -770,24 +792,14 @@ static struct list_head *alloc_heuristic_ws(void) } const struct btrfs_compress_op btrfs_heuristic_compress = { + .init_workspace_manager = heuristic_init_workspace_manager, + .cleanup_workspace_manager = heuristic_cleanup_workspace_manager, + .get_workspace = heuristic_get_workspace, + .put_workspace = heuristic_put_workspace, .alloc_workspace = alloc_heuristic_ws, .free_workspace = free_heuristic_ws, }; -struct workspace_manager { - const struct btrfs_compress_op *ops; - struct list_head idle_ws; - spinlock_t ws_lock; - /* Number of free workspaces */ - int free_ws; - /* Total number of allocated workspaces */ - atomic_t total_ws; - /* Waiters for a free workspace */ - wait_queue_head_t ws_wait; -}; - -static struct workspace_manager wsm[BTRFS_NR_WORKSPACE_MANAGERS]; - static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_heuristic_compress, &btrfs_zlib_compress, @@ -795,33 +807,33 @@ static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zstd_compress, }; -static void btrfs_init_workspace_manager(int type) +void btrfs_init_workspace_manager(struct workspace_manager *wsm, + const struct btrfs_compress_op *ops) { - struct workspace_manager *wsman = &wsm[type]; struct list_head *workspace; - wsman->ops = btrfs_compress_op[type]; + wsm->ops = ops; - INIT_LIST_HEAD(&wsman->idle_ws); - spin_lock_init(&wsman->ws_lock); - atomic_set(&wsman->total_ws, 0); - init_waitqueue_head(&wsman->ws_wait); + INIT_LIST_HEAD(&wsm->idle_ws); + spin_lock_init(&wsm->ws_lock); + atomic_set(&wsm->total_ws, 0); + init_waitqueue_head(&wsm->ws_wait); /* * Preallocate one workspace for each compression type so * we can guarantee forward progress in the worst case */ - workspace = wsman->ops->alloc_workspace(); + workspace = wsm->ops->alloc_workspace(); if (IS_ERR(workspace)) { pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); } else { - atomic_set(&wsman->total_ws, 1); - wsman->free_ws = 1; - list_add(workspace, &wsman->idle_ws); + atomic_set(&wsm->total_ws, 1); + wsm->free_ws = 1; + 
list_add(workspace, &wsm->idle_ws); } } -static void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) +void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) { struct list_head *ws; @@ -839,7 +851,7 @@ static void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) * Preallocation makes a forward progress guarantees and we do not return * errors. */ -static struct list_head *btrfs_get_workspace(struct workspace_manager *wsman) +struct list_head *btrfs_get_workspace(struct workspace_manager *wsm) { struct list_head *workspace; int cpus = num_online_cpus(); @@ -850,11 +862,11 @@ static struct list_head *btrfs_get_workspace(struc
[PATCH 09/12] btrfs: change set_level() to bound the level passed in
Currently, the only user of set_level() is zlib which sets an internal workspace parameter. As level is now plumbed into get_workspace(), this can be handled there rather than separately. This repurposes set_level() to bound the level passed in so it can be used when setting the mounts compression level and as well as verifying the level before getting a workspace. The other benefit is this divides the meaning of compress(0) and get_workspace(0). The former means we want to use the default compression level of the compression type. The latter means we can use any workspace available. Signed-off-by: Dennis Zhou --- fs/btrfs/compression.c | 21 + fs/btrfs/compression.h | 10 -- fs/btrfs/lzo.c | 3 ++- fs/btrfs/super.c | 4 +++- fs/btrfs/zlib.c| 18 ++ fs/btrfs/zstd.c| 3 ++- 6 files changed, 38 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index a54b210739bc..dc4afd7d9027 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -1007,8 +1007,6 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping, int ret; workspace = get_workspace(type, level); - - btrfs_compress_op[type]->set_level(workspace, level); ret = btrfs_compress_op[type]->compress_pages(workspace, mapping, start, pages, out_pages, @@ -1561,14 +1559,21 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end) return ret; } -unsigned int btrfs_compress_str2level(const char *str) +unsigned int btrfs_compress_str2level(unsigned int type, const char *str) { - if (strncmp(str, "zlib", 4) != 0) + unsigned int level; + int ret; + + if (!type) return 0; - /* Accepted form: zlib:1 up to zlib:9 and nothing left after the number */ - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0) - return str[5] - '0'; + if (str[0] == ':') { + ret = kstrtouint(str + 1, 10, &level); + if (ret) + level = 0; + } + + level = btrfs_compress_op[type]->set_level(level); - return BTRFS_ZLIB_DEFAULT_LEVEL; + return level; } diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 32ba40ebae1d..3f2d36b0aabe 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -97,7 +97,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, int mirror_num, unsigned long bio_flags); -unsigned btrfs_compress_str2level(const char *str); +unsigned int btrfs_compress_str2level(unsigned int type, const char *str); enum btrfs_compression_type { BTRFS_COMPRESS_NONE = 0, @@ -156,7 +156,13 @@ struct btrfs_compress_op { unsigned long start_byte, size_t srclen, size_t destlen); - void (*set_level)(struct list_head *ws, unsigned int type); + /* +* This bounds the level set by the user to be within range of +* a particular compression type. It returns the level that +* will be used if the level is out of bounds or the default +* if 0 is passed in. 
+*/ + unsigned int (*set_level)(unsigned int level); }; /* the heuristic workspaces are managed via the 0th workspace manager */ diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index f132af45a924..579d53ae256f 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -507,8 +507,9 @@ static int lzo_decompress(struct list_head *ws, unsigned char *data_in, return ret; } -static void lzo_set_level(struct list_head *ws, unsigned int type) +static unsigned int lzo_set_level(unsigned int level) { + return 0; } const struct btrfs_compress_op btrfs_lzo_compress = { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index c5586ffd1426..b28dff207383 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -529,7 +529,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, if (token != Opt_compress && token != Opt_compress_force) info->compress_level = - btrfs_compress_str2level(args[0].from); + btrfs_compress_str2level( + BTRFS_COMPRESS_ZLIB, + args[0].from + 4); btrfs_set_opt(info->mount_opt, COMPRESS);
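For the zstd side, the bounding callback this interface anticipates could look roughly like the following. This is a hedged sketch based on the constants already defined in fs/btrfs/zstd.c; the actual callback lands later in the series and may differ:

static unsigned int zstd_set_level(unsigned int level)
{
	if (!level)
		return ZSTD_BTRFS_DEFAULT_LEVEL;

	return min_t(unsigned int, level, ZSTD_BTRFS_MAX_LEVEL);
}

Passing 0 asks for the type's default, anything larger is clamped to the supported maximum, matching the split between compress(0) and get_workspace(0) described above.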
[PATCH 04/12] btrfs: unify compression ops with workspace_manager
Make the workspace_manager own the interface operations rather than managing index-paired arrays for the workspace_manager and compression operations. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 5b508cb3b236..ef560b47b9c5 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -775,6 +775,7 @@ const struct btrfs_compress_op btrfs_heuristic_compress = { }; struct workspace_manager { + const struct btrfs_compress_op *ops; struct list_head idle_ws; spinlock_t ws_lock; /* Number of free workspaces */ @@ -800,6 +801,8 @@ void __init btrfs_init_compress(void) int i; for (i = 0; i < BTRFS_NR_WORKSPACE_MANAGERS; i++) { + wsm[i].ops = btrfs_compress_op[i]; + INIT_LIST_HEAD(&wsm[i].idle_ws); spin_lock_init(&wsm[i].ws_lock); atomic_set(&wsm[i].total_ws, 0); @@ -809,7 +812,7 @@ void __init btrfs_init_compress(void) * Preallocate one workspace for each compression type so * we can guarantee forward progress in the worst case */ - workspace = btrfs_compress_op[i]->alloc_workspace(); + workspace = wsm[i].ops->alloc_workspace(); if (IS_ERR(workspace)) { pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); } else { @@ -872,7 +875,7 @@ static struct list_head *find_workspace(int type) * context of btrfs_compress_bio/btrfs_compress_pages */ nofs_flag = memalloc_nofs_save(); - workspace = btrfs_compress_op[type]->alloc_workspace(); + workspace = wsm[type].ops->alloc_workspace(); memalloc_nofs_restore(nofs_flag); if (IS_ERR(workspace)) { @@ -930,7 +933,7 @@ static void free_workspace(int type, struct list_head *workspace) } spin_unlock(ws_lock); - btrfs_compress_op[type]->free_workspace(workspace); + wsm[type].ops->free_workspace(workspace); atomic_dec(total_ws); wake: cond_wake_up(ws_wait); @@ -948,7 +951,7 @@ static void free_workspaces(void) while (!list_empty(&wsm[i].idle_ws)) { workspace = wsm[i].idle_ws.next; list_del(workspace); - btrfs_compress_op[i]->free_workspace(workspace); + wsm[i].ops->free_workspace(workspace); atomic_dec(&wsm[i].total_ws); } } -- 2.17.1
[PATCH 01/12] btrfs: add helpers for compression type and level
It is very easy to miss places that rely on a certain bitshifting for decyphering the type_level overloading. Add helpers to do this instead. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik Cc: Omar Sandoval --- fs/btrfs/compression.c | 2 +- fs/btrfs/compression.h | 10 ++ fs/btrfs/zlib.c| 2 +- 3 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 548057630b69..94a0b0a3a301 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -1036,9 +1036,9 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping, unsigned long *total_in, unsigned long *total_out) { + int type = btrfs_compress_type(type_level); struct list_head *workspace; int ret; - int type = type_level & 0xF; workspace = find_workspace(type); diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index ddda9b80bf20..004db0b3111b 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -64,6 +64,16 @@ struct compressed_bio { u32 sums; }; +static inline unsigned int btrfs_compress_type(unsigned int type_level) +{ + return (type_level & 0xF); +} + +static inline unsigned int btrfs_compress_level(unsigned int type_level) +{ + return ((type_level & 0xF0) >> 4); +} + void __init btrfs_init_compress(void); void __cold btrfs_exit_compress(void); diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index 970ff3e35bb3..2bd655c4f8b4 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -393,7 +393,7 @@ static int zlib_decompress(struct list_head *ws, unsigned char *data_in, static void zlib_set_level(struct list_head *ws, unsigned int type) { struct workspace *workspace = list_entry(ws, struct workspace, list); - unsigned level = (type & 0xF0) >> 4; + unsigned int level = btrfs_compress_level(type); if (level > 9) level = 9; -- 2.17.1
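A quick standalone illustration of the packing these helpers decode, with the type value mirroring the btrfs_compression_type enum (low nibble is the type, next nibble is the level):

#include <stdio.h>

#define BTRFS_COMPRESS_ZLIB 1	/* value as used by btrfs */

static unsigned int compress_type(unsigned int type_level)
{
	return type_level & 0xF;
}

static unsigned int compress_level(unsigned int type_level)
{
	return (type_level & 0xF0) >> 4;
}

int main(void)
{
	unsigned int type_level = (6 << 4) | BTRFS_COMPRESS_ZLIB;	/* zlib:6 */

	printf("type=%u level=%u\n", compress_type(type_level),
	       compress_level(type_level));
	return 0;
}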
[PATCH 05/12] btrfs: add helper methods for workspace manager init and cleanup
Workspace manager init and cleanup code is open coded inside a for loop over the compression types. This forces each compression type to rely on the same workspace manager implementation. This patch creates helper methods that will be the generic implementation for btrfs workspace management. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 81 ++ 1 file changed, 42 insertions(+), 39 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index ef560b47b9c5..b213d1efb586 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -795,31 +795,41 @@ static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zstd_compress, }; -void __init btrfs_init_compress(void) +static void btrfs_init_workspace_manager(int type) { + struct workspace_manager *wsman = &wsm[type]; struct list_head *workspace; - int i; - for (i = 0; i < BTRFS_NR_WORKSPACE_MANAGERS; i++) { - wsm[i].ops = btrfs_compress_op[i]; + wsman->ops = btrfs_compress_op[type]; - INIT_LIST_HEAD(&wsm[i].idle_ws); - spin_lock_init(&wsm[i].ws_lock); - atomic_set(&wsm[i].total_ws, 0); - init_waitqueue_head(&wsm[i].ws_wait); + INIT_LIST_HEAD(&wsman->idle_ws); + spin_lock_init(&wsman->ws_lock); + atomic_set(&wsman->total_ws, 0); + init_waitqueue_head(&wsman->ws_wait); - /* -* Preallocate one workspace for each compression type so -* we can guarantee forward progress in the worst case -*/ - workspace = wsm[i].ops->alloc_workspace(); - if (IS_ERR(workspace)) { - pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); - } else { - atomic_set(&wsm[i].total_ws, 1); - wsm[i].free_ws = 1; - list_add(workspace, &wsm[i].idle_ws); - } + /* +* Preallocate one workspace for each compression type so +* we can guarantee forward progress in the worst case +*/ + workspace = wsman->ops->alloc_workspace(); + if (IS_ERR(workspace)) { + pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); + } else { + atomic_set(&wsman->total_ws, 1); + wsman->free_ws = 1; + list_add(workspace, &wsman->idle_ws); + } +} + +static void btrfs_cleanup_workspace_manager(struct workspace_manager *wsman) +{ + struct list_head *ws; + + while (!list_empty(&wsman->idle_ws)) { + ws = wsman->idle_ws.next; + list_del(ws); + wsman->ops->free_workspace(ws); + atomic_dec(&wsman->total_ws); } } @@ -939,24 +949,6 @@ static void free_workspace(int type, struct list_head *workspace) cond_wake_up(ws_wait); } -/* - * cleanup function for module exit - */ -static void free_workspaces(void) -{ - struct list_head *workspace; - int i; - - for (i = 0; i < BTRFS_NR_WORKSPACE_MANAGERS; i++) { - while (!list_empty(&wsm[i].idle_ws)) { - workspace = wsm[i].idle_ws.next; - list_del(workspace); - wsm[i].ops->free_workspace(workspace); - atomic_dec(&wsm[i].total_ws); - } - } -} - /* * Given an address space and start and length, compress the bytes into @pages * that are allocated on demand. @@ -1049,9 +1041,20 @@ int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page, return ret; } +void __init btrfs_init_compress(void) +{ + int i; + + for (i = 0; i < BTRFS_NR_WORKSPACE_MANAGERS; i++) + btrfs_init_workspace_manager(i); +} + void __cold btrfs_exit_compress(void) { - free_workspaces(); + int i; + + for (i = 0; i < BTRFS_NR_WORKSPACE_MANAGERS; i++) + btrfs_cleanup_workspace_manager(&wsm[i]); } /* -- 2.17.1
[PATCH 02/12] btrfs: rename workspaces_list to workspace_manager
This is in preparation for zstd compression levels. As each level will require different sized workspaces, workspaces_list is no longer a really fitting name. Signed-off-by: Dennis Zhou Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik --- fs/btrfs/compression.c | 46 +- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 94a0b0a3a301..d098df768b67 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -769,7 +769,7 @@ static struct list_head *alloc_heuristic_ws(void) return ERR_PTR(-ENOMEM); } -struct workspaces_list { +struct workspace_manager { struct list_head idle_ws; spinlock_t ws_lock; /* Number of free workspaces */ @@ -780,9 +780,9 @@ struct workspaces_list { wait_queue_head_t ws_wait; }; -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; +static struct workspace_manager wsm[BTRFS_COMPRESS_TYPES]; -static struct workspaces_list btrfs_heuristic_ws; +static struct workspace_manager btrfs_heuristic_ws; static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zlib_compress, @@ -811,10 +811,10 @@ void __init btrfs_init_compress(void) } for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) { - INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws); - spin_lock_init(&btrfs_comp_ws[i].ws_lock); - atomic_set(&btrfs_comp_ws[i].total_ws, 0); - init_waitqueue_head(&btrfs_comp_ws[i].ws_wait); + INIT_LIST_HEAD(&wsm[i].idle_ws); + spin_lock_init(&wsm[i].ws_lock); + atomic_set(&wsm[i].total_ws, 0); + init_waitqueue_head(&wsm[i].ws_wait); /* * Preallocate one workspace for each compression type so @@ -824,9 +824,9 @@ void __init btrfs_init_compress(void) if (IS_ERR(workspace)) { pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); } else { - atomic_set(&btrfs_comp_ws[i].total_ws, 1); - btrfs_comp_ws[i].free_ws = 1; - list_add(workspace, &btrfs_comp_ws[i].idle_ws); + atomic_set(&wsm[i].total_ws, 1); + wsm[i].free_ws = 1; + list_add(workspace, &wsm[i].idle_ws); } } } @@ -856,11 +856,11 @@ static struct list_head *__find_workspace(int type, bool heuristic) ws_wait = &btrfs_heuristic_ws.ws_wait; free_ws = &btrfs_heuristic_ws.free_ws; } else { - idle_ws = &btrfs_comp_ws[idx].idle_ws; - ws_lock = &btrfs_comp_ws[idx].ws_lock; - total_ws = &btrfs_comp_ws[idx].total_ws; - ws_wait = &btrfs_comp_ws[idx].ws_wait; - free_ws = &btrfs_comp_ws[idx].free_ws; + idle_ws = &wsm[idx].idle_ws; + ws_lock = &wsm[idx].ws_lock; + total_ws = &wsm[idx].total_ws; + ws_wait = &wsm[idx].ws_wait; + free_ws = &wsm[idx].free_ws; } again: @@ -952,11 +952,11 @@ static void __free_workspace(int type, struct list_head *workspace, ws_wait = &btrfs_heuristic_ws.ws_wait; free_ws = &btrfs_heuristic_ws.free_ws; } else { - idle_ws = &btrfs_comp_ws[idx].idle_ws; - ws_lock = &btrfs_comp_ws[idx].ws_lock; - total_ws = &btrfs_comp_ws[idx].total_ws; - ws_wait = &btrfs_comp_ws[idx].ws_wait; - free_ws = &btrfs_comp_ws[idx].free_ws; + idle_ws = &wsm[idx].idle_ws; + ws_lock = &wsm[idx].ws_lock; + total_ws = &wsm[idx].total_ws; + ws_wait = &wsm[idx].ws_wait; + free_ws = &wsm[idx].free_ws; } spin_lock(ws_lock); @@ -998,11 +998,11 @@ static void free_workspaces(void) } for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) { - while (!list_empty(&btrfs_comp_ws[i].idle_ws)) { - workspace = btrfs_comp_ws[i].idle_ws.next; + while (!list_empty(&wsm[i].idle_ws)) { + workspace = wsm[i].idle_ws.next; list_del(workspace); btrfs_compress_op[i]->free_workspace(workspace); - atomic_dec(&btrfs_comp_ws[i].total_ws); + atomic_dec(&wsm[i].total_ws); } } } -- 
2.17.1
[PATCH v2 00/12] btrfs: add zstd compression level support
terface and converts set_level() to be a bounding function rather than setting level on a workspace. 0011 makes zstd compression level memory monotonic. 00012 adds zstd compression level support. This patchset is on top of kdave#master d73aba1115cf. diffstats below: Dennis Zhou (12): btrfs: add helpers for compression type and level btrfs: rename workspaces_list to workspace_manager btrfs: manage heuristic workspace as index 0 btrfs: unify compression ops with workspace_manager btrfs: add helper methods for workspace manager init and cleanup btrfs: add compression interface in (get/put)_workspace() btrfs: move to fn pointers for get/put workspaces btrfs: plumb level through the compression interface btrfs: change set_level() to bound the level passed in btrfs: zstd use the passed through level instead of default btrfs: make zstd memory requirements monotonic btrfs: add zstd compression level support fs/btrfs/compression.c | 249 +++-- fs/btrfs/compression.h | 53 ++- fs/btrfs/lzo.c | 31 +++- fs/btrfs/super.c | 10 +- fs/btrfs/zlib.c| 45 -- fs/btrfs/zstd.c| 311 +++-- 6 files changed, 539 insertions(+), 160 deletions(-) Thanks, Dennis
Re: [PATCH v2 00/12] btrfs: add zstd compression level support
On Tue, Feb 05, 2019 at 03:57:53PM +0100, David Sterba wrote: > On Mon, Feb 04, 2019 at 03:19:56PM -0500, Dennis Zhou wrote: > > Hi everyone, > > > > V2 had only a handful of changes outside of minor feedback. > > 0001: > > - use functions over macros > > 0003: > > - BTRFS_NR_WORKSPACE_MANAGERS is added instead of overriding > > BTRFS_COMPRESS_TYPES > > 0011 (new): > > - address monotonic memory requirement for zstd workspaces > > 0012: > > - increase reclaim timer to 307s from 67s > > - move from keeping track of time in ns to jiffies > > - remove timer in cleanup code > > - use min_t() instead of if statements in .set_level() > > - add header text to describe how workspaces are managed > > - nofs_flag type -> unsigned long to unsigned int > > Something is wrong, the patchset on top of 5.0-rc5 hangs in test > btrfs/007, without a stacktrace. V1 was fine and I double checked that > rc5 itself is fine. Hmmm, well that's awkward. I ran xfstests and that test passed on my machine. I'll figure out the delta and submit a v3. Thanks David! Thanks, Dennis
Re: [PATCH v2 00/12] btrfs: add zstd compression level support
On Tue, Feb 05, 2019 at 05:27:34PM +0100, David Sterba wrote: > On Tue, Feb 05, 2019 at 11:03:02AM -0500, Dennis Zhou wrote: > > On Tue, Feb 05, 2019 at 03:57:53PM +0100, David Sterba wrote: > > > On Mon, Feb 04, 2019 at 03:19:56PM -0500, Dennis Zhou wrote: > > > > Hi everyone, > > > > > > > > V2 had only a handful of changes outside of minor feedback. > > > > 0001: > > > > - use functions over macros > > > > 0003: > > > > - BTRFS_NR_WORKSPACE_MANAGERS is added instead of overriding > > > > BTRFS_COMPRESS_TYPES > > > > 0011 (new): > > > > - address monotonic memory requirement for zstd workspaces > > > > 0012: > > > > - increase reclaim timer to 307s from 67s > > > > - move from keeping track of time in ns to jiffies > > > > - remove timer in cleanup code > > > > - use min_t() instead of if statements in .set_level() > > > > - add header text to describe how workspaces are managed > > > > - nofs_flag type -> unsigned long to unsigned int > > > > > > Something is wrong, the patchset on top of 5.0-rc5 hangs in test > > > btrfs/007, without a stacktrace. V1 was fine and I double checked that > > > rc5 itself is fine. > > > > Hmmm, well that's awkward. I ran xfstests and that test passed on my > > machine. I'll figure out the delta and submit a v3. Thanks David! > > It failed on a VM and 2 other physical machines, so it's not somethig > related to the testing setup. Oh, I wasn't alluding to that. I'm more that sure it's my fault somewhere along the line. Thanks, Dennis
Re: [PATCH v2 00/12] btrfs: add zstd compression level support
On Tue, Feb 05, 2019 at 07:27:49PM +0100, David Sterba wrote: > On Tue, Feb 05, 2019 at 06:07:49PM +0100, David Sterba wrote: > > On Tue, Feb 05, 2019 at 05:51:13PM +0100, David Sterba wrote: > > > On Tue, Feb 05, 2019 at 11:30:12AM -0500, Dennis Zhou wrote: > > > > > > > Something is wrong, the patchset on top of 5.0-rc5 hangs in test > > > > > > > btrfs/007, without a stacktrace. V1 was fine and I double checked > > > > > > > that > > > > > > > rc5 itself is fine. > > > > > > > > > > > > Hmmm, well that's awkward. I ran xfstests and that test passed on my > > > > > > machine. I'll figure out the delta and submit a v3. Thanks David! > > > > > > > > > > It failed on a VM and 2 other physical machines, so it's not somethig > > > > > related to the testing setup. > > > > > > > > Oh, I wasn't alluding to that. I'm more that sure it's my fault > > > > somewhere along the line. > > > > > > Hang on, it's probably caused by fstests, as fssum is busy waiting > > > somewhere, so it's not even kernel code. > > > > > > fstests a4235f29f780b15d9e87c0f2446a53b9e69c5e2d -- ok > > > fstests 6d17c9076fe9d97d2411e5909e6b255e5e721fc3 (master) -- hangs > > > > Git bisect points to > > > > commit 04b405fa1f8d0cacde489f3d1c8771dfe7a34dc0 > > Author: Zorro Lang > > Date: Wed Jan 23 15:34:55 2019 +0800 > > > > common/dump: disable splice from FSSTRESS_AVOID > > So with this commit reverted, test btrfs/007 does not hang anymore. Ok great! I'm going to add a v2 for 0012 to add taking the spin_lock just to be safe in cleanup. I should have that ready in a few minutes. Thanks, Dennis
[PATCH v2 12/12] btrfs: add zstd compression level support
>From 0d8684d1d7b18dfa9d5bc9c74033c6c3b6fecd92 Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Sat, 19 Jan 2019 18:51:39 -0800 Zstd compression requires different amounts of memory for each level of compression. The prior patches implemented indirection to allow for each compression type to manage their workspaces independently. This patch uses this indirection to implement compression level support for zstd. To manage the additional memory requirement, each compression level has its own queue of workspaces. A global LRU is used to help with reclaim. Reclaim is done via a timer which provides a mechanism to decrease memory utilization by keeping only workspaces around that are sized appropriately. Forward progress is guaranteed by a preallocated max workspace hidden from the LRU. When getting a workspace, it uses a bitmap to identify the levels that are populated and scans up. If it finds a workspace that is greater than it, it uses it, but does not update the last_used time and the corresponding place in the LRU. If we hit memory pressure, we sleep on the max level workspace. We continue to rescan in case we can use a smaller workspace, but eventually should be able to obtain the max level workspace or allocate one again should memory pressure subside. The memory requirement for decompression is the same as level 1, and therefore can use any available workspace. The number of workspaces is bound by an upper limit of the workqueue's limit which currently is 2 (percpu limit). The reclaim timer is used to free inactive/improperly sized workspaces and is set to 307s to avoid colliding with transaction commit (every 30s). Repeating the experiment from v2 [1], the Silesia corpus was copied to a btrfs filesystem 10 times and then read back after dropping the caches. The btrfs filesystem was on an SSD.
Level   Ratio   Compression (MB/s)   Decompression (MB/s)   Memory (KB)
1       2.658   438.47               910.51                  780
2       2.744   364.86               886.55                 1004
3       2.801   336.33               828.41                 1260
4       2.858   286.71               886.55                 1260
5       2.916   212.77               556.84                 1388
6       2.363   119.82               990.85                 1516
7       3.000   154.06               849.30                 1516
8       3.011   159.54               875.03                 1772
9       3.025   100.51               940.15                 1772
10      3.033   118.97               616.26                 1772
11      3.036    94.19               802.11                 1772
12      3.037    73.45               931.49                 1772
13      3.041    55.17               835.26                 2284
14      3.087    44.70               716.78                 2547
15      3.126    37.30               878.84                 2547
[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terre...@fb.com/ Signed-off-by: Dennis Zhou Cc: Nick Terrell Cc: Omar Sandoval --- v2: - increase reclaim timer to 307s from 67s - move from keeping track of time in ns to jiffies - remove timer in cleanup code (use del_timer_sync) - use min_t() instead of if statements in .set_level() - add header text to describe how workspaces are managed - nofs_flag type -> unsigned long to unsigned int fs/btrfs/super.c | 6 +- fs/btrfs/zstd.c | 245 +-- 2 files changed, 242 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index b28dff207383..0ecc513cb56c 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -544,9 +544,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, btrfs_clear_opt(info->mount_opt, NODATASUM); btrfs_set_fs_incompat(info, COMPRESS_LZO); no_compress = 0; - } else if (strcmp(args[0].from, "zstd") == 0) { + } else if (strncmp(args[0].from, "zstd", 4) == 0) { compress_type = "zstd"; info->compress_type = BTRFS_COMPRESS_ZSTD; + info->compress_level = + btrfs_compress_str2level( +BTRFS_COMPRESS_ZSTD, +args[0].from + 4); btrfs_set_opt(info->mount_opt, COMPRESS); btrfs_clear_opt(info->mount_opt, NODATACOW); btrfs_clear_opt(info->mount_opt, NODATASUM); diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index 2231123fedbe..54219c84320c 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -3,24 +3,45 @@ * Copyright (c) 2016-present, Facebook, Inc. * All rights reserved. * + * Zs
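To make the "scan up" behavior in the commit message concrete, here is a hedged sketch of the lookup, assuming a per-level idle list plus a bitmap of populated levels. The field names (active_map, idle_ws, lock) are illustrative stand-ins and may not match the final fs/btrfs/zstd.c:

/*
 * Assumed manager layout (illustrative only):
 *   struct zstd_workspace_manager {
 *       unsigned long active_map;                        // bit N set => level N+1 has an idle ws
 *       struct list_head idle_ws[ZSTD_BTRFS_MAX_LEVEL];  // per-level free lists
 *       spinlock_t lock;
 *   } wsm;
 */
static struct list_head *zstd_find_workspace_sketch(unsigned int level)
{
	struct list_head *ws;
	unsigned int i;

	spin_lock(&wsm.lock);
	/* start at the requested level and walk toward bigger workspaces */
	for (i = find_next_bit(&wsm.active_map, ZSTD_BTRFS_MAX_LEVEL, level - 1);
	     i < ZSTD_BTRFS_MAX_LEVEL;
	     i = find_next_bit(&wsm.active_map, ZSTD_BTRFS_MAX_LEVEL, i + 1)) {
		if (list_empty(&wsm.idle_ws[i]))
			continue;

		ws = wsm.idle_ws[i].next;
		list_del(ws);
		/* last idle workspace at this level: clear its bit */
		if (list_empty(&wsm.idle_ws[i]))
			clear_bit(i, &wsm.active_map);
		spin_unlock(&wsm.lock);
		return ws;
	}
	spin_unlock(&wsm.lock);

	return NULL;	/* caller falls back to allocating or sleeping */
}

A caller that gets NULL back either allocates a new workspace or, under memory pressure, sleeps on the preallocated max-level workspace, as the commit message describes.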
Re: [PATCH 09/12] btrfs: change set_level() to bound the level passed in
On Tue, Feb 05, 2019 at 08:06:37PM +0100, David Sterba wrote: > On Mon, Feb 04, 2019 at 03:20:05PM -0500, Dennis Zhou wrote: > > -unsigned int btrfs_compress_str2level(const char *str) > > +unsigned int btrfs_compress_str2level(unsigned int type, const char *str) > > { > > - if (strncmp(str, "zlib", 4) != 0) > > + unsigned int level; > > + int ret; > > + > > + if (!type) > > return 0; > > > > - /* Accepted form: zlib:1 up to zlib:9 and nothing left after the number > > */ > > - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0) > > - return str[5] - '0'; > > + if (str[0] == ':') { > > + ret = kstrtouint(str + 1, 10, &level); > > The docs kstrtouint of say that initial + is also accepted, I'd rather > keep the level specification strict, ie. no "zlib:+3" and no garbage > after the number. > > The validation is currently missing but I think we should catch levels > out of range during mount/remount. The fallback to default is a safety > but wrong specification should be communicated to the user early. Ok. To make sure I understand properly for improper level (ie "30", "+3", "+3d") set the level to default (already done) and pr_warn saying invalid level? Thanks, Dennis
Re: [PATCH 09/12] btrfs: change set_level() to bound the level passed in
On Tue, Feb 05, 2019 at 08:54:15PM +0100, David Sterba wrote: > On Tue, Feb 05, 2019 at 02:32:54PM -0500, Dennis Zhou wrote: > > On Tue, Feb 05, 2019 at 08:06:37PM +0100, David Sterba wrote: > > > On Mon, Feb 04, 2019 at 03:20:05PM -0500, Dennis Zhou wrote: > > > > -unsigned int btrfs_compress_str2level(const char *str) > > > > +unsigned int btrfs_compress_str2level(unsigned int type, const char > > > > *str) > > > > { > > > > - if (strncmp(str, "zlib", 4) != 0) > > > > + unsigned int level; > > > > + int ret; > > > > + > > > > + if (!type) > > > > return 0; > > > > > > > > - /* Accepted form: zlib:1 up to zlib:9 and nothing left after > > > > the number */ > > > > - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] > > > > == 0) > > > > - return str[5] - '0'; > > > > + if (str[0] == ':') { > > > > + ret = kstrtouint(str + 1, 10, &level); > > > > > > The docs kstrtouint of say that initial + is also accepted, I'd rather > > > keep the level specification strict, ie. no "zlib:+3" and no garbage > > > after the number. > > > > > > The validation is currently missing but I think we should catch levels > > > out of range during mount/remount. The fallback to default is a safety > > > but wrong specification should be communicated to the user early. > > > > Ok. To make sure I understand properly for improper level (ie "30", > > "+3", "+3d") set the level to default (already done) and pr_warn saying > > invalid level? > > So we have (at least) two ways how to handle: > > - warn and fail the mount -- catch typos and the like, continuing with > default could cause problems later though not that severe as the > compressionw would be different than expected, but this could be > considered a usability bug > > - only warn and continue with the default -- not mounting a root > filesystem just because of a typo can be worse than mounting with > wrong level; as long as there's a warning in the log, the user still > has a working system and can fix it manually > > Both have some pros and cons and initially I was more inclined to pick > the 1st option, but now that I'm thinking about that again, the other > has some merit. > > Given that zlib now falls back to default for unrecognized level, we > should probably stick to that for zstd too. Ok. With that I'll run a v3 adding a warning for a '+' or invalid input. I think moving compression to being a property rather than a mount option would be ideal. That would enable multiple compression levels per mount and prevent remounting with incorrect mount options. Thanks, Dennis
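A userspace sketch of the stricter parsing agreed on above: reject a leading '+' and any trailing garbage, warn, and fall back to the default level. parse_level(), ZSTD_MAX_LEVEL and ZSTD_DEFAULT_LEVEL are illustrative names only; the in-kernel version uses kstrtouint() plus the per-type .set_level() bound rather than strtoul().

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define ZSTD_MAX_LEVEL          15
#define ZSTD_DEFAULT_LEVEL      3

static unsigned int parse_level(const char *str)
{
        char *end;
        unsigned long level;

        if (str[0] != ':')                      /* no ":N" suffix -> default */
                return ZSTD_DEFAULT_LEVEL;

        errno = 0;
        level = strtoul(str + 1, &end, 10);
        if (str[1] == '+' || end == str + 1 || *end != '\0' || errno ||
            level == 0 || level > ZSTD_MAX_LEVEL) {
                fprintf(stderr, "invalid compression level '%s', using default\n",
                        str + 1);
                return ZSTD_DEFAULT_LEVEL;
        }
        return level;
}

int main(void)
{
        const char *tests[] = { ":9", ":+3", ":3d", ":30", "" };

        for (unsigned int i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
                printf("'zstd%s' -> level %u\n", tests[i], parse_level(tests[i]));
        return 0;
}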
Re: [PATCH v2 00/12] btrfs: add zstd compression level support
On Tue, Feb 05, 2019 at 01:30:27PM -0500, Dennis Zhou wrote: > On Tue, Feb 05, 2019 at 07:27:49PM +0100, David Sterba wrote: > > On Tue, Feb 05, 2019 at 06:07:49PM +0100, David Sterba wrote: > > > On Tue, Feb 05, 2019 at 05:51:13PM +0100, David Sterba wrote: > > > > On Tue, Feb 05, 2019 at 11:30:12AM -0500, Dennis Zhou wrote: > > > > > > > > Something is wrong, the patchset on top of 5.0-rc5 hangs in test > > > > > > > > btrfs/007, without a stacktrace. V1 was fine and I double > > > > > > > > checked that > > > > > > > > rc5 itself is fine. > > > > > > > > > > > > > > Hmmm, well that's awkward. I ran xfstests and that test passed on > > > > > > > my > > > > > > > machine. I'll figure out the delta and submit a v3. Thanks David! > > > > > > > > > > > > It failed on a VM and 2 other physical machines, so it's not > > > > > > somethig > > > > > > related to the testing setup. > > > > > > > > > > Oh, I wasn't alluding to that. I'm more that sure it's my fault > > > > > somewhere along the line. > > > > > > > > Hang on, it's probably caused by fstests, as fssum is busy waiting > > > > somewhere, so it's not even kernel code. > > > > > > > > fstests a4235f29f780b15d9e87c0f2446a53b9e69c5e2d -- ok > > > > fstests 6d17c9076fe9d97d2411e5909e6b255e5e721fc3 (master) -- hangs > > > > > > Git bisect points to > > > > > > commit 04b405fa1f8d0cacde489f3d1c8771dfe7a34dc0 > > > Author: Zorro Lang > > > Date: Wed Jan 23 15:34:55 2019 +0800 > > > > > > common/dump: disable splice from FSSTRESS_AVOID > > > > So with this commit reverted, test btrfs/007 does not hang anymore. > > Ok great! I'm going to add a v2 for 0012 to add taking the spin_lock > just to be safe in cleanup. I should have that ready in a few minutes. > Before I run v3, are there any other concerns (pending the changes I made to address the mentioned make sense)? Thanks, Dennis
Re: [PATCH v3 12/12] btrfs: add zstd compression level support
>From 16b7c3fe05984a95436da1e9e01c80de1fdbba25 Mon Sep 17 00:00:00 2001
From: Dennis Zhou
Date: Sat, 19 Jan 2019 18:51:39 -0800

Zstd compression requires different amounts of memory for each level of compression. The prior patches implemented indirection to allow for each compression type to manage its workspaces independently. This patch uses this indirection to implement compression level support for zstd.

To manage the additional memory required, each compression level has its own queue of workspaces. A global LRU is used to help with reclaim. Reclaim is done via a timer which provides a mechanism to decrease memory utilization by keeping only workspaces around that are sized appropriately. Forward progress is guaranteed by a preallocated max workspace hidden from the LRU.

When getting a workspace, it uses a bitmap to identify the levels that are populated and scans up. If it finds a workspace of a higher level, it uses it, but does not update the last_used time or the corresponding place in the LRU. If we hit memory pressure, we sleep on the max level workspace. We continue to rescan in case we can use a smaller workspace, but eventually should be able to obtain the max level workspace or allocate one again should memory pressure subside.

The memory requirement for decompression is the same as level 1, so decompression can use any available workspace.

The number of workspaces is bound by the workqueue's limit, which is currently 2 (percpu limit). The reclaim timer is used to free inactive/improperly sized workspaces and is set to 307s to avoid colliding with transaction commit (every 30s).

Repeating the experiment from v2 [1], the Silesia corpus was copied to a btrfs filesystem 10 times and then read back after dropping the caches. The btrfs filesystem was on an SSD.
Level  Ratio  Compression (MB/s)  Decompression (MB/s)  Memory (KB)
1      2.658  438.47              910.51                 780
2      2.744  364.86              886.55                1004
3      2.801  336.33              828.41                1260
4      2.858  286.71              886.55                1260
5      2.916  212.77              556.84                1388
6      2.363  119.82              990.85                1516
7      3.000  154.06              849.30                1516
8      3.011  159.54              875.03                1772
9      3.025  100.51              940.15                1772
10     3.033  118.97              616.26                1772
11     3.036   94.19              802.11                1772
12     3.037   73.45              931.49                1772
13     3.041   55.17              835.26                2284
14     3.087   44.70              716.78                2547
15     3.126   37.30              878.84                2547

[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terre...@fb.com/

Signed-off-by: Dennis Zhou
Cc: Nick Terrell
Cc: Omar Sandoval
---
v3:
- warn on level out of bounds
v2:
- increase reclaim timer to 307s from 67s
- move from keeping track of time in ns to jiffies
- remove timer in cleanup code (use del_timer_sync)
- use min_t() instead of if statements in .set_level()
- add header text to describe how workspaces are managed
- nofs_flag type -> unsigned long to unsigned int

 fs/btrfs/super.c |   6 +-
 fs/btrfs/zstd.c  | 250 +--
 2 files changed, 247 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b28dff207383..0ecc513cb56c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -544,9 +544,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
             btrfs_clear_opt(info->mount_opt, NODATASUM);
             btrfs_set_fs_incompat(info, COMPRESS_LZO);
             no_compress = 0;
-        } else if (strcmp(args[0].from, "zstd") == 0) {
+        } else if (strncmp(args[0].from, "zstd", 4) == 0) {
             compress_type = "zstd";
             info->compress_type = BTRFS_COMPRESS_ZSTD;
+            info->compress_level =
+                btrfs_compress_str2level(
+                         BTRFS_COMPRESS_ZSTD,
+                         args[0].from + 4);
             btrfs_set_opt(info->mount_opt, COMPRESS);
             btrfs_clear_opt(info->mount_opt, NODATACOW);
             btrfs_clear_opt(info->mount_opt, NODATASUM);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 2231123fedbe..9946cbea624a 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -3,24 +3,45 @@
  * Copyright (c) 2016-present, Facebook, Inc.
Re: [PATCH v2 09/12] btrfs: change set_level() to bound the level passed in
>From ef64a8d1f4ec44f52bd13a825288a91667121708 Mon Sep 17 00:00:00 2001 From: Dennis Zhou Date: Thu, 17 Jan 2019 12:13:27 -0800 Currently, the only user of set_level() is zlib which sets an internal workspace parameter. As level is now plumbed into get_workspace(), this can be handled there rather than separately. This repurposes set_level() to bound the level passed in so it can be used when setting the mounts compression level and as well as verifying the level before getting a workspace. The other benefit is this divides the meaning of compress(0) and get_workspace(0). The former means we want to use the default compression level of the compression type. The latter means we can use any workspace available. Signed-off-by: Dennis Zhou --- v2: - warn on bad level format eg: zlib:c100 - warn on level out of bounds eg: zlib:10 fs/btrfs/compression.c | 27 +++ fs/btrfs/compression.h | 10 -- fs/btrfs/lzo.c | 3 ++- fs/btrfs/super.c | 4 +++- fs/btrfs/zlib.c| 21 ++--- fs/btrfs/zstd.c| 3 ++- 6 files changed, 48 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index a54b210739bc..560a926c2e2f 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -1007,8 +1007,6 @@ int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping, int ret; workspace = get_workspace(type, level); - - btrfs_compress_op[type]->set_level(workspace, level); ret = btrfs_compress_op[type]->compress_pages(workspace, mapping, start, pages, out_pages, @@ -1561,14 +1559,27 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end) return ret; } -unsigned int btrfs_compress_str2level(const char *str) +unsigned int btrfs_compress_str2level(unsigned int type, const char *str) { - if (strncmp(str, "zlib", 4) != 0) + unsigned int level = 0; + int ret; + + if (!type) return 0; - /* Accepted form: zlib:1 up to zlib:9 and nothing left after the number */ - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0) - return str[5] - '0'; + if (str[0] == ':') { + if (str[1] != '+') { + ret = kstrtouint(str + 1, 10, &level); + if (!ret) + goto set_level; + } + + pr_warn("BTRFS: invalid compression level %s, using default", + str + 1); + } + +set_level: + level = btrfs_compress_op[type]->set_level(level); - return BTRFS_ZLIB_DEFAULT_LEVEL; + return level; } diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 32ba40ebae1d..3f2d36b0aabe 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -97,7 +97,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start, blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, int mirror_num, unsigned long bio_flags); -unsigned btrfs_compress_str2level(const char *str); +unsigned int btrfs_compress_str2level(unsigned int type, const char *str); enum btrfs_compression_type { BTRFS_COMPRESS_NONE = 0, @@ -156,7 +156,13 @@ struct btrfs_compress_op { unsigned long start_byte, size_t srclen, size_t destlen); - void (*set_level)(struct list_head *ws, unsigned int type); + /* +* This bounds the level set by the user to be within range of +* a particular compression type. It returns the level that +* will be used if the level is out of bounds or the default +* if 0 is passed in. 
+*/ + unsigned int (*set_level)(unsigned int level); }; /* the heuristic workspaces are managed via the 0th workspace manager */ diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index f132af45a924..579d53ae256f 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -507,8 +507,9 @@ static int lzo_decompress(struct list_head *ws, unsigned char *data_in, return ret; } -static void lzo_set_level(struct list_head *ws, unsigned int type) +static unsigned int lzo_set_level(unsigned int level) { + return 0; } const struct btrfs_compress_op btrfs_lzo_compress = { diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index c5586ffd1426..b28dff207383 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -529,7 +529,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, if (token != Opt_compress && token != Opt_compress_force)
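For reference, the bounding semantics this patch gives .set_level() can be modeled in a few lines of userspace C: 0 selects the type's default and anything else is clamped into the type's valid range, while lzo has no levels at all. The defaults and bounds below mirror zlib (default 3, max 9) and the zstd values used in this series (default 3, max 15); the function bodies are a sketch, not the kernel implementations.

#include <stdio.h>

/* 0 means "use the default"; anything else is clamped to the type's range */
static unsigned int zlib_set_level(unsigned int level)
{
        return level ? (level > 9 ? 9 : level) : 3;     /* default 3, max 9 */
}

static unsigned int zstd_set_level(unsigned int level)
{
        return level ? (level > 15 ? 15 : level) : 3;   /* default 3, max 15 */
}

static unsigned int lzo_set_level(unsigned int level)
{
        (void)level;                                    /* lzo has no levels */
        return 0;
}

int main(void)
{
        printf("zlib: 0 -> %u, 12 -> %u\n", zlib_set_level(0), zlib_set_level(12));
        printf("zstd: 0 -> %u, 20 -> %u\n", zstd_set_level(0), zstd_set_level(20));
        printf("lzo:  5 -> %u\n", lzo_set_level(5));
        return 0;
}

This is also what divides compress(0) from get_workspace(0): the former maps 0 to the default level, the latter means any workspace will do.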
Re: [PATCH v2 00/12] btrfs: add zstd compression level support
On Wed, Feb 06, 2019 at 04:15:52PM +0100, David Sterba wrote: > On Tue, Feb 05, 2019 at 03:48:48PM -0500, Dennis Zhou wrote: > > > Ok great! I'm going to add a v2 for 0012 to add taking the spin_lock > > > just to be safe in cleanup. I should have that ready in a few minutes. > > > > > > > Before I run v3, are there any other concerns (pending the changes I > > made to address the mentioned make sense)? > > You don't needs to do full v3 resend, either new version of the > individual patches or incremental changes that can be squashed to the > patch. I'm about to move the patchset from topic branch to misc-next so > incrementals work better now as there aren't big updates expected. Ok great! I just sent a v2 of 0009 and v3 of 00012 which include the changes for warning on incorrect input and level out of bounds. I did forget to take out the "Re:" though.. whoops. Thanks, Dennis
Re: [PATCH 1/2] percpu: km: remove SMP check
On Sun, Feb 24, 2019 at 01:13:43PM +, Peng Fan wrote: > percpu-km could only be selected by NEED_PER_CPU_KM which > depends on !SMP, so CONFIG_SMP will be false when choose percpu-km. > > Signed-off-by: Peng Fan > --- > mm/percpu-km.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/percpu-km.c b/mm/percpu-km.c > index 0f643dc2dc65..66e5598be876 100644 > --- a/mm/percpu-km.c > +++ b/mm/percpu-km.c > @@ -27,7 +27,7 @@ > * chunk size is not aligned. percpu-km code will whine about it. > */ > > -#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > +#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > #error "contiguous percpu allocation is incompatible with paged first chunk" > #endif > > -- > 2.16.4 > Hi, I think keeping CONFIG_SMP makes this easier to remember dependencies rather than having to dig into the config. So this is a NACK from me. Thanks, Dennis
Re: [PATCH 2/2] percpu: km: no need to consider pcpu_group_offsets[0]
On Tue, Feb 26, 2019 at 03:15:50PM +, Christopher Lameter wrote: > On Mon, 25 Feb 2019, den...@kernel.org wrote: > > > > @@ -67,7 +67,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) > > > pcpu_set_page_chunk(nth_page(pages, i), chunk); > > > > > > chunk->data = pages; > > > - chunk->base_addr = page_address(pages) - pcpu_group_offsets[0]; > > > + chunk->base_addr = page_address(pages); > > > > > > spin_lock_irqsave(&pcpu_lock, flags); > > > pcpu_chunk_populated(chunk, 0, nr_pages, false); > > > -- > > > 2.16.4 > > > > > > > While I do think you're right, creating a chunk is not a part of the > > critical path and subtracting 0 is incredibly minor overhead. So I'd > > rather keep the code as is to maintain consistency between percpu-vm.c > > and percpu-km.c. > > Well it is confusing if there the expression is there but never used. It > is clearer with the patch. > Okay. I'll apply it with your ack if that's fine. Thanks, Dennis
Re: [PATCH 1/2] percpu: km: remove SMP check
On Tue, Feb 26, 2019 at 03:16:44PM +, Christopher Lameter wrote: > On Mon, 25 Feb 2019, Dennis Zhou wrote: > > > > @@ -27,7 +27,7 @@ > > > * chunk size is not aligned. percpu-km code will whine about it. > > > */ > > > > > > -#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > > +#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > > #error "contiguous percpu allocation is incompatible with paged first > > > chunk" > > > #endif > > > > > > -- > > > 2.16.4 > > > > > > > Hi, > > > > I think keeping CONFIG_SMP makes this easier to remember dependencies > > rather than having to dig into the config. So this is a NACK from me. > > But it simplifies the code and makes it easier to read. > > I think the check isn't quite right after looking at it a little longer. Looking at x86, I believe you can compile it with !SMP and CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK will still be set. This should still work because x86 has an MMU. I think more correctly it would be something like below, but I don't have the time to fully verify it right now. Thanks, Dennis --- diff --git a/mm/percpu-km.c b/mm/percpu-km.c index 0f643dc2dc65..69ccad7d9807 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -27,7 +27,7 @@ * chunk size is not aligned. percpu-km code will whine about it. */ -#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) +#if !defined(CONFIG_MMU) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) #error "contiguous percpu allocation is incompatible with paged first chunk" #endif
Re: [RFC] percpu: decrease pcpu_nr_slots by 1
On Tue, Feb 26, 2019 at 12:09:28AM +, Peng Fan wrote: > Hi Dennis, > > > -Original Message- > > From: den...@kernel.org [mailto:den...@kernel.org] > > Sent: 2019年2月25日 23:24 > > To: Peng Fan > > Cc: t...@kernel.org; c...@linux.com; linux...@kvack.org; > > linux-kernel@vger.kernel.org; van.free...@gmail.com > > Subject: Re: [RFC] percpu: decrease pcpu_nr_slots by 1 > > > > On Sun, Feb 24, 2019 at 09:17:08AM +, Peng Fan wrote: > > > Entry pcpu_slot[pcpu_nr_slots - 2] is wasted with current code logic. > > > pcpu_nr_slots is calculated with `__pcpu_size_to_slot(size) + 2`. > > > Take pcpu_unit_size as 1024 for example, __pcpu_size_to_slot will > > > return max(11 - PCPU_SLOT_BASE_SHIFT + 2, 1), it is 8, so the > > > pcpu_nr_slots will be 10. > > > > > > The chunk with free_bytes 1024 will be linked into pcpu_slot[9]. > > > However free_bytes in range [512,1024) will be linked into > > > pcpu_slot[7], because `fls(512) - PCPU_SLOT_BASE_SHIFT + 2` is 7. > > > So pcpu_slot[8] is has no chance to be used. > > > > > > According comments of PCPU_SLOT_BASE_SHIFT, 1~31 bytes share the > > same > > > slot and PCPU_SLOT_BASE_SHIFT is defined as 5. But actually 1~15 share > > > the same slot 1 if we not take PCPU_MIN_ALLOC_SIZE into consideration, > > > 16~31 share slot 2. Calculation as below: > > > highbit = fls(16) -> highbit = 5 > > > max(5 - PCPU_SLOT_BASE_SHIFT + 2, 1) equals 2, not 1. > > > > > > This patch by decreasing pcpu_nr_slots to avoid waste one slot and let > > > [PCPU_MIN_ALLOC_SIZE, 31) really share the same slot. > > > > > > Signed-off-by: Peng Fan > > > --- > > > > > > V1: > > > Not very sure about whether it is intended to leave the slot there. > > > > > > mm/percpu.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c index > > > 8d9933db6162..12a9ba38f0b5 100644 > > > --- a/mm/percpu.c > > > +++ b/mm/percpu.c > > > @@ -219,7 +219,7 @@ static bool pcpu_addr_in_chunk(struct pcpu_chunk > > > *chunk, void *addr) static int __pcpu_size_to_slot(int size) { > > > int highbit = fls(size);/* size is in bytes */ > > > - return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > > > + return max(highbit - PCPU_SLOT_BASE_SHIFT + 1, 1); > > > } > > > > Honestly, it may be better to just have [1-16) [16-31) be separate. I'm > > working > > on a change to this area, so I may change what's going on here. > > > > > > > > static int pcpu_size_to_slot(int size) @@ -2145,7 +2145,7 @@ int > > > __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > > >* Allocate chunk slots. The additional last slot is for > > >* empty chunks. > > >*/ > > > - pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; > > > + pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 1; > > > pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]), > > > SMP_CACHE_BYTES); > > > for (i = 0; i < pcpu_nr_slots; i++) > > > -- > > > 2.16.4 > > > > > > > This is a tricky change. The nice thing about keeping the additional > > slot around is that it ensures a distinction between a completely empty > > chunk and a nearly empty chunk. > > Are there any issues met before if not keeping the unused slot? > From reading the code and git history I could not find information. > I tried this code on aarch64 qemu and did not meet issues. > This change would require verification that all paths lead to power of 2 chunk sizes and most likely a BUG_ON if that's not the case. 
So while this would work, we're holding onto an additional slot also to be used for chunk reclamation via pcpu_balance_workfn(). If a chunk was not a power of 2, so that the last slot no longer held only entirely empty chunks, we could free a chunk with addresses still in use. > > It happens to be that the logic creates > > power of 2 chunks which ends up being an additional slot anyway. > > > So, > > given that this logic is tricky and architecture dependent, > > Could you share more information about architecture dependent? > The crux of the logic is in pcpu_build_alloc_info(). It's been some time since I've thought deeply about it, but I don't believe there is a guarantee that it will be a power of 2 chunk. Thanks, Dennis
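The slot arithmetic being debated here is easy to check outside the kernel. The sketch below models fls() and __pcpu_size_to_slot() (PCPU_SLOT_BASE_SHIFT is 5 in mm/percpu.c) plus the pcpu_size_to_slot() special case that sends a fully free chunk to the last slot; fls32 and the demo values are illustrative. For a 1024-byte unit it prints pcpu_nr_slots = 10, slot 7 for free sizes in [512, 1024) and slot 9 for a fully free chunk, which is why slot 8 goes unused.

#include <stdio.h>

#define PCPU_SLOT_BASE_SHIFT 5

static int fls32(unsigned int x)        /* models the kernel's fls(): 1-based highest set bit */
{
        int bit = 0;

        while (x) {
                bit++;
                x >>= 1;
        }
        return bit;
}

static int __pcpu_size_to_slot(int size)
{
        int slot = fls32(size) - PCPU_SLOT_BASE_SHIFT + 2;

        return slot > 1 ? slot : 1;     /* max(..., 1) */
}

int main(void)
{
        int unit_size = 1024;
        int nr_slots = __pcpu_size_to_slot(unit_size) + 2;      /* 8 + 2 = 10 */
        int sizes[] = { 16, 31, 512, 1023, 1024 };

        printf("pcpu_nr_slots = %d\n", nr_slots);
        for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                /* pcpu_size_to_slot() special-cases a fully free chunk */
                int slot = sizes[i] == unit_size ? nr_slots - 1
                                                 : __pcpu_size_to_slot(sizes[i]);
                printf("free %4d bytes -> slot %d\n", sizes[i], slot);
        }
        return 0;
}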
Re: [PATCH 1/2] percpu: km: remove SMP check
On Wed, Feb 27, 2019 at 01:02:16PM +, Peng Fan wrote: > Hi Dennis > > > -Original Message- > > From: Dennis Zhou [mailto:den...@kernel.org] > > Sent: 2019年2月27日 1:04 > > To: Christopher Lameter > > Cc: Peng Fan ; t...@kernel.org; linux...@kvack.org; > > linux-kernel@vger.kernel.org; van.free...@gmail.com > > Subject: Re: [PATCH 1/2] percpu: km: remove SMP check > > > > On Tue, Feb 26, 2019 at 03:16:44PM +, Christopher Lameter wrote: > > > On Mon, 25 Feb 2019, Dennis Zhou wrote: > > > > > > > > @@ -27,7 +27,7 @@ > > > > > * chunk size is not aligned. percpu-km code will whine about it. > > > > > */ > > > > > > > > > > -#if defined(CONFIG_SMP) && > > > > > defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > > > > +#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > > > > #error "contiguous percpu allocation is incompatible with paged first > > chunk" > > > > > #endif > > > > > > > > > > -- > > > > > 2.16.4 > > > > > > > > > > > > > Hi, > > > > > > > > I think keeping CONFIG_SMP makes this easier to remember > > > > dependencies rather than having to dig into the config. So this is a > > > > NACK > > from me. > > > > > > But it simplifies the code and makes it easier to read. > > > > > > > > > > I think the check isn't quite right after looking at it a little longer. > > Looking at x86, I believe you can compile it with !SMP and > > CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK will still be set. This should > > still work because x86 has an MMU. > > You are right, x86 could boots up with NEED_PER_CPU_PAGE_FIRST_CHUNK > =y and SMP=n. Tested with qemu, info as below: > > / # zcat /proc/config.gz | grep NEED_PER_CPU_KM > CONFIG_NEED_PER_CPU_KM=y > / # zcat /proc/config.gz | grep SMP > CONFIG_BROKEN_ON_SMP=y > # CONFIG_SMP is not set > CONFIG_GENERIC_SMP_IDLE_THREAD=y > / # zcat /proc/config.gz | grep NEED_PER_CPU_PAGE_FIRST_CHUNK > CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y > / # cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 6 > model : 6 > model name : QEMU Virtual CPU version 2.5+ > stepping: 3 > cpu MHz : 3192.613 > cache size : 512 KB > fpu : yes > fpu_exception : yes > cpuid level : 13 > wp : yes > flags : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov > pat pse36 clflush mmx fxsr sse sse2 syscall nx lm nopl cpuid pni cx16 > hypervisor lahf_lm svm 3dnowprefetl > bugs: fxsave_leak sysret_ss_attrs spectre_v1 spectre_v2 > spec_store_bypass > bogomips: 6385.22 > TLB size: 1024 4K pages > clflush size: 64 > cache_alignment : 64 > address sizes : 42 bits physical, 48 bits virtual > power management: > > > But from the comments in this file: > " > * - CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK must not be defined. It's > * not compatible with PER_CPU_KM. EMBED_FIRST_CHUNK should work > * fine. > " > > I did not read into details why it is not allowed, but x86 could still work > with KM > and NEED_PER_CPU_PAGE_FIRST_CHUNK. > The first chunk requires special handling on SMP to bring the static variables into the percpu address space. On UP, identity mapping makes static variables indistinguishable by aligning the percpu address space and the virtual adress space. The percpu-km allocator allocates full chunks at a time to deal with NOMMU archs. So the difference is if the virtual address space is the same as the physical. > > > > I think more correctly it would be something like below, but I don't have > > the > > time to fully verify it right now. 
> > > > Thanks, > > Dennis > > > > --- > > diff --git a/mm/percpu-km.c b/mm/percpu-km.c index > > 0f643dc2dc65..69ccad7d9807 100644 > > --- a/mm/percpu-km.c > > +++ b/mm/percpu-km.c > > @@ -27,7 +27,7 @@ > > * chunk size is not aligned. percpu-km code will whine about it. > > */ > > > > -#if defined(CONFIG_SMP) && > > defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > +#if !defined(CONFIG_MMU) && > > +defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) > > #error "contiguous percpu allocation is incompatible with paged first > > chunk" > > #endif > > > > Acked-by: Peng Fan > > Thanks, > Peng While this change may seem right to me. Verification would be to double check other architectures too. x86 just happened to be a counter example I had in mind. Unless someone reports this as being an issue or someone takes the time to validate this more thoroughly than my cursory look. I think the risk of this outweighs the benefit. This may be something I fix in the future when I have more time. This would also involve making sure the comments are consistent. Thanks, Dennis
Re: [RFC] percpu: decrease pcpu_nr_slots by 1
On Wed, Feb 27, 2019 at 01:33:15PM +, Peng Fan wrote: > Hi Dennis, > > > -Original Message- > > From: Dennis Zhou [mailto:den...@kernel.org] > > Sent: 2019年2月27日 1:33 > > To: Peng Fan > > Cc: den...@kernel.org; t...@kernel.org; c...@linux.com; linux...@kvack.org; > > linux-kernel@vger.kernel.org; van.free...@gmail.com > > Subject: Re: [RFC] percpu: decrease pcpu_nr_slots by 1 > > > > On Tue, Feb 26, 2019 at 12:09:28AM +, Peng Fan wrote: > > > Hi Dennis, > > > > > > > -Original Message- > > > > From: den...@kernel.org [mailto:den...@kernel.org] > > > > Sent: 2019年2月25日 23:24 > > > > To: Peng Fan > > > > Cc: t...@kernel.org; c...@linux.com; linux...@kvack.org; > > > > linux-kernel@vger.kernel.org; van.free...@gmail.com > > > > Subject: Re: [RFC] percpu: decrease pcpu_nr_slots by 1 > > > > > > > > On Sun, Feb 24, 2019 at 09:17:08AM +, Peng Fan wrote: > > > > > Entry pcpu_slot[pcpu_nr_slots - 2] is wasted with current code logic. > > > > > pcpu_nr_slots is calculated with `__pcpu_size_to_slot(size) + 2`. > > > > > Take pcpu_unit_size as 1024 for example, __pcpu_size_to_slot will > > > > > return max(11 - PCPU_SLOT_BASE_SHIFT + 2, 1), it is 8, so the > > > > > pcpu_nr_slots will be 10. > > > > > > > > > > The chunk with free_bytes 1024 will be linked into pcpu_slot[9]. > > > > > However free_bytes in range [512,1024) will be linked into > > > > > pcpu_slot[7], because `fls(512) - PCPU_SLOT_BASE_SHIFT + 2` is 7. > > > > > So pcpu_slot[8] is has no chance to be used. > > > > > > > > > > According comments of PCPU_SLOT_BASE_SHIFT, 1~31 bytes share the > > > > same > > > > > slot and PCPU_SLOT_BASE_SHIFT is defined as 5. But actually 1~15 > > > > > share the same slot 1 if we not take PCPU_MIN_ALLOC_SIZE into > > > > > consideration, > > > > > 16~31 share slot 2. Calculation as below: > > > > > highbit = fls(16) -> highbit = 5 > > > > > max(5 - PCPU_SLOT_BASE_SHIFT + 2, 1) equals 2, not 1. > > > > > > > > > > This patch by decreasing pcpu_nr_slots to avoid waste one slot and > > > > > let [PCPU_MIN_ALLOC_SIZE, 31) really share the same slot. > > > > > > > > > > Signed-off-by: Peng Fan > > > > > --- > > > > > > > > > > V1: > > > > > Not very sure about whether it is intended to leave the slot there. > > > > > > > > > > mm/percpu.c | 4 ++-- > > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c index > > > > > 8d9933db6162..12a9ba38f0b5 100644 > > > > > --- a/mm/percpu.c > > > > > +++ b/mm/percpu.c > > > > > @@ -219,7 +219,7 @@ static bool pcpu_addr_in_chunk(struct > > > > > pcpu_chunk *chunk, void *addr) static int __pcpu_size_to_slot(int > > > > > size) > > { > > > > > int highbit = fls(size);/* size is in bytes */ > > > > > - return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > > > > > + return max(highbit - PCPU_SLOT_BASE_SHIFT + 1, 1); > > > > > } > > > > > > > > Honestly, it may be better to just have [1-16) [16-31) be separate. > > Missed to reply this in previous thread, the following comments let > me think the chunk slot calculation might be wrong, so this comment > needs to be updated, saying "[PCPU_MIN_ALLOC_SIZE - 15) bytes share > the same slot", if [1-16)[16-31) is expected. > " > /* the slots are sorted by free bytes left, 1-31 bytes share the same slot */ > #define PCPU_SLOT_BASE_SHIFT5 > " > > > > > I'm working on a change to this area, so I may change what's going on > > here. 
> > > > > > > > > > > > > > static int pcpu_size_to_slot(int size) @@ -2145,7 +2145,7 @@ int > > > > > __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, > > > > >* Allocate chunk slots. The additional last slot is for > > > > >* empty chunks. > > > > >*/ > > > > > - pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; > > > > > + pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 1;
Re: [PATCH v2] btrfs: zstd ensure reclaim timer is properly cleaned up
On Wed, Feb 27, 2019 at 05:44:41PM +0100, David Sterba wrote: > On Fri, Feb 22, 2019 at 02:53:48PM -0500, Dennis Zhou wrote: > > The timer function, zstd_reclaim_timer_fn(), reschedules itself under > > certain conditions. When cleaning up, take the lock and remove all > > workspaces. This prevents the timer from rearming itself. Lastly, switch > > to del_timer_sync() to ensure that the timer function can't trigger as > > we're unloading. > > > > Signed-off-by: Dennis Zhou > > Reviewed-by: David Sterba > Thanks! > > --- > > v2: > > - cleanup workspaces and then disable the timer > > > > fs/btrfs/zstd.c | 6 -- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c > > index 3e418a3aeb11..6b9e29d050f3 100644 > > --- a/fs/btrfs/zstd.c > > +++ b/fs/btrfs/zstd.c > > @@ -195,8 +195,7 @@ static void zstd_cleanup_workspace_manager(void) > > struct workspace *workspace; > > int i; > > > > - del_timer(&wsm.timer); > > - > > + spin_lock(&wsm.lock); > > for (i = 0; i < ZSTD_BTRFS_MAX_LEVEL; i++) { > > while (!list_empty(&wsm.idle_ws[i])) { > > workspace = container_of(wsm.idle_ws[i].next, > > @@ -206,6 +205,9 @@ static void zstd_cleanup_workspace_manager(void) > > wsm.ops->free_workspace(&workspace->list); > > I've noticed while reading the code, why do you use the indirect call > here? The wsm.ops points to btrfs_zstd_compress so free_workspace is > always zstd_free_workspace. > > The compiler is usually smart to replace such things by direct call if > the type has not escaped, but this is not true for btrfs_compress_op so > the indirect function call must be preserved. I don't have a strong reason to use the indirect call here. It was just to make it consistent for everyone to use the indirection. This at least is in the cleanup path, so I don't think performance is that important? But I don't feel strongly for or against calling zstd_free_workspace() directly. Thanks, Dennis
[PATCH] btrfs: remove indirect function calls from zstd
While calling functions inside zstd, we don't need to use the indirection provided by the workspace_manager. Forward declarations are added to maintain the function order of btrfs_compress_op. Signed-off-by: Dennis Zhou --- fs/btrfs/zstd.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index 6b9e29d050f3..a6ff07cf11d5 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -90,6 +90,9 @@ static inline struct workspace *list_to_workspace(struct list_head *list) return container_of(list, struct workspace, list); } +static void zstd_free_workspace(struct list_head *ws); +static struct list_head *zstd_alloc_workspace(unsigned int level); + /* * zstd_reclaim_timer_fn - reclaim timer * @t: timer @@ -124,7 +127,7 @@ static void zstd_reclaim_timer_fn(struct timer_list *timer) level = victim->level; list_del(&victim->lru_list); list_del(&victim->list); - wsm.ops->free_workspace(&victim->list); + zstd_free_workspace(&victim->list); if (list_empty(&wsm.idle_ws[level - 1])) clear_bit(level - 1, &wsm.active_map); @@ -180,7 +183,7 @@ static void zstd_init_workspace_manager(void) for (i = 0; i < ZSTD_BTRFS_MAX_LEVEL; i++) INIT_LIST_HEAD(&wsm.idle_ws[i]); - ws = wsm.ops->alloc_workspace(ZSTD_BTRFS_MAX_LEVEL); + ws = zstd_alloc_workspace(ZSTD_BTRFS_MAX_LEVEL); if (IS_ERR(ws)) { pr_warn( "BTRFS: cannot preallocate zstd compression workspace\n"); @@ -202,7 +205,7 @@ static void zstd_cleanup_workspace_manager(void) struct workspace, list); list_del(&workspace->list); list_del(&workspace->lru_list); - wsm.ops->free_workspace(&workspace->list); + zstd_free_workspace(&workspace->list); } } spin_unlock(&wsm.lock); @@ -272,7 +275,7 @@ static struct list_head *zstd_get_workspace(unsigned int level) return ws; nofs_flag = memalloc_nofs_save(); - ws = wsm.ops->alloc_workspace(level); + ws = zstd_alloc_workspace(level); memalloc_nofs_restore(nofs_flag); if (IS_ERR(ws)) { -- 2.17.1
Re: [PATCH v2] btrfs: zstd ensure reclaim timer is properly cleaned up
On Wed, Feb 27, 2019 at 07:36:50PM +0100, David Sterba wrote: > On Wed, Feb 27, 2019 at 01:29:16PM -0500, Dennis Zhou wrote: > > > I've noticed while reading the code, why do you use the indirect call > > > here? The wsm.ops points to btrfs_zstd_compress so free_workspace is > > > always zstd_free_workspace. > > > > > > The compiler is usually smart to replace such things by direct call if > > > the type has not escaped, but this is not true for btrfs_compress_op so > > > the indirect function call must be preserved. > > > > I don't have a strong reason to use the indirect call here. It was just > > to make it consistent for everyone to use the indirection. This at least > > is in the cleanup path, so I don't think performance is that important? > > It's not just that, the timer uses it too and there are indirect calls > of the alloc_workspace callback. The indirection is not used by lzo nor > zlib code, so I don't see what 'everyone' you mean. In the generic > compression code it makes sense, I see that. > > > But I don't feel strongly for or against calling zstd_free_workspace() > > directly. > > I feel strongly about not using the indirection when not necessary :) Great :). I sent you a patch just now to remove the indirection [1]. [1] https://lore.kernel.org/linux-btrfs/20190227212128.38491-1-den...@kernel.org Thanks, Dennis
[GIT PULL] percpu changes for v5.12-rc1
Hi Linus, Percpu had a cleanup come in that makes use of the cpu bitmask helpers instead of the current iterative approach. This clean up has an adverse interaction when clang's inlining sensitivity is changed such that not all sites are inlined resulting in modpost being upset with section mismatch due to percpu setup being marked __init. It is fixed by introducing __flatten to compiler_attributes.h. This has been supported since clang 3.5 and gcc 4.4 [1]. [1] https://lore.kernel.org/lkml/CAKwvOdnxnooqtyeSem63V_P5980jc0Z2PDG=0im8ixeytsa...@mail.gmail.com/ Thanks, Dennis The following changes since commit 92bf22614b21a2706f4993b278017e437f7785b3: Linux 5.11-rc7 (2021-02-07 13:57:38 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git for-5.12 for you to fetch changes up to 258e0815e2b1706e87c0d874211097aa8a7aa52f: percpu: fix clang modpost section mismatch (2021-02-14 18:15:15 +) Dennis Zhou (1): percpu: fix clang modpost section mismatch Wonhyuk Yang (1): percpu: reduce the number of cpu distance comparisons include/linux/compiler_attributes.h | 6 ++ mm/percpu.c | 36 +--- 2 files changed, 27 insertions(+), 15 deletions(-)
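A rough sketch of what the fix boils down to, under the assumption that __flatten expands to the usual attribute (the real macro lives in include/linux/compiler_attributes.h, and the exact percpu function annotated may differ from the demo name below): __flatten asks the compiler to inline every call made inside the annotated function, so the cpumask helper can no longer survive as a standalone non-__init function referencing __initdata.

#ifndef __flatten
#define __flatten __attribute__((flatten))      /* assumed definition */
#endif

/* illustrative stand-ins for the kernel annotations */
#define __init
#define __initdata

struct cpumask { unsigned long bits[4]; };

static struct cpumask mask __initdata;

static inline void cpumask_clear_cpu(int cpu, struct cpumask *m)
{
        m->bits[cpu / (8 * sizeof(long))] &= ~(1UL << (cpu % (8 * sizeof(long))));
}

/* __flatten asks the compiler to inline every call made inside this body */
static void __init __flatten pcpu_build_alloc_info_demo(void)
{
        cpumask_clear_cpu(3, &mask);
}

int main(void)
{
        pcpu_build_alloc_info_demo();
        return 0;
}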
[PATCH] percpu: fix clang modpost warning in pcpu_build_alloc_info()
This is an unusual situation so I thought it best to explain it in a separate patch. "percpu: reduce the number of cpu distance comparisons" introduces a dependency on cpumask helper functions in __init code. This code references a struct cpumask annotated __initdata. When the function is inlined (gcc), everything is fine, but clang decides not to inline these function calls. This causes modpost to warn about an __initdata access by a function not annotated with __init [1]. Ways I thought about fixing it: 1. figure out why clang thinks this inlining is too costly. 2. create a wrapper function annotated __init (this). 3. annotate cpumask with __refdata. Ultimately it comes down to if it's worth saving the cpumask memory and allowing it to be freed. IIUC, __refdata won't be freed, so option 3 is just a little wasteful. 1 is out of my depth, leaving 2. I don't feel great about this behavior being dependent on inlining semantics, but cpumask helpers are small and probably should be inlined. modpost complaint: WARNING: modpost: vmlinux.o(.text+0x735425): Section mismatch in reference from the function cpumask_clear_cpu() to the variable .init.data:pcpu_build_alloc_info.mask The function cpumask_clear_cpu() references the variable __initdata pcpu_build_alloc_info.mask. This is often because cpumask_clear_cpu lacks a __initdata annotation or the annotation of pcpu_build_alloc_info.mask is wrong. clang output: mm/percpu.c:2724:5: remark: cpumask_clear_cpu not inlined into pcpu_build_alloc_info because too costly to inline (cost=725, threshold=325) [-Rpass-missed=inline] [1] https://lore.kernel.org/linux-mm/202012220454.9f6bkz9q-...@intel.com/ Reported-by: kernel test robot Signed-off-by: Dennis Zhou --- This is on top of percpu#for-5.12. mm/percpu.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 80f8f885a990..357977c4cb00 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2642,6 +2642,18 @@ early_param("percpu_alloc", percpu_alloc_setup); /* pcpu_build_alloc_info() is used by both embed and page first chunk */ #if defined(BUILD_EMBED_FIRST_CHUNK) || defined(BUILD_PAGE_FIRST_CHUNK) + +/* + * This wrapper is to avoid a warning where cpumask_clear_cpu() is not inlined + * when compiling with clang causing modpost to warn about accessing __initdata + * from a non __init function. By doing this, we allow the struct cpumask to be + * freed instead of it taking space by annotating with __refdata. + */ +static void __init pcpu_cpumask_clear_cpu(int cpu, struct cpumask *mask) +{ + cpumask_clear_cpu(cpu, mask); +} + /** * pcpu_build_alloc_info - build alloc_info considering distances between CPUs * @reserved_size: the size of reserved percpu area in bytes @@ -2713,7 +2725,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info( cpu = cpumask_first(&mask); group_map[cpu] = group; group_cnt[group]++; - cpumask_clear_cpu(cpu, &mask); + pcpu_cpumask_clear_cpu(cpu, &mask); for_each_cpu(tcpu, &mask) { if (!cpu_distance_fn || @@ -2721,7 +2733,7 @@ static struct pcpu_alloc_info * __init pcpu_build_alloc_info( cpu_distance_fn(tcpu, cpu) == LOCAL_DISTANCE)) { group_map[tcpu] = group; group_cnt[group]++; - cpumask_clear_cpu(tcpu, &mask); + pcpu_cpumask_clear_cpu(tcpu, &mask); } } } -- 2.29.2.729.g45daf8777d-goog
Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation
On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote: > This patch implements partial depopulation of percpu chunks. > > As now, a chunk can be depopulated only as a part of the final > destruction, when there are no more outstanding allocations. However > to minimize a memory waste, it might be useful to depopulate a > partially filed chunk, if a small number of outstanding allocations > prevents the chunk from being reclaimed. > > This patch implements the following depopulation process: it scans > over the chunk pages, looks for a range of empty and populated pages > and performs the depopulation. To avoid races with new allocations, > the chunk is previously isolated. After the depopulation the chunk is > returned to the original slot (but is appended to the tail of the list > to minimize the chances of population). > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(), > the chunk can be concurrently moved to a different slot. So we need > to isolate it again on each step. pcpu_alloc_mutex is held, so the > chunk can't be populated/depopulated asynchronously. > > Signed-off-by: Roman Gushchin > --- > mm/percpu.c | 90 + > 1 file changed, 90 insertions(+) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 6596a0a4286e..78c55c73fa28 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type > type) > mutex_unlock(&pcpu_alloc_mutex); > } > > +/** > + * pcpu_shrink_populated - scan chunks and release unused pages to the system > + * @type: chunk type > + * > + * Scan over all chunks, find those marked with the depopulate flag and > + * try to release unused pages to the system. On every attempt clear the > + * chunk's depopulate flag to avoid wasting CPU by scanning the same > + * chunk again and again. > + */ > +static void pcpu_shrink_populated(enum pcpu_chunk_type type) > +{ > + struct list_head *pcpu_slot = pcpu_chunk_list(type); > + struct pcpu_chunk *chunk; > + int slot, i, off, start; > + > + spin_lock_irq(&pcpu_lock); > + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) { > +restart: > + list_for_each_entry(chunk, &pcpu_slot[slot], list) { > + bool isolated = false; > + > + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) > + break; > + Deallocation makes me a little worried for the atomic case as now we could in theory pathologically scan deallocated chunks before finding a populated one. I wonder if we should do something like once a chunk gets depopulated, it gets deprioritized and then only once we exhaust looking through allocated chunks we then find a depopulated chunk and add it back into the rotation. Possibly just add another set of slots? I guess it adds a few dimensions to pcpu_slots after the memcg change. > + for (i = 0, start = -1; i < chunk->nr_pages; i++) { > + if (!chunk->nr_empty_pop_pages) > + break; > + > + /* > + * If the page is empty and populated, start or > + * extend the [start, i) range. > + */ > + if (test_bit(i, chunk->populated)) { > + off = find_first_bit( > + pcpu_index_alloc_map(chunk, i), > + PCPU_BITMAP_BLOCK_BITS); > + if (off >= PCPU_BITMAP_BLOCK_BITS) { > + if (start == -1) > + start = i; > + continue; > + } Here instead of looking at the alloc_map, you can look at the pcpu_block_md and look for a fully free contig_hint. > + } > + > + /* > + * Otherwise check if there is an active range, > + * and if yes, depopulate it. 
> + */ > + if (start == -1) > + continue; > + > + /* > + * Isolate the chunk, so new allocations > + * wouldn't be served using this chunk. > + * Async releases can still happen. > + */ > + if (!list_empty(&chunk->list)) { > + list_del_init(&chunk->list); > + isolated = true; Maybe when freeing a chunk, we should consider just isolating it period and preventing pcpu_free_a
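A userspace model of the review suggestion above: decide that a page is empty by consulting its block metadata rather than scanning the alloc_map for a clear first bit. The struct and field names approximate mm/percpu-internal.h, and PCPU_BITMAP_BLOCK_BITS is an illustrative value; this is only a sketch of the idea, not the kernel code.

#include <stdbool.h>
#include <stdio.h>

#define PCPU_BITMAP_BLOCK_BITS 1024     /* alloc map bits per page; illustrative value */

struct pcpu_block_md {
        int contig_hint;                /* size of the largest contiguous free run */
        /* other fields (scan_hint, first_free, ...) omitted */
};

/* a page is empty, and a candidate for depopulation, iff its largest free run spans the block */
static bool pcpu_block_is_empty(const struct pcpu_block_md *block)
{
        return block->contig_hint == PCPU_BITMAP_BLOCK_BITS;
}

int main(void)
{
        struct pcpu_block_md untouched = { .contig_hint = PCPU_BITMAP_BLOCK_BITS };
        struct pcpu_block_md in_use    = { .contig_hint = 37 };

        printf("untouched page empty? %d\n", pcpu_block_is_empty(&untouched));
        printf("in-use page empty?    %d\n", pcpu_block_is_empty(&in_use));
        return 0;
}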
Re: [PATCH rfc 2/4] percpu: split __pcpu_balance_workfn()
On Wed, Mar 24, 2021 at 12:06:24PM -0700, Roman Gushchin wrote: > __pcpu_balance_workfn() became fairly big and hard to follow, but in > fact it consists of two fully independent parts, responsible for > the destruction of excessive free chunks and population of necessarily > amount of free pages. > > In order to simplify the code and prepare for adding of a new > functionality, split it in two functions: > > 1) pcpu_balance_free, > 2) pcpu_balance_populated. > > Move the taking/releasing of the pcpu_alloc_mutex to an upper level > to keep the current synchronization in place. > > Signed-off-by: Roman Gushchin > --- > mm/percpu.c | 46 +- > 1 file changed, 29 insertions(+), 17 deletions(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 78c55c73fa28..015d076893f5 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1930,31 +1930,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, > size_t align) > } > > /** > - * __pcpu_balance_workfn - manage the amount of free chunks and populated > pages > + * pcpu_balance_free - manage the amount of free chunks > * @type: chunk type > * > - * Reclaim all fully free chunks except for the first one. This is also > - * responsible for maintaining the pool of empty populated pages. However, > - * it is possible that this is called when physical memory is scarce causing > - * OOM killer to be triggered. We should avoid doing so until an actual > - * allocation causes the failure as it is possible that requests can be > - * serviced from already backed regions. > + * Reclaim all fully free chunks except for the first one. > */ > -static void __pcpu_balance_workfn(enum pcpu_chunk_type type) > +static void pcpu_balance_free(enum pcpu_chunk_type type) > { > - /* gfp flags passed to underlying allocators */ > - const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > LIST_HEAD(to_free); > struct list_head *pcpu_slot = pcpu_chunk_list(type); > struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1]; > struct pcpu_chunk *chunk, *next; > - int slot, nr_to_pop, ret; > > /* >* There's no reason to keep around multiple unused chunks and VM >* areas can be scarce. Destroy all free chunks except for one. >*/ > - mutex_lock(&pcpu_alloc_mutex); > spin_lock_irq(&pcpu_lock); > > list_for_each_entry_safe(chunk, next, free_head, list) { > @@ -1982,6 +1973,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type > type) > pcpu_destroy_chunk(chunk); > cond_resched(); > } > +} > + > +/** > + * pcpu_balance_populated - manage the amount of populated pages > + * @type: chunk type > + * > + * Maintain a certain amount of populated pages to satisfy atomic > allocations. > + * It is possible that this is called when physical memory is scarce causing > + * OOM killer to be triggered. We should avoid doing so until an actual > + * allocation causes the failure as it is possible that requests can be > + * serviced from already backed regions. 
> + */ > +static void pcpu_balance_populated(enum pcpu_chunk_type type) > +{ > + /* gfp flags passed to underlying allocators */ > + const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > + struct list_head *pcpu_slot = pcpu_chunk_list(type); > + struct pcpu_chunk *chunk; > + int slot, nr_to_pop, ret; > > /* >* Ensure there are certain number of free populated pages for > @@ -2051,8 +2061,6 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type > type) > goto retry_pop; > } > } > - > - mutex_unlock(&pcpu_alloc_mutex); > } > > /** > @@ -2149,14 +2157,18 @@ static void pcpu_shrink_populated(enum > pcpu_chunk_type type) > * pcpu_balance_workfn - manage the amount of free chunks and populated pages > * @work: unused > * > - * Call __pcpu_balance_workfn() for each chunk type. > + * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type. > */ > static void pcpu_balance_workfn(struct work_struct *work) > { > enum pcpu_chunk_type type; > > - for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) > - __pcpu_balance_workfn(type); > + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { > + mutex_lock(&pcpu_alloc_mutex); > + pcpu_balance_free(type); > + pcpu_balance_populated(type); > + mutex_unlock(&pcpu_alloc_mutex); > + } > } > > /** > -- > 2.30.2 > Reviewed-by: Dennis Zhou This makes sense. If you want me to pick this and the last patch up first I can. Otherwise, do you mind moving this to the front of the stack because it is a clean up? Thanks, Dennis
Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation
On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote: > To return unused memory to the system schedule an async > depopulation of percpu chunks. > > To balance between scanning too much and creating an overhead because > of the pcpu_lock contention and scanning not enough, let's track an > amount of chunks to scan and mark chunks which are potentially a good > target for the depopulation with a new boolean flag. The async > depopulation work will clear the flag after trying to depopulate a > chunk (successfully or not). > > This commit suggest the following logic: if a chunk > 1) has more than 1/4 of total pages free and populated > 2) isn't a reserved chunk > 3) isn't entirely free > 4) isn't alone in the corresponding slot I'm not sure I like the check for alone that much. The reason being what about some odd case where each slot has a single chunk, but every slot is populated. It doesn't really make sense to keep them all around. I think there is some decision making we can do here to handle packing post depopulation allocations into a handful of chunks. Depopulated chunks could be sidelined with say a flag ->depopulated to prevent the first attempt of allocations from using them. And then we could bring back a chunk 1 by 1 somehow to attempt to suffice the allocation. I'm not too sure if this is a good idea, just a thought. > it's a good target for depopulation. > > If there are 2 or more of such chunks, an async depopulation > is scheduled. > > Because chunk population and depopulation are opposite processes > which make a little sense together, split out the shrinking part of > pcpu_balance_populated() into pcpu_grow_populated() and make > pcpu_balance_populated() calling into pcpu_grow_populated() or > pcpu_shrink_populated() conditionally. > > Signed-off-by: Roman Gushchin > --- > mm/percpu-internal.h | 1 + > mm/percpu.c | 111 --- > 2 files changed, 85 insertions(+), 27 deletions(-) > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index 18b768ac7dca..1c5b92af02eb 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -67,6 +67,7 @@ struct pcpu_chunk { > > void*data; /* chunk data */ > boolimmutable; /* no [de]population allowed */ > + booldepopulate; /* depopulation hint */ > int start_offset; /* the overlap with the previous > region to have a page aligned > base_addr */ > diff --git a/mm/percpu.c b/mm/percpu.c > index 015d076893f5..148137f0fc0b 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks); > */ > int pcpu_nr_empty_pop_pages; > > +/* > + * Track the number of chunks with a lot of free memory. > + * It's used to release unused pages to the system. > + */ > +static int pcpu_nr_chunks_to_depopulate; > + > /* > * The number of populated pages in use by the allocator, protected by > * pcpu_lock. This number is kept per a unit per chunk (i.e. 
when a page > gets > @@ -1955,6 +1961,11 @@ static void pcpu_balance_free(enum pcpu_chunk_type > type) > if (chunk == list_first_entry(free_head, struct pcpu_chunk, > list)) > continue; > > + if (chunk->depopulate) { > + chunk->depopulate = false; > + pcpu_nr_chunks_to_depopulate--; > + } > + > list_move(&chunk->list, &to_free); > } > > @@ -1976,7 +1987,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type) > } > > /** > - * pcpu_balance_populated - manage the amount of populated pages > + * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations > * @type: chunk type > * > * Maintain a certain amount of populated pages to satisfy atomic > allocations. > @@ -1985,35 +1996,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type > type) > * allocation causes the failure as it is possible that requests can be > * serviced from already backed regions. > */ > -static void pcpu_balance_populated(enum pcpu_chunk_type type) > +static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop) > { > /* gfp flags passed to underlying allocators */ > const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > struct list_head *pcpu_slot = pcpu_chunk_list(type); > struct pcpu_chunk *chunk; > - int slot, nr_to_pop, ret; > + int slot, ret; > > - /* > - * Ensure there are certain number of free populated pages for > - * atomic allocs. Fill up from the most packed so that atomic > - * allocs don't increase fragmentation. If atomic allocation > - * failed previously, always populate the maximum amount. This > - * should prevent atomic allo
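The four conditions listed in this patch's description can be summarized as a small predicate. The sketch below is a userspace approximation only: nr_chunks_in_slot stands in for the "alone in the corresponding slot" check and the struct fields approximate the real pcpu_chunk.

#include <stdbool.h>
#include <stdio.h>

struct chunk {
        int  nr_pages;                  /* total pages backing the chunk */
        int  nr_empty_pop_pages;        /* pages that are populated but free */
        int  free_bytes;
        int  size_bytes;
        bool reserved;
};

static bool good_depopulation_target(const struct chunk *c, int nr_chunks_in_slot)
{
        if (c->reserved)                                /* 2) the reserved chunk */
                return false;
        if (c->free_bytes == c->size_bytes)             /* 3) entirely free: destroy it instead */
                return false;
        if (nr_chunks_in_slot < 2)                      /* 4) alone in its slot */
                return false;
        /* 1) more than 1/4 of the pages are both free and populated */
        return c->nr_empty_pop_pages > c->nr_pages / 4;
}

int main(void)
{
        struct chunk c = {
                .nr_pages = 8, .nr_empty_pop_pages = 5,
                .free_bytes = 24576, .size_bytes = 32768,
        };

        printf("with a neighbour: %d\n", good_depopulation_target(&c, 3));      /* 1 */
        printf("alone in slot:    %d\n", good_depopulation_target(&c, 1));      /* 0 */
        return 0;
}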
Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation
On Mon, Mar 29, 2021 at 11:29:22AM -0700, Roman Gushchin wrote: > On Mon, Mar 29, 2021 at 05:20:55PM +0000, Dennis Zhou wrote: > > On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote: > > > This patch implements partial depopulation of percpu chunks. > > > > > > As now, a chunk can be depopulated only as a part of the final > > > destruction, when there are no more outstanding allocations. However > > > to minimize a memory waste, it might be useful to depopulate a > > > partially filed chunk, if a small number of outstanding allocations > > > prevents the chunk from being reclaimed. > > > > > > This patch implements the following depopulation process: it scans > > > over the chunk pages, looks for a range of empty and populated pages > > > and performs the depopulation. To avoid races with new allocations, > > > the chunk is previously isolated. After the depopulation the chunk is > > > returned to the original slot (but is appended to the tail of the list > > > to minimize the chances of population). > > > > > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(), > > > the chunk can be concurrently moved to a different slot. So we need > > > to isolate it again on each step. pcpu_alloc_mutex is held, so the > > > chunk can't be populated/depopulated asynchronously. > > > > > > Signed-off-by: Roman Gushchin > > > --- > > > mm/percpu.c | 90 + > > > 1 file changed, 90 insertions(+) > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > index 6596a0a4286e..78c55c73fa28 100644 > > > --- a/mm/percpu.c > > > +++ b/mm/percpu.c > > > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum > > > pcpu_chunk_type type) > > > mutex_unlock(&pcpu_alloc_mutex); > > > } > > > > > > +/** > > > + * pcpu_shrink_populated - scan chunks and release unused pages to the > > > system > > > + * @type: chunk type > > > + * > > > + * Scan over all chunks, find those marked with the depopulate flag and > > > + * try to release unused pages to the system. On every attempt clear the > > > + * chunk's depopulate flag to avoid wasting CPU by scanning the same > > > + * chunk again and again. > > > + */ > > > +static void pcpu_shrink_populated(enum pcpu_chunk_type type) > > > +{ > > > + struct list_head *pcpu_slot = pcpu_chunk_list(type); > > > + struct pcpu_chunk *chunk; > > > + int slot, i, off, start; > > > + > > > + spin_lock_irq(&pcpu_lock); > > > + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) { > > > +restart: > > > + list_for_each_entry(chunk, &pcpu_slot[slot], list) { > > > + bool isolated = false; > > > + > > > + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) > > > + break; > > > + > > > > Deallocation makes me a little worried for the atomic case as now we > > could in theory pathologically scan deallocated chunks before finding a > > populated one. > > > > I wonder if we should do something like once a chunk gets depopulated, > > it gets deprioritized and then only once we exhaust looking through > > allocated chunks we then find a depopulated chunk and add it back into > > the rotation. Possibly just add another set of slots? I guess it adds a > > few dimensions to pcpu_slots after the memcg change. > > Please, take a look at patch 3 in the series ("percpu: on demand chunk > depopulation"). > Chunks considered to be a good target for the depopulation are in advance > marked with a special flag, so we'll actually try to depopulate only > few chunks at once. While the total number of chunks is fairly low, > I think it should work. 
> > Another option is to link all such chunks into a list and scan over it, > instead of iterating over all slots. > > Adding new dimensions to pcpu_slots is an option too, but I hope we can avoid > this, as it would complicate the code. > Yeah, depopulation has been on the todo list for a while. It adds the dimension/opportunity of bin packing by sidelining chunks and I'm wondering if that is the right thing to do. Do you have a rough idea of the distribution of # of chunks you're seeing? > > > > > + for (i = 0, start = -1; i < chunk->nr_pages; i++) { > > > +
Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation
On Mon, Mar 29, 2021 at 01:10:10PM -0700, Roman Gushchin wrote: > On Mon, Mar 29, 2021 at 07:21:24PM +0000, Dennis Zhou wrote: > > On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote: > > > To return unused memory to the system schedule an async > > > depopulation of percpu chunks. > > > > > > To balance between scanning too much and creating an overhead because > > > of the pcpu_lock contention and scanning not enough, let's track an > > > amount of chunks to scan and mark chunks which are potentially a good > > > target for the depopulation with a new boolean flag. The async > > > depopulation work will clear the flag after trying to depopulate a > > > chunk (successfully or not). > > > > > > This commit suggest the following logic: if a chunk > > > 1) has more than 1/4 of total pages free and populated > > > 2) isn't a reserved chunk > > > 3) isn't entirely free > > > 4) isn't alone in the corresponding slot > > > > I'm not sure I like the check for alone that much. The reason being what > > about some odd case where each slot has a single chunk, but every slot > > is populated. It doesn't really make sense to keep them all around. > > Yeah, I agree, I'm not sure either. Maybe we can just look at the total > number of populated empty pages and make sure it's not too low and not > too high. Btw, we should probably double PCPU_EMPTY_POP_PAGES_LOW/HIGH > if memcg accounting is on. > Hmmm. pcpu_nr_populated and pcpu_nr_empty_pop_pages should probably be per chunk type now that you mention it. > > > > I think there is some decision making we can do here to handle packing > > post depopulation allocations into a handful of chunks. Depopulated > > chunks could be sidelined with say a flag ->depopulated to prevent the > > first attempt of allocations from using them. And then we could bring > > back a chunk 1 by 1 somehow to attempt to suffice the allocation. > > I'm not too sure if this is a good idea, just a thought. > > I thought about it in this way: depopulated chunks are not different to > new chunks, which are not yet fully populated. And they are naturally > de-prioritized by being located in higher slots (and at the tail of the list). > So I'm not sure we should handle them any special. > I'm thinking of the following. Imagine 3 chunks, A and B in slot X, and C in slot X+1. If B gets depopulated followed by A getting exhausted, which chunk B or C should be used? If C is fully populated, we might want to use that one. I see that the priority is chunks at the very end, but I don't want to take something that doesn't reasonable generalize to any slot PAGE_SIZE and up. Or it should explicitly try to tackle only say the last N slots (but preferably the former). > > > > > it's a good target for depopulation. > > > > > > If there are 2 or more of such chunks, an async depopulation > > > is scheduled. > > > > > > Because chunk population and depopulation are opposite processes > > > which make a little sense together, split out the shrinking part of > > > pcpu_balance_populated() into pcpu_grow_populated() and make > > > pcpu_balance_populated() calling into pcpu_grow_populated() or > > > pcpu_shrink_populated() conditionally. 
> > > > > > Signed-off-by: Roman Gushchin > > > --- > > > mm/percpu-internal.h | 1 + > > > mm/percpu.c | 111 --- > > > 2 files changed, 85 insertions(+), 27 deletions(-) > > > > > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > > > index 18b768ac7dca..1c5b92af02eb 100644 > > > --- a/mm/percpu-internal.h > > > +++ b/mm/percpu-internal.h > > > @@ -67,6 +67,7 @@ struct pcpu_chunk { > > > > > > void*data; /* chunk data */ > > > boolimmutable; /* no [de]population allowed */ > > > + booldepopulate; /* depopulation hint */ > > > int start_offset; /* the overlap with the previous > > > region to have a page aligned > > > base_addr */ > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > index 015d076893f5..148137f0fc0b 100644 > > > --- a/mm/percpu.c > > > +++ b/mm/percpu.c > > > @@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks); > > > */
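As a reading aid, the four conditions in the commit message can be written down as a single predicate. The snippet below is a userspace illustration only; the struct is a made-up stand-in for struct pcpu_chunk and the field names are approximations rather than the kernel's.

#include <stdbool.h>
#include <stdio.h>

struct chunk {
        int nr_pages;                   /* total pages in the chunk */
        int nr_empty_pop_pages;         /* populated pages with no allocations */
        bool reserved;                  /* the special reserved chunk */
        bool entirely_free;             /* no allocations at all (freed elsewhere) */
        int nr_chunks_in_slot;          /* chunks sharing this chunk's slot */
};

static bool depopulation_candidate(const struct chunk *c)
{
        if (c->reserved)                        /* 2) isn't a reserved chunk */
                return false;
        if (c->entirely_free)                   /* 3) isn't entirely free */
                return false;
        if (c->nr_chunks_in_slot < 2)           /* 4) isn't alone in its slot */
                return false;
        /* 1) more than 1/4 of the pages are free and populated */
        return c->nr_empty_pop_pages > c->nr_pages / 4;
}

int main(void)
{
        struct chunk c = {
                .nr_pages = 8, .nr_empty_pop_pages = 3,
                .reserved = false, .entirely_free = false,
                .nr_chunks_in_slot = 3,
        };

        printf("depopulation candidate: %s\n",
               depopulation_candidate(&c) ? "yes" : "no");
        return 0;
}

In the patch, two or more chunks passing such a test is what triggers scheduling of the async depopulation work.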
Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap
On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote: > Dennis Zhou writes: > > > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote: > >> Dennis Zhou writes: > >> > >> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote: > >> >> Dennis Zhou writes: > >> >> > >> >> > Hello, > >> >> > > >> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote: > >> >> >> Miaohe Lin writes: > >> >> >> > >> >> >> > On 2021/4/14 9:17, Huang, Ying wrote: > >> >> >> >> Miaohe Lin writes: > >> >> >> >> > >> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote: > >> >> >> >>>> "Huang, Ying" writes: > >> >> >> >>>> > >> >> >> >>>>> Miaohe Lin writes: > >> >> >> >>>>> > >> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent > >> >> >> >>>>>> swapoff. This > >> >> >> >>>>>> patch adds the percpu_ref support for later fixup. > >> >> >> >>>>>> > >> >> >> >>>>>> Signed-off-by: Miaohe Lin > >> >> >> >>>>>> --- > >> >> >> >>>>>> include/linux/swap.h | 2 ++ > >> >> >> >>>>>> mm/swapfile.c| 25 ++--- > >> >> >> >>>>>> 2 files changed, 24 insertions(+), 3 deletions(-) > >> >> >> >>>>>> > >> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h > >> >> >> >>>>>> index 144727041e78..849ba5265c11 100644 > >> >> >> >>>>>> --- a/include/linux/swap.h > >> >> >> >>>>>> +++ b/include/linux/swap.h > >> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list { > >> >> >> >>>>>> * The in-memory structure used to track swap areas. > >> >> >> >>>>>> */ > >> >> >> >>>>>> struct swap_info_struct { > >> >> >> >>>>>> + struct percpu_ref users;/* serialization > >> >> >> >>>>>> against concurrent swapoff */ > >> >> >> >>>>>> unsigned long flags; /* SWP_USED etc: see > >> >> >> >>>>>> above */ > >> >> >> >>>>>> signed shortprio; /* swap priority of > >> >> >> >>>>>> this type */ > >> >> >> >>>>>> struct plist_node list; /* entry in > >> >> >> >>>>>> swap_active_head */ > >> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct { > >> >> >> >>>>>> struct block_device *bdev; /* swap device or bdev > >> >> >> >>>>>> of swap file */ > >> >> >> >>>>>> struct file *swap_file; /* seldom referenced */ > >> >> >> >>>>>> unsigned int old_block_size;/* seldom referenced */ > >> >> >> >>>>>> + struct completion comp; /* seldom referenced */ > >> >> >> >>>>>> #ifdef CONFIG_FRONTSWAP > >> >> >> >>>>>> unsigned long *frontswap_map; /* frontswap in-use, > >> >> >> >>>>>> one bit per page */ > >> >> >> >>>>>> atomic_t frontswap_pages; /* frontswap pages > >> >> >> >>>>>> in-use counter */ > >> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c > >> >> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644 > >> >> >> >>>>>> --- a/mm/swapfile.c > >> >> >> >>>>>> +++ b/mm/swapfile.c > >> >> >> >>>>>>
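The diff above is easier to digest with the percpu_ref life cycle in front of you. The toy below is a plain C11 model of the state machine only; the type and function names are invented, and the real percpu_ref uses per-CPU counters and RCU so that tryget_live is cheap and race-free, which this single-threaded toy does not attempt. A reader takes a reference while the ref is live, swapoff kills it, and the release callback, which the patch wires to complete(&si->comp), fires once the last reference is dropped.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct sref {
        _Atomic long count;             /* one "initial" ref plus one per user */
        atomic_bool dead;               /* set by kill(); tryget_live() then fails */
        void (*release)(struct sref *ref);
};

static void sref_init(struct sref *ref, void (*release)(struct sref *))
{
        atomic_store(&ref->count, 1);
        atomic_store(&ref->dead, false);
        ref->release = release;
}

static bool sref_tryget_live(struct sref *ref)
{
        if (atomic_load(&ref->dead))
                return false;
        atomic_fetch_add(&ref->count, 1);
        return true;
}

static void sref_put(struct sref *ref)
{
        if (atomic_fetch_sub(&ref->count, 1) == 1)
                ref->release(ref);      /* last reference is gone */
}

static void sref_kill(struct sref *ref)
{
        atomic_store(&ref->dead, true);
        sref_put(ref);                  /* drop the initial reference */
}

static void release(struct sref *ref)
{
        (void)ref;
        /* the kernel patch completes si->comp at this point */
        printf("all users gone, swapoff may tear the device down\n");
}

int main(void)
{
        struct sref users;

        sref_init(&users, release);

        if (sref_tryget_live(&users))   /* get_swap_device() */
                sref_put(&users);       /* put_swap_device() */

        sref_kill(&users);              /* swapoff */
        printf("tryget after kill: %s\n",
               sref_tryget_live(&users) ? "succeeds" : "fails");
        return 0;
}

In the kernel patch it is the completion, rather than a printf, that lets swapoff wait for in-flight get_swap_device() users before tearing down the swap metadata.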
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
Hello, On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > Hello Roman, > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup. > > My results of the percpu_test are as follows: > Intel KVM 4CPU:4G > Vanilla 5.12-rc6 > # ./percpu_test.sh > Percpu: 1952 kB > Percpu: 219648 kB > Percpu: 219648 kB > > 5.12-rc6 + with patchset applied > # ./percpu_test.sh > Percpu: 2080 kB > Percpu: 219712 kB > Percpu: 72672 kB > > I'm able to see improvement comparable to that of what you're see too. > > However, on POWERPC I'm unable to reproduce these improvements with the > patchset in the same configuration > > POWER9 KVM 4CPU:4G > Vanilla 5.12-rc6 > # ./percpu_test.sh > Percpu: 5888 kB > Percpu: 118272 kB > Percpu: 118272 kB > > 5.12-rc6 + with patchset applied > # ./percpu_test.sh > Percpu: 6144 kB > Percpu: 119040 kB > Percpu: 119040 kB > > I'm wondering if there's any architectural specific code that needs plumbing > here? > There shouldn't be. Can you send me the percpu_stats debug output before and after? > I will also look through the code to find the reason why POWER isn't > depopulating pages. > > Thank you, > Pratik > > On 08/04/21 9:27 am, Roman Gushchin wrote: > > In our production experience the percpu memory allocator is sometimes > > struggling > > with returning the memory to the system. A typical example is a creation of > > several thousands memory cgroups (each has several chunks of the percpu data > > used for vmstats, vmevents, ref counters etc). Deletion and complete > > releasing > > of these cgroups doesn't always lead to a shrinkage of the percpu memory, > > so that sometimes there are several GB's of memory wasted. > > > > The underlying problem is the fragmentation: to release an underlying chunk > > all percpu allocations should be released first. The percpu allocator tends > > to top up chunks to improve the utilization. It means new small-ish > > allocations > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks, > > effectively pinning them in memory. > > > > This patchset solves this problem by implementing a partial depopulation > > of percpu chunks: chunks with many empty pages are being asynchronously > > depopulated and the pages are returned to the system. > > > > To illustrate the problem the following script can be used: > > > > -- > > #!/bin/bash > > > > cd /sys/fs/cgroup > > > > mkdir percpu_test > > echo "+memory" > percpu_test/cgroup.subtree_control > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > mkdir percpu_test/cg_"${i}" > > for j in `seq 1 10`; do > > mkdir percpu_test/cg_"${i}"_"${j}" > > done > > done > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > for j in `seq 1 10`; do > > rmdir percpu_test/cg_"${i}"_"${j}" > > done > > done > > > > sleep 10 > > > > cat /proc/meminfo | grep Percpu > > > > for i in `seq 1 1000`; do > > rmdir percpu_test/cg_"${i}" > > done > > > > rmdir percpu_test > > -- > > > > It creates 11000 memory cgroups and removes every 10 out of 11. > > It prints the initial size of the percpu memory, the size after > > creating all cgroups and the size after deleting most of them. > > > > Results: > >vanilla: > > ./percpu_test.sh > > Percpu: 7488 kB > > Percpu: 481152 kB > > Percpu: 481152 kB > > > >with this patchset applied: > > ./percpu_test.sh > > Percpu: 7488 kB > > Percpu: 481408 kB > > Percpu: 135552 kB > > > > So the total size of the percpu memory was reduced by more than 3.5 times. 
> > v3:
> >   - introduced pcpu_check_chunk_hint()
> >   - fixed a bug related to the hint check
> >   - minor cosmetic changes
> >   - s/pretends/fixes (cc Vlastimil)
> >
> > v2:
> >   - depopulated chunks are sidelined
> >   - depopulation happens in the reverse order
> >   - depopulate list made per-chunk type
> >   - better results due to better heuristics
> >
> > v1:
> >   - depopulation heuristics changed and optimized
> >   - chunks are put into a separate list, depopulation scan this list
> >   - chunk->isolated is introduced, chunk->depopulate is dropped
> >   - rearranged patches a bit
> >   - fixed a panic discovered by krobot
> >   - made pcpu_nr_empty_pop_pages per chunk type
> >   - minor fixes
> >
> > rfc:
> >   https://lwn.net/Articles/850508/
> >
> >
> > Roman Gushchin (6):
> >   percpu: fix a comment about the chunks ordering
> >   percpu: split __pcpu_balance_workfn()
> >   percpu: make pcpu_nr_empty_pop_pages per chunk type
> >   percpu: generalize pcpu_balance_populated()
> >   percpu: factor out pcpu_check_chunk_hint()
> >   percpu: implement partial chunk depopulation
> >
> >  mm/percpu-internal.h | 4 +
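Since the comparison above rests on /proc/meminfo plus the percpu_stats dump Dennis asked for, a small helper can capture both before and after running the test script. It assumes CONFIG_PERCPU_STATS is enabled and debugfs is mounted at /sys/kernel/debug (run as root); this is only a convenience sketch, not part of the series.

#include <stdio.h>
#include <string.h>

static void dump(const char *path, const char *filter)
{
        char line[512];
        FILE *f = fopen(path, "r");

        if (!f) {
                fprintf(stderr, "cannot open %s\n", path);
                return;
        }
        while (fgets(line, sizeof(line), f))
                if (!filter || strstr(line, filter))
                        fputs(line, stdout);
        fclose(f);
}

int main(void)
{
        dump("/proc/meminfo", "Percpu");                /* the lines the script greps */
        dump("/sys/kernel/debug/percpu_stats", NULL);   /* full allocator statistics */
        return 0;
}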
Re: [PATCH v3 1/6] percpu: fix a comment about the chunks ordering
Hello, On Wed, Apr 07, 2021 at 08:57:31PM -0700, Roman Gushchin wrote: > Since the commit 3e54097beb22 ("percpu: manage chunks based on > contig_bits instead of free_bytes") chunks are sorted based on the > size of the biggest continuous free area instead of the total number > of free bytes. Update the corresponding comment to reflect this. > > Signed-off-by: Roman Gushchin > --- > mm/percpu.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 6596a0a4286e..2f27123bb489 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -99,7 +99,10 @@ > > #include "percpu-internal.h" > > -/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */ > +/* > + * The slots are sorted by the size of the biggest continuous free area. > + * 1-31 bytes share the same slot. > + */ > #define PCPU_SLOT_BASE_SHIFT 5 > /* chunks in slots below this are subject to being sidelined on failed alloc > */ > #define PCPU_SLOT_FAIL_THRESHOLD 3 > -- > 2.30.2 > I've applied this to for-5.14. Thanks, Dennis
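To see why the corrected comment matters, it helps to picture the bucketing. The toy below is not the kernel's __pcpu_size_to_slot() rounding, just a rough power-of-two bucketing of the largest contiguous free area; it shows how a chunk with more total free bytes can still land in a lower slot than one whose free space is contiguous, which is the ordering the updated comment describes.

#include <stdio.h>

/* toy bucketing: roughly one slot per power of two of contiguous free space */
static int toy_slot(int contig_free)
{
        int slot = 0;

        while (contig_free >= 32) {     /* small areas collapse into slot 0 */
                contig_free >>= 1;
                slot++;
        }
        return slot;
}

int main(void)
{
        /* chunk A: 96 free bytes total, but the largest run is only 24 bytes */
        /* chunk B: 48 free bytes total, all in one contiguous run */
        printf("chunk A (contig 24) -> slot %d\n", toy_slot(24));
        printf("chunk B (contig 48) -> slot %d\n", toy_slot(48));
        return 0;
}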
Re: [PATCH v3 2/6] percpu: split __pcpu_balance_workfn()
Hello, On Wed, Apr 07, 2021 at 08:57:32PM -0700, Roman Gushchin wrote: > __pcpu_balance_workfn() became fairly big and hard to follow, but in > fact it consists of two fully independent parts, responsible for > the destruction of excessive free chunks and population of necessarily > amount of free pages. > > In order to simplify the code and prepare for adding of a new > functionality, split it in two functions: > > 1) pcpu_balance_free, > 2) pcpu_balance_populated. > > Move the taking/releasing of the pcpu_alloc_mutex to an upper level > to keep the current synchronization in place. > > Signed-off-by: Roman Gushchin > Reviewed-by: Dennis Zhou > --- > mm/percpu.c | 46 +- > 1 file changed, 29 insertions(+), 17 deletions(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 2f27123bb489..7e31e1b8725f 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1933,31 +1933,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, > size_t align) > } > > /** > - * __pcpu_balance_workfn - manage the amount of free chunks and populated > pages > + * pcpu_balance_free - manage the amount of free chunks > * @type: chunk type > * > - * Reclaim all fully free chunks except for the first one. This is also > - * responsible for maintaining the pool of empty populated pages. However, > - * it is possible that this is called when physical memory is scarce causing > - * OOM killer to be triggered. We should avoid doing so until an actual > - * allocation causes the failure as it is possible that requests can be > - * serviced from already backed regions. > + * Reclaim all fully free chunks except for the first one. > */ > -static void __pcpu_balance_workfn(enum pcpu_chunk_type type) > +static void pcpu_balance_free(enum pcpu_chunk_type type) > { > - /* gfp flags passed to underlying allocators */ > - const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > LIST_HEAD(to_free); > struct list_head *pcpu_slot = pcpu_chunk_list(type); > struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1]; > struct pcpu_chunk *chunk, *next; > - int slot, nr_to_pop, ret; > > /* >* There's no reason to keep around multiple unused chunks and VM >* areas can be scarce. Destroy all free chunks except for one. >*/ > - mutex_lock(&pcpu_alloc_mutex); > spin_lock_irq(&pcpu_lock); > > list_for_each_entry_safe(chunk, next, free_head, list) { > @@ -1985,6 +1976,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type > type) > pcpu_destroy_chunk(chunk); > cond_resched(); > } > +} > + > +/** > + * pcpu_balance_populated - manage the amount of populated pages > + * @type: chunk type > + * > + * Maintain a certain amount of populated pages to satisfy atomic > allocations. > + * It is possible that this is called when physical memory is scarce causing > + * OOM killer to be triggered. We should avoid doing so until an actual > + * allocation causes the failure as it is possible that requests can be > + * serviced from already backed regions. 
> + */ > +static void pcpu_balance_populated(enum pcpu_chunk_type type) > +{ > + /* gfp flags passed to underlying allocators */ > + const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > + struct list_head *pcpu_slot = pcpu_chunk_list(type); > + struct pcpu_chunk *chunk; > + int slot, nr_to_pop, ret; > > /* >* Ensure there are certain number of free populated pages for > @@ -2054,22 +2064,24 @@ static void __pcpu_balance_workfn(enum > pcpu_chunk_type type) > goto retry_pop; > } > } > - > - mutex_unlock(&pcpu_alloc_mutex); > } > > /** > * pcpu_balance_workfn - manage the amount of free chunks and populated pages > * @work: unused > * > - * Call __pcpu_balance_workfn() for each chunk type. > + * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type. > */ > static void pcpu_balance_workfn(struct work_struct *work) > { > enum pcpu_chunk_type type; > > - for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) > - __pcpu_balance_workfn(type); > + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { > + mutex_lock(&pcpu_alloc_mutex); > + pcpu_balance_free(type); > + pcpu_balance_populated(type); > + mutex_unlock(&pcpu_alloc_mutex); > + } > } > > /** > -- > 2.30.2 > > I've applied this to for-5.14. Thanks, Dennis
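The shape of the refactor, with the mutex hoisted into the caller and one free/populate pass per chunk type, can be restated in a few lines of userspace C; the function names mirror the patch and the bodies are obviously stubs.

#include <pthread.h>
#include <stdio.h>

enum chunk_type { PCPU_CHUNK_ROOT, PCPU_CHUNK_MEMCG, PCPU_NR_CHUNK_TYPES };

static pthread_mutex_t alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

static void balance_free(int type)
{
        printf("type %d: destroy fully free chunks, keep one around\n", type);
}

static void balance_populated(int type)
{
        printf("type %d: maintain the pool of empty populated pages\n", type);
}

static void balance_workfn(void)
{
        for (int type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
                pthread_mutex_lock(&alloc_mutex);       /* was taken inside before */
                balance_free(type);
                balance_populated(type);
                pthread_mutex_unlock(&alloc_mutex);
        }
}

int main(void)
{
        balance_workfn();
        return 0;
}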
Re: [PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type
Hello, On Wed, Apr 07, 2021 at 08:57:33PM -0700, Roman Gushchin wrote: > nr_empty_pop_pages is used to guarantee that there are some free > populated pages to satisfy atomic allocations. Accounted and > non-accounted allocations are using separate sets of chunks, > so both need to have a surplus of empty pages. > > This commit makes pcpu_nr_empty_pop_pages and the corresponding logic > per chunk type. > > Signed-off-by: Roman Gushchin > --- > mm/percpu-internal.h | 2 +- > mm/percpu-stats.c| 9 +++-- > mm/percpu.c | 14 +++--- > 3 files changed, 15 insertions(+), 10 deletions(-) > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h > index 18b768ac7dca..095d7eaa0db4 100644 > --- a/mm/percpu-internal.h > +++ b/mm/percpu-internal.h > @@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock; > > extern struct list_head *pcpu_chunk_lists; > extern int pcpu_nr_slots; > -extern int pcpu_nr_empty_pop_pages; > +extern int pcpu_nr_empty_pop_pages[]; > > extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c > index c8400a2adbc2..f6026dbcdf6b 100644 > --- a/mm/percpu-stats.c > +++ b/mm/percpu-stats.c > @@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v) > int slot, max_nr_alloc; > int *buffer; > enum pcpu_chunk_type type; > + int nr_empty_pop_pages; > > alloc_buffer: > spin_lock_irq(&pcpu_lock); > @@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v) > goto alloc_buffer; > } > > -#define PL(X) \ > + nr_empty_pop_pages = 0; > + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) > + nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type]; > + > +#define PL(X) > \ > seq_printf(m, " %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X) > > seq_printf(m, > @@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v) > PU(nr_max_chunks); > PU(min_alloc_size); > PU(max_alloc_size); > - P("empty_pop_pages", pcpu_nr_empty_pop_pages); > + P("empty_pop_pages", nr_empty_pop_pages); > seq_putc(m, '\n'); > > #undef PU > diff --git a/mm/percpu.c b/mm/percpu.c > index 7e31e1b8725f..61339b3d9337 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -176,10 +176,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* > chunk list slots */ > static LIST_HEAD(pcpu_map_extend_chunks); > > /* > - * The number of empty populated pages, protected by pcpu_lock. The > - * reserved chunk doesn't contribute to the count. > + * The number of empty populated pages by chunk type, protected by pcpu_lock. > + * The reserved chunk doesn't contribute to the count. 
> */ > -int pcpu_nr_empty_pop_pages; > +int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES]; > > /* > * The number of populated pages in use by the allocator, protected by > @@ -559,7 +559,7 @@ static inline void pcpu_update_empty_pages(struct > pcpu_chunk *chunk, int nr) > { > chunk->nr_empty_pop_pages += nr; > if (chunk != pcpu_reserved_chunk) > - pcpu_nr_empty_pop_pages += nr; > + pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr; > } > > /* > @@ -1835,7 +1835,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t > align, bool reserved, > mutex_unlock(&pcpu_alloc_mutex); > } > > - if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW) > + if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW) > pcpu_schedule_balance_work(); > > /* clear the areas and return address relative to base address */ > @@ -2013,7 +2013,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type > type) > pcpu_atomic_alloc_failed = false; > } else { > nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH - > - pcpu_nr_empty_pop_pages, > + pcpu_nr_empty_pop_pages[type], > 0, PCPU_EMPTY_POP_PAGES_HIGH); > } > > @@ -2595,7 +2595,7 @@ void __init pcpu_setup_first_chunk(const struct > pcpu_alloc_info *ai, > > /* link the first chunk in */ > pcpu_first_chunk = chunk; > - pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages; > + pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = > pcpu_first_chunk->nr_empty_pop_pages; > pcpu_chunk_relocate(pcpu_first_chunk, -1); > > /* include all regions of the first chunk */ > -- > 2.30.2 > This turns out to have been a more pressing issue. Thanks for fixing this. I ran this to Linus for v5.12-rc7 [1]. https://lore.kernel.org/lkml/yhhs618esvkhy...@google.com/ Thanks, Dennis
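The essence of the change, shown here as a userspace illustration rather than kernel code, is that the single counter becomes an array indexed by chunk type: the allocator compares its own type against the low watermark, while the stats file keeps printing one number by summing the array. The watermark value below is illustrative.

#include <stdio.h>

enum pcpu_chunk_type { PCPU_CHUNK_ROOT, PCPU_CHUNK_MEMCG, PCPU_NR_CHUNK_TYPES };

#define PCPU_EMPTY_POP_PAGES_LOW 2      /* illustrative watermark */

static int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];

static void update_empty_pages(enum pcpu_chunk_type type, int nr)
{
        pcpu_nr_empty_pop_pages[type] += nr;
        if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
                printf("type %d below watermark: schedule balance work\n", type);
}

static int total_empty_pop_pages(void)  /* what percpu-stats now prints */
{
        int type, sum = 0;

        for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
                sum += pcpu_nr_empty_pop_pages[type];
        return sum;
}

int main(void)
{
        update_empty_pages(PCPU_CHUNK_ROOT, 4);
        update_empty_pages(PCPU_CHUNK_MEMCG, 1);        /* trips the watermark */
        printf("empty_pop_pages: %d\n", total_empty_pop_pages());
        return 0;
}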
Re: [PATCH v3 4/6] percpu: generalize pcpu_balance_populated()
Hello, On Wed, Apr 07, 2021 at 08:57:34PM -0700, Roman Gushchin wrote: > To prepare for the depopulation of percpu chunks, split out the > populating part of the pcpu_balance_populated() into the new > pcpu_grow_populated() (with an intention to add > pcpu_shrink_populated() in the next commit). > > The goal of pcpu_balance_populated() is to determine whether > there is a shortage or an excessive amount of empty percpu pages > and call into the corresponding function. > > pcpu_grow_populated() takes a desired number of pages as an argument > (nr_to_pop). If it creates a new chunk, nr_to_pop should be updated > to reflect that the new chunk could be created already populated. > Otherwise an infinite loop might appear. > > Signed-off-by: Roman Gushchin > --- > mm/percpu.c | 63 + > 1 file changed, 39 insertions(+), 24 deletions(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 61339b3d9337..e20119668c42 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1979,7 +1979,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type) > } > > /** > - * pcpu_balance_populated - manage the amount of populated pages > + * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations > * @type: chunk type > * > * Maintain a certain amount of populated pages to satisfy atomic > allocations. > @@ -1988,35 +1988,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type > type) > * allocation causes the failure as it is possible that requests can be > * serviced from already backed regions. > */ > -static void pcpu_balance_populated(enum pcpu_chunk_type type) > +static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop) > { > /* gfp flags passed to underlying allocators */ > const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; > struct list_head *pcpu_slot = pcpu_chunk_list(type); > struct pcpu_chunk *chunk; > - int slot, nr_to_pop, ret; > + int slot, ret; > > - /* > - * Ensure there are certain number of free populated pages for > - * atomic allocs. Fill up from the most packed so that atomic > - * allocs don't increase fragmentation. If atomic allocation > - * failed previously, always populate the maximum amount. This > - * should prevent atomic allocs larger than PAGE_SIZE from keeping > - * failing indefinitely; however, large atomic allocs are not > - * something we support properly and can be highly unreliable and > - * inefficient. > - */ > retry_pop: > - if (pcpu_atomic_alloc_failed) { > - nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH; > - /* best effort anyway, don't worry about synchronization */ > - pcpu_atomic_alloc_failed = false; > - } else { > - nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH - > - pcpu_nr_empty_pop_pages[type], > - 0, PCPU_EMPTY_POP_PAGES_HIGH); > - } > - > for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) > { > unsigned int nr_unpop = 0, rs, re; > > @@ -2060,12 +2040,47 @@ static void pcpu_balance_populated(enum > pcpu_chunk_type type) > if (chunk) { > spin_lock_irq(&pcpu_lock); > pcpu_chunk_relocate(chunk, -1); > + nr_to_pop = max_t(int, 0, nr_to_pop - > chunk->nr_populated); > spin_unlock_irq(&pcpu_lock); > - goto retry_pop; > + if (nr_to_pop) > + goto retry_pop; > } > } > } > > +/** > + * pcpu_balance_populated - manage the amount of populated pages > + * @type: chunk type > + * > + * Populate or depopulate chunks to maintain a certain amount > + * of free pages to satisfy atomic allocations, but not waste > + * large amounts of memory. 
> + */ > +static void pcpu_balance_populated(enum pcpu_chunk_type type) > +{ > + int nr_to_pop; > + > + /* > + * Ensure there are certain number of free populated pages for > + * atomic allocs. Fill up from the most packed so that atomic > + * allocs don't increase fragmentation. If atomic allocation > + * failed previously, always populate the maximum amount. This > + * should prevent atomic allocs larger than PAGE_SIZE from keeping > + * failing indefinitely; however, large atomic allocs are not > + * something we support properly and can be highly unreliable and > + * inefficient. > + */ > + if (pcpu_atomic_alloc_failed) { > + nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH; > + /* best effort anyway, don't worry about synchronization */ > + pcpu_atomic_alloc_failed = false; > + pcpu_grow_populated(type, nr_to_pop); > + } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) { > + nr_to_po
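The control flow after the split is compact enough to restate as a standalone sketch (userspace, with made-up numbers): an atomic-allocation failure requests the full high watermark, otherwise only the shortfall is requested, and pages that arrive already populated with a newly created chunk are subtracted so the retry loop cannot spin forever.

#include <stdio.h>

#define PCPU_EMPTY_POP_PAGES_HIGH 4     /* illustrative watermark */

static int clamp_int(int v, int lo, int hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
        int empty_pop_pages = 1;        /* current surplus for this chunk type */
        int atomic_alloc_failed = 0;    /* set when an atomic alloc bounced */
        int nr_to_pop;

        if (atomic_alloc_failed)
                nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;  /* populate the maximum */
        else
                nr_to_pop = clamp_int(PCPU_EMPTY_POP_PAGES_HIGH - empty_pop_pages,
                                      0, PCPU_EMPTY_POP_PAGES_HIGH);
        printf("ask pcpu_grow_populated() for %d page(s)\n", nr_to_pop);

        /* a freshly created chunk already came with two populated pages */
        nr_to_pop = clamp_int(nr_to_pop - 2, 0, PCPU_EMPTY_POP_PAGES_HIGH);
        printf("left to populate on retry: %d\n", nr_to_pop);
        return 0;
}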
Re: [PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint()
Hello, On Wed, Apr 07, 2021 at 08:57:35PM -0700, Roman Gushchin wrote: > Factor out the pcpu_check_chunk_hint() helper, which will be useful > in the future. The new function checks if the allocation can likely > fit the given chunk. > > Signed-off-by: Roman Gushchin > --- > mm/percpu.c | 30 +- > 1 file changed, 21 insertions(+), 9 deletions(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index e20119668c42..357fd6994278 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -306,6 +306,26 @@ static unsigned long pcpu_block_off_to_off(int index, > int off) > return index * PCPU_BITMAP_BLOCK_BITS + off; > } > > +/** > + * pcpu_check_chunk_hint - check that allocation can fit a chunk > + * @chunk_md: chunk's block nit for consistency: @block: block of interest > + * @bits: size of request in allocation units > + * @align: alignment of area (max PAGE_SIZE) > + * > + * Check to see if the allocation can fit in the chunk's contig hint. > + * This is an optimization to prevent scanning by assuming if it > + * cannot fit in the global hint, there is memory pressure and creating > + * a new chunk would happen soon. > + */ It occurred to me, That I converged block_md and chunk_md to be the same object as 1 is just a degenerative case of the other. Can we rename this to be pcpu_check_block_hint() and have it take in pcpu_block_md? > +static bool pcpu_check_chunk_hint(struct pcpu_block_md *chunk_md, int bits, > + size_t align) > +{ > + int bit_off = ALIGN(chunk_md->contig_hint_start, align) - > + chunk_md->contig_hint_start; > + > + return bit_off + bits <= chunk_md->contig_hint; > +} > + > /* > * pcpu_next_hint - determine which hint to use > * @block: block of interest > @@ -1065,15 +1085,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk > *chunk, int alloc_bits, > struct pcpu_block_md *chunk_md = &chunk->chunk_md; > int bit_off, bits, next_off; > > - /* > - * Check to see if the allocation can fit in the chunk's contig hint. > - * This is an optimization to prevent scanning by assuming if it > - * cannot fit in the global hint, there is memory pressure and creating > - * a new chunk would happen soon. > - */ > - bit_off = ALIGN(chunk_md->contig_hint_start, align) - > - chunk_md->contig_hint_start; > - if (bit_off + alloc_bits > chunk_md->contig_hint) > + if (!pcpu_check_chunk_hint(chunk_md, alloc_bits, align)) > return -1; > > bit_off = pcpu_next_hint(chunk_md, alloc_bits); > -- > 2.30.2 > Thanks, Dennis
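For reference, the factored-out check is small enough to run standalone. The sketch below mirrors the quoted helper in plain C, with offsets and sizes kept in the allocator's abstract units and the numbers invented: the start of the contig hint is rounded up to the requested alignment, and the request must still fit inside the hint.

#include <stdbool.h>
#include <stdio.h>

#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))  /* a must be a power of two */

struct block_md {
        int contig_hint;        /* size of the largest free run */
        int contig_hint_start;  /* offset where that run starts */
};

static bool check_block_hint(const struct block_md *md, int bits, int align)
{
        int bit_off = ALIGN_UP(md->contig_hint_start, align) -
                      md->contig_hint_start;

        return bit_off + bits <= md->contig_hint;
}

int main(void)
{
        struct block_md md = { .contig_hint = 16, .contig_hint_start = 3 };

        /* 12 units at alignment 4: one unit lost to alignment, still fits */
        printf("fits: %d\n", check_block_hint(&md, 12, 4));
        /* 16 units at alignment 4: the alignment padding makes it spill over */
        printf("fits: %d\n", check_block_hint(&md, 16, 4));
        return 0;
}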
Re: [PATCH v3 0/6] percpu: partial chunk depopulation
Hello, On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote: > > > On 17/04/21 12:39 am, Roman Gushchin wrote: > > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote: > > > > > > On 17/04/21 12:04 am, Roman Gushchin wrote: > > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote: > > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote: > > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote: > > > > > > > Hello Dennis, > > > > > > > > > > > > > > I apologize for the clutter of logs before, I'm pasting the logs > > > > > > > of before and > > > > > > > after the percpu test in the case of the patchset being applied > > > > > > > on 5.12-rc6 and > > > > > > > the vanilla kernel 5.12-rc6. > > > > > > > > > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote: > > > > > > > > > Hello Roman, > > > > > > > > > > > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM > > > > > > > > > setup. > > > > > > > > > > > > > > > > > > My results of the percpu_test are as follows: > > > > > > > > > Intel KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 1952 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > Percpu: 219648 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 2080 kB > > > > > > > > > Percpu: 219712 kB > > > > > > > > > Percpu: 72672 kB > > > > > > > > > > > > > > > > > > I'm able to see improvement comparable to that of what you're > > > > > > > > > see too. > > > > > > > > > > > > > > > > > > However, on POWERPC I'm unable to reproduce these > > > > > > > > > improvements with the patchset in the same configuration > > > > > > > > > > > > > > > > > > POWER9 KVM 4CPU:4G > > > > > > > > > Vanilla 5.12-rc6 > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 5888 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > Percpu: 118272 kB > > > > > > > > > > > > > > > > > > 5.12-rc6 + with patchset applied > > > > > > > > > # ./percpu_test.sh > > > > > > > > > Percpu: 6144 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > Percpu: 119040 kB > > > > > > > > > > > > > > > > > > I'm wondering if there's any architectural specific code that > > > > > > > > > needs plumbing > > > > > > > > > here? > > > > > > > > > > > > > > > > > There shouldn't be. Can you send me the percpu_stats debug > > > > > > > > output before > > > > > > > > and after? > > > > > > > I'll paste the whole debug stats before and after here. > > > > > > > 5.12-rc6 + patchset > > > > > > > -BEFORE- > > > > > > > Percpu Memory Statistics > > > > > > > Allocation Info: > > > > > > Hm, this looks highly suspicious. Here is your stats in a more > > > > > > compact form: > > > > > > > > > > > > Vanilla > > > > > > > > > > > > nr_alloc: 903
Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap
Hello, On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote: > Miaohe Lin writes: > > > On 2021/4/14 9:17, Huang, Ying wrote: > >> Miaohe Lin writes: > >> > >>> On 2021/4/12 15:24, Huang, Ying wrote: > "Huang, Ying" writes: > > > Miaohe Lin writes: > > > >> We will use percpu-refcount to serialize against concurrent swapoff. > >> This > >> patch adds the percpu_ref support for later fixup. > >> > >> Signed-off-by: Miaohe Lin > >> --- > >> include/linux/swap.h | 2 ++ > >> mm/swapfile.c| 25 ++--- > >> 2 files changed, 24 insertions(+), 3 deletions(-) > >> > >> diff --git a/include/linux/swap.h b/include/linux/swap.h > >> index 144727041e78..849ba5265c11 100644 > >> --- a/include/linux/swap.h > >> +++ b/include/linux/swap.h > >> @@ -240,6 +240,7 @@ struct swap_cluster_list { > >> * The in-memory structure used to track swap areas. > >> */ > >> struct swap_info_struct { > >> + struct percpu_ref users;/* serialization against > >> concurrent swapoff */ > >>unsigned long flags; /* SWP_USED etc: see above */ > >>signed shortprio; /* swap priority of this type */ > >>struct plist_node list; /* entry in swap_active_head */ > >> @@ -260,6 +261,7 @@ struct swap_info_struct { > >>struct block_device *bdev; /* swap device or bdev of swap > >> file */ > >>struct file *swap_file; /* seldom referenced */ > >>unsigned int old_block_size;/* seldom referenced */ > >> + struct completion comp; /* seldom referenced */ > >> #ifdef CONFIG_FRONTSWAP > >>unsigned long *frontswap_map; /* frontswap in-use, one bit > >> per page */ > >>atomic_t frontswap_pages; /* frontswap pages in-use > >> counter */ > >> diff --git a/mm/swapfile.c b/mm/swapfile.c > >> index 149e77454e3c..724173cd7d0c 100644 > >> --- a/mm/swapfile.c > >> +++ b/mm/swapfile.c > >> @@ -39,6 +39,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> > >> #include > >> #include > >> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct > >> *work) > >>spin_unlock(&si->lock); > >> } > >> > >> +static void swap_users_ref_free(struct percpu_ref *ref) > >> +{ > >> + struct swap_info_struct *si; > >> + > >> + si = container_of(ref, struct swap_info_struct, users); > >> + complete(&si->comp); > >> + percpu_ref_exit(&si->users); > > > > Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in > > get_swap_device(), better to add comments there. > > I just noticed that the comments of percpu_ref_tryget_live() says, > > * This function is safe to call as long as @ref is between init and > exit. > > While we need to call get_swap_device() almost at any time, so it's > better to avoid to call percpu_ref_exit() at all. This will waste some > memory, but we need to follow the API definition to avoid potential > issues in the long term. > >>> > >>> I have to admit that I'am not really familiar with percpu_ref. So I read > >>> the > >>> implementation code of the percpu_ref and found percpu_ref_tryget_live() > >>> could > >>> be called after exit now. But you're right we need to follow the API > >>> definition > >>> to avoid potential issues in the long term. > >>> > > And we need to call percpu_ref_init() before insert the swap_info_struct > into the swap_info[]. > >>> > >>> If we remove the call to percpu_ref_exit(), we should not use > >>> percpu_ref_init() > >>> here because *percpu_ref->data is assumed to be NULL* in > >>> percpu_ref_init() while > >>> this is not the case as we do not call percpu_ref_exit(). Maybe > >>> percpu_ref_reinit() > >>> or percpu_ref_resurrect() will do the work. 
> >>> > >>> One more thing, how could I distinguish the killed percpu_ref from newly > >>> allocated one? > >>> It seems percpu_ref_is_dying is only safe to call when @ref is between > >>> init and exit. > >>> Maybe I could do this in alloc_swap_info()? > >> > >> Yes. In alloc_swap_info(), you can distinguish newly allocated and > >> reused swap_info_struct. > >> > > >> +} > >> + > >> static void alloc_cluster(struct swap_info_struct *si, unsigned long > >> idx) > >> { > >>struct swap_cluster_info *ci = si->cluster_info; > >> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct > >> swap_info_struct *p, int prio, > >> * Guarantee swap_map, cluster_info, etc. fields are valid > >> * between get/put_swap_device() if SWP_VALID bit is set > >> */ > >> - synchronize_rcu(); >
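The init-versus-reuse question in this subthread boils down to a small state machine. The toy below uses invented names, has no concurrency and is not the kernel API; it only shows why skipping percpu_ref_exit() works out: a killed reference simply stays around, and a reused swap_info_struct needs only the equivalent of percpu_ref_resurrect() before tryget can succeed again.

#include <stdbool.h>
#include <stdio.h>

struct toy_ref {
        bool initialized;       /* init has run once for this entry */
        bool dead;              /* killed by swapoff */
        long count;
};

static void ref_init(struct toy_ref *r)
{
        r->initialized = true;
        r->dead = false;
        r->count = 1;
}

static void ref_kill(struct toy_ref *r)
{
        r->dead = true;
        r->count--;             /* drop the initial reference */
}

static void ref_resurrect(struct toy_ref *r)
{
        r->dead = false;        /* reuse the killed ref instead of exit+init */
        r->count = 1;
}

static bool ref_tryget_live(struct toy_ref *r)
{
        if (!r->initialized || r->dead)
                return false;
        r->count++;
        return true;
}

int main(void)
{
        struct toy_ref users = { 0 };

        ref_init(&users);       /* first swapon of a newly allocated entry */
        ref_kill(&users);       /* swapoff */
        printf("tryget after kill: %d\n", ref_tryget_live(&users));
        ref_resurrect(&users);  /* second swapon reusing the entry */
        printf("tryget after resurrect: %d\n", ref_tryget_live(&users));
        return 0;
}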