On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe <[email protected]> wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe <[email protected]> wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
>>>> This patch converts the in_flight counter in struct hd_struct from a
>>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>>> of atomics from the hot path. When running this on a Power system, to
>>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>>> 80 fio jobs, I saw IOPS go from 1.5M IO/s to 11.4M IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS drop to 10% of the
>> non-RQF_IO_STAT case on a dual-socket ARM64 (96 cores per socket,
>> two NUMA nodes) too. The performance was basically recovered once
>> a per-NUMA-node counter was introduced for this case, but the
>> patch was never posted. If anyone is interested, I can rebase it
>> on the current block tree and post it. I guess the performance
>> issue is more or less related to the system's cache coherency
>> implementation. The issue can be observed on ARM64 with the
>> following userspace atomic counting test too:
>>
>> http://kernel.ubuntu.com/~ming/test/cache/
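
FWIW, the test is along these lines (a minimal untested sketch of that
kind of counting loop; pinning the threads to one node vs. spreading
them across both nodes, e.g. with taskset or pthread_setaffinity_np(),
is what exposes the cross-node cost):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_THREADS	4
#define ITERS		(10UL * 1000 * 1000)

static atomic_long counter;

/* all threads bounce the same cache line */
static void *worker(void *arg)
{
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		atomic_fetch_add(&counter, 1);
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	printf("%ld\n", atomic_load(&counter));
	return 0;
}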
>
> How well did the per-node thing work? Doesn't seem to me like it would
Last time on ARM64 I remember the IOPS was basically recovered, but
I don't have such a machine to test on now. Could Brian test the
attached patch to see if it works on a big Power machine?
And the idea is simple: just make the atomic counter per-node.
> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...
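
FWIW, iterating only online CPUs could mirror the existing CONFIG_SMP
part_stat_read() like the untested sketch below (part_stat_read_online()
is a made-up name, and the hotplug-time fold-over is only described in
the comment):

/*
 * Untested sketch: like part_stat_read(), but only walk online CPUs.
 * To keep the sum correct, a CPU hotplug (dead) callback would have
 * to fold the offlined CPU's counts into some online CPU's stats.
 */
#define part_stat_read_online(part, field)				\
({									\
	typeof((part)->dkstats->field) res = 0;				\
	unsigned int _cpu;						\
	for_each_online_cpu(_cpu)					\
		res += per_cpu_ptr((part)->dkstats, _cpu)->field;	\
	res;								\
})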
One observation I made on a dual-socket arm64 before is that atomic
inc/dec on a counter stored in the local NUMA node is much cheaper
than cross-node access; that is why I tried the per-node counter. And
wrt. the in-flight atomic counter, with blk-mq both the inc and the
dec should happen on CPUs belonging to the same NUMA node.
>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
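
Something like this untested sketch, I guess (all names are made up;
a dec may land in a different slot than its inc, so a single slot can
go negative, but the sum over all slots stays correct):

#define COUNTERS_PER_NODE	4

struct inflight_slot {
	atomic_t cnt;
} ____cacheline_aligned_in_smp;

struct node_inflight {
	struct inflight_slot slot[COUNTERS_PER_NODE];
};

/* 'ni' is an array with one kzalloc_node()'d entry per node */
static inline void node_inflight_inc(struct node_inflight **ni)
{
	/* stats only, so a stale CPU id after migration is harmless */
	int slot = raw_smp_processor_id() % COUNTERS_PER_NODE;

	atomic_inc(&ni[numa_node_id()]->slot[slot].cnt);
}

static inline long node_inflight_sum(struct node_inflight **ni)
{
	long sum = 0;
	int n, s;

	for_each_node(n)
		for (s = 0; s < COUNTERS_PER_NODE; s++)
			sum += atomic_read(&ni[n]->slot[s].cnt);
	return sum;
}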
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be an issue for the non-mq path, and for the blk-mq
>> path maybe we can use some blk-mq knowledge (the tagset?) to figure
>> out the 'in_flight' counter. I thought about it before, but never
>> came up with a perfect solution, and it looks a bit hard. :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about total count.
>
> Let me try that out real quick.
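
In case it helps, the read/write slow path could be done with
blk_mq_tagset_busy_iter(), along the lines of this untested sketch
(blk_mq_count_inflight()/blk_mq_in_flight() are made-up names):

struct mq_inflight {
	struct hd_struct	*part;
	unsigned int		inflight[2];
};

static void blk_mq_count_inflight(struct request *rq, void *priv,
				  bool reserved)
{
	struct mq_inflight *mi = priv;

	/* count only started requests for the partition in question */
	if (rq->part == mi->part && blk_mq_request_started(rq))
		mi->inflight[rq_data_dir(rq)]++;
}

static unsigned int blk_mq_in_flight(struct request_queue *q,
				     struct hd_struct *part)
{
	struct mq_inflight mi = { .part = part };

	blk_mq_tagset_busy_iter(q->tag_set, blk_mq_count_inflight, &mi);

	return mi.inflight[READ] + mi.inflight[WRITE];
}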
>
> --
> Jens Axboe
>
Thanks,
Ming Lei
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev,
{
struct hd_struct *p = dev_to_part(dev);
- return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
- atomic_read(&p->in_flight[1]));
+ return sprintf(buf, "%8ld %8ld\n",
+ pnode_counter_read_all(&p->in_flight[0]),
+ pnode_counter_read_all(&p->in_flight[1]));
}
#ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 96bd13e581cd..aac0b2235410 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io)
cpu = part_stat_lock();
part_round_stats(cpu, &dm_disk(md)->part0);
part_stat_unlock();
- atomic_set(&dm_disk(md)->part0.in_flight[rw],
+ pnode_counter_set(&dm_disk(md)->part0.in_flight[rw],
atomic_inc_return(&md->pending[rw]));
if (unlikely(dm_stats_used(&md->stats)))
@@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io)
* a flush.
*/
pending = atomic_dec_return(&md->pending[rw]);
- atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+ pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending);
pending += atomic_read(&md->pending[rw^0x1]);
/* nudge anyone waiting on suspend queue */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..40c9bc74a120 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -15,6 +15,7 @@
#include <linux/slab.h>
#include <linux/percpu-refcount.h>
#include <linux/uuid.h>
+#include <linux/pernode_counter.h>
#ifdef CONFIG_BLOCK
@@ -120,7 +121,7 @@ struct hd_struct {
int make_it_fail;
#endif
unsigned long stamp;
- atomic_t in_flight[2];
+ struct pnode_counter in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
@@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *part)
static inline void part_inc_in_flight(struct hd_struct *part, int rw)
{
- atomic_inc(&part->in_flight[rw]);
+ pnode_counter_inc(&part->in_flight[rw]);
if (part->partno)
- atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+ pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]);
}
static inline void part_dec_in_flight(struct hd_struct *part, int rw)
{
- atomic_dec(&part->in_flight[rw]);
+ pnode_counter_dec(&part->in_flight[rw]);
if (part->partno)
- atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+ pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]);
}
static inline int part_in_flight(struct hd_struct *part)
{
- return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+ return pnode_counter_read_all(&part->in_flight[0]) +
+        pnode_counter_read_all(&part->in_flight[1]);
}
static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
@@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev,
const char *buf, size_t count);
#endif /* CONFIG_FAIL_MAKE_REQUEST */
+static inline int hd_counter_init(struct hd_struct *part)
+{
+ if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL))
+ return -ENOMEM;
+ if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) {
+ pnode_counter_deinit(&part->in_flight[0]);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static inline void hd_counter_deinit(struct hd_struct *part)
+{
+ pnode_counter_deinit(&part->in_flight[0]);
+ pnode_counter_deinit(&part->in_flight[1]);
+}
+
static inline int hd_ref_init(struct hd_struct *part)
{
+ if (hd_counter_init(part))
+ return -ENOMEM;
+
if (percpu_ref_init(&part->ref, __delete_partition, 0,
- GFP_KERNEL))
+ GFP_KERNEL)) {
+ hd_counter_deinit(part);
return -ENOMEM;
+ }
return 0;
}
@@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part)
{
free_part_stats(part);
free_part_info(part);
+ hd_counter_deinit(part);
percpu_ref_exit(&part->ref);
}
diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counter.h
new file mode 100644
index 000000000000..127639fbc25f
--- /dev/null
+++ b/include/linux/pernode_counter.h
@@ -0,0 +1,118 @@
+#ifndef _LINUX_PERNODE_COUNTER_H
+#define _LINUX_PERNODE_COUNTER_H
+/*
+ * A simple per node atomic counter for use in block io accounting.
+ */
+
+#include <linux/smp.h>
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/gfp.h>
+
+struct node_counter {
+ atomic_t counter_in_node;
+};
+
+struct pnode_counter {
+ struct node_counter * __percpu *counter;
+ struct node_counter **nodes;
+};
+
+static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp)
+{
+ struct node_counter **nodes;
+ int i, j, cpu;
+
+ nodes = kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp);
+ if (!nodes)
+ goto err_nodes;
+
+ for_each_node(i) {
+ nodes[i] = kzalloc_node(sizeof(struct node_counter), gfp, i);
+ if (!nodes[i])
+ goto err_node_counter;
+ }
+
+ pnc->counter = alloc_percpu_gfp(struct node_counter *, gfp);
+ if (!pnc->counter)
+ goto err_node_counter;
+
+ for_each_possible_cpu(cpu)
+ *per_cpu_ptr(pnc->counter, cpu) = nodes[cpu_to_node(cpu)];
+
+ pnc->nodes = nodes;
+
+ return 0;
+
+ err_node_counter:
+ /* after a completed for_each_node() walk, i can exceed nr_node_ids */
+ for (j = 0; j < i && j < nr_node_ids; j++)
+ kfree(nodes[j]);
+ kfree(nodes);
+ err_nodes:
+ return -ENOMEM;
+}
+
+static inline void pnode_counter_deinit(struct pnode_counter *pnc)
+{
+ int i;
+
+ /* CPUs on the same node share one node_counter, so free each
+  * node exactly once via nodes[] to avoid a double free */
+ for_each_node(i)
+ kfree(pnc->nodes[i]);
+ free_percpu(pnc->counter);
+ kfree(pnc->nodes);
+}
+
+static inline void pnode_counter_inc(struct pnode_counter *pnc)
+{
+ struct node_counter *node = this_cpu_read(*pnc->counter);
+
+ atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cpu)
+{
+ struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+ atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec(struct pnode_counter *pnc)
+{
+ struct node_counter *node = this_cpu_read(*pnc->counter);
+
+ atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cpu)
+{
+ struct node_counter *node = *per_cpu_ptr(pnc->counter, cpu);
+
+ atomic_dec(&node->counter_in_node);
+}
+
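+/*
+ * Note: not atomic wrt. concurrent inc/dec; good enough for the
+ * approximate accounting that dm's start_io_acct()/end_io_acct() do.
+ */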
+static inline void pnode_counter_set(struct pnode_counter *pnc, int val)
+{
+ int i;
+ struct node_counter *node = this_cpu_read(*pnc->counter);
+
+ for_each_node(i)
+ atomic_set(&pnc->nodes[i]->counter_in_node, 0);
+ atomic_set(&node->counter_in_node, val);
+}
+
+static inline long pnode_counter_read_all(struct pnode_counter *pnc)
+{
+ int i;
+ long val = 0;
+
+ for_each_node(i)
+ val += atomic_read(&pnc->nodes[i]->counter_in_node);
+
+ return val;
+}
+
+#endif /* _LINUX_PERNODE_COUNTER_H */