Hello, Shaohua.

On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> Currently we have 2 iocontrollers. blk-throttling is bandwidth based. CFQ is

Just a nit. blk-throttle is both bw and iops based.

> weight based. It would be great if there were a unified iocontroller for the two.
> And blk-mq doesn't support ioscheduler, leaving blk-throttling the only option
> for blk-mq. It's time to have a scalable iocontroller supporting both
> bandwidth/weight based control and working with blk-mq.
>
> blk-throttling is a good candidate, it works for both blk-mq and legacy queue.
> It has a global lock which is scary for scalability, but it's not terrible in
> practice. In my test, the NVMe IOPS can reach 1M/s and I have all CPUs running IO.
> Enabling blk-throttle has around 2~3% IOPS and 10% CPU utilization impact. I'd
> expect this isn't a big problem for today's workloads. This patchset then tries
> to make a unified iocontroller. I'm leveraging blk-throttling.

Have you tried with some level, say 5, of nesting? IIRC, how it
implements hierarchical control is rather braindead (and yeah, I'm
responsible for the damage).

> The idea is pretty simple. If we know disk total bandwidth, we can calculate
> cgroup bandwidth according to its weight. blk-throttling can use the calculated
> bandwidth to throttle cgroup. Disk total bandwidth changes dramatically per IO
> pattern. Long history is meaningless. The simple algorithm in patch 1 works
> pretty well when IO pattern changes.

So, that part is fine, but I don't think it makes sense to make weight
based control either bandwidth or iops based. The fundamental problem
is that it's a false choice. It's like asking someone who wants a car
to choose between accelerator and brake. It's a choice without a good
answer. Both are wrong. Also note that there's an inherent difference
from the currently implemented absolute limits. Absolute limits can
be combined. Weights based on different metrics can't be.

Even with modern SSDs, both iops and bandwidth play major roles in
deciding how costly each IO is, and I'm fairly confident that this is
fundamental enough to remain the case for quite a while. I *think*
the cost model can be approximated from measurements; devices are
becoming more and more predictable in their behaviors, after all. For
weight based distribution, the unit of distribution should be IO time,
not bandwidth or iops.

Thanks.

-- 
tejun

