On Fri, Jan 22, 2016 at 02:09:10PM -0500, Vivek Goyal wrote: > On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote: > > On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote: > > > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote: > > > > Hello, Shaohua. > > > > > > > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote: > > > > > > The thing is that most of the possible contentions can be removed by > > > > > > implementing per-cpu cache which shouldn't be too difficult. 10% > > > > > > extra cost on current gen hardware is already pretty high. > > > > > > > > > > I did think about this. per-cpu cache does sound straightforward, but > > > > > it > > > > > could severely impact fairness. For example, we give each cpu a > > > > > budget, > > > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the > > > > > lock. > > > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, > > > > > which > > > > > breaks fairness very much. I have no idea how this can be fixed. > > > > > > > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms > > > > worth of IO cost (or bandwidth or iops) available to the cgroup. In > > > > practice, this may have to be adjusted down depending on the number of > > > > cgroups performing active IOs. For a given cgroup, B can be > > > > distributed among the CPUs that are actively issuing IOs in that > > > > cgroup. It will degenerate to round robin of small budget if there > > > > are too many active for the budget available but for most cases this > > > > will cut down most of cross-CPU traffic. > > > > > > > > > > They're way more predictable than rotational devices when measured > > > > > > over a period. I don't think we'll be able to measure anything > > > > > > meaningful at individual command level but aggregate numbers should > > > > > > be > > > > > > fairly stable. A simple approximation of IO cost such as fixed cost > > > > > > per IO + cost proportional to IO size would do a far better job than > > > > > > just depending on bandwidth or iops and that requires approximating > > > > > > two variables over time. I'm not sure how easy / feasible that > > > > > > actually would be tho. > > > > > > > > > > It still sounds like IO time, otherwise I can't imagine we can measure > > > > > the cost. If we use some sort of aggregate number, it likes a > > > > > variation > > > > > of bandwidth. eg cost = bandwidth/ios. > > > > > > > > I think cost of an IO can be approxmiated by a fixed per-IO cost + > > > > cost proportional to the size, so > > > > > > > > cost = F + R * size > > > > > > > > > > Hi Tejun, > > > > > > May be we can throw in a cost differentiation for IO direction also here. > > > This still will not take care of cost based on IO pattern, but that's > > > another level of complexity which can be added to keep track of IO pattern > > > of cgroup and bump up cost accordingly. > > > > > > Here are some random thoughts basically adding some more details to your > > > idea. > > > I am not sure whether it makes sense or not or how difficult it is to > > > implement it. > > > > > > Assume we ensure fairness in a time interval of T and have total of N > > > tokens for IO in that time interval T. When a new inteval starts, we > > > distribute these N tokens to the pending cgroups based on their weight and > > > proportional share. And keep on distributing N tokens after each time > > > interval. > > > > > > We will have to come up with some sort of cost matrix to determine how > > > many > > > tokens should be charged per IO (cost per IO). And how to adjust that cost > > > dynamically. > > > > > > Both N and T will be variable and will have to be adjusted continuously. > > > For N we could start with some initial number. If we distributed too many > > > tokens then device can handle in time T, then in next cycle we will have > > > to reduce the value of N and distribute less tokens. If we distributed > > > too less tokens and device is fast and finished in less time than T, > > > then we can start next cycle sooner and distribute more tokens for next > > > cycle. So based on device throughput in a certain time interval, number > > > of tokens issued for next cycle will vary. > > > > Note, we don't know if we dispatch too many/too less tokens. A device > > with large queue depth can accept all requests. If queue depth is 1, > > things would be easy. > > If device accepts too many requests then we will keep on increasing tokens > and cgroups will keep on submitting IOs in the proportion of their weight. > Once queue is full, then we will hit a wall and we will start decreasing > number of tokens. So I guess this should still work.
queue will never be full. Typical application drives < 32 IO depth. NVMe SSD queue can have 64k queue depth for each hardware queue according to the spec. > One problem with deep queue depths will be though that a heavy writer > will be able to fill up the queue in a very short interval and block > small readers behind it. I guess until and unless devices start doing > some prioritization of IO, this problem will be hard to solve. Driving > smaller queue depth is not an option as it makes the bandwidth drop. yep, this is the problem. Disk accepts new requests even all pending requests already exhaust its resources. > > > Initially I guess cost could be fixed also. That is say, 5 tokens for each > > > IO plus 1 token for each 4KB of IO size. If we underestimate the cost of > > > IO, then N tokens will not be consumed in time T and next time we will > > > distribute less tokens. If we overestimate the cost of IO, then N tokens > > > will finish fast and next time we will give more. So exact cost of IO > > > might not be a huge factor. > > > > we still need know the R. any idea for this? > > Hmm..., thinking loud. Will following work. > > Can we keep track of average bw and average iops of the queue. And then > use that to come up with per IO cost and BW cost. > > Say average queue bandwidth is ABW and average IOPS is AIOPS. > > So in interval T, all cgroup cumulatively can dispatch T * ABW size IO. > > A cgroup's fractional cost of IO = IO_size/(T * ABW) > > As we are supposed to dispatch N tokens in time T, cgroups cost of IO > in terms of tokens will be > > Cgroup_cost_BW = (N * IO_Size)/(T * ABW) > > Similarly, a cgroup's per IO cost based on IOPS will be. > > Cgroup_Cost_IOPS = (N * 1) /(T * AIOPS) > > So per IO per cgroup we could charge following tokens. > > Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IO > > As we are charging cgroup twice (once based on bandwidth and once based > on IOPS), may be we can half the effective cost. > > Effectivey_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IO)/2 So the cost = Cgroup_cost_BW * A + Cgroup_cost_IO * B bandwidth based: A = 1, B = 0 IOPS based: A = 0, B = 1 The proposal: A = 1/2, B = 1/2 I'm sure people will invent other A/B combinations. It's hard to say which one is better. Maybe we really should have simple ones first, eg, either bandwidth based or IOPS based, and have a knob to choose. That pretty much shows the powerless from kernel side, but that's something we can offer. Thanks, Shaohua

