On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote: > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote: > > Hello, Shaohua. > > > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote: > > > > The thing is that most of the possible contentions can be removed by > > > > implementing per-cpu cache which shouldn't be too difficult. 10% > > > > extra cost on current gen hardware is already pretty high. > > > > > > I did think about this. per-cpu cache does sound straightforward, but it > > > could severely impact fairness. For example, we give each cpu a budget, > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock. > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which > > > breaks fairness very much. I have no idea how this can be fixed. > > > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms > > worth of IO cost (or bandwidth or iops) available to the cgroup. In > > practice, this may have to be adjusted down depending on the number of > > cgroups performing active IOs. For a given cgroup, B can be > > distributed among the CPUs that are actively issuing IOs in that > > cgroup. It will degenerate to round robin of small budget if there > > are too many active for the budget available but for most cases this > > will cut down most of cross-CPU traffic. > > > > > > They're way more predictable than rotational devices when measured > > > > over a period. I don't think we'll be able to measure anything > > > > meaningful at individual command level but aggregate numbers should be > > > > fairly stable. A simple approximation of IO cost such as fixed cost > > > > per IO + cost proportional to IO size would do a far better job than > > > > just depending on bandwidth or iops and that requires approximating > > > > two variables over time. I'm not sure how easy / feasible that > > > > actually would be tho. > > > > > > It still sounds like IO time, otherwise I can't imagine we can measure > > > the cost. If we use some sort of aggregate number, it likes a variation > > > of bandwidth. eg cost = bandwidth/ios. > > > > I think cost of an IO can be approxmiated by a fixed per-IO cost + > > cost proportional to the size, so > > > > cost = F + R * size > > > > Hi Tejun, > > May be we can throw in a cost differentiation for IO direction also here. > This still will not take care of cost based on IO pattern, but that's > another level of complexity which can be added to keep track of IO pattern > of cgroup and bump up cost accordingly. > > Here are some random thoughts basically adding some more details to your idea. > I am not sure whether it makes sense or not or how difficult it is to > implement it. > > Assume we ensure fairness in a time interval of T and have total of N > tokens for IO in that time interval T. When a new inteval starts, we > distribute these N tokens to the pending cgroups based on their weight and > proportional share. And keep on distributing N tokens after each time > interval. > > We will have to come up with some sort of cost matrix to determine how many > tokens should be charged per IO (cost per IO). And how to adjust that cost > dynamically. > > Both N and T will be variable and will have to be adjusted continuously. > For N we could start with some initial number. If we distributed too many > tokens then device can handle in time T, then in next cycle we will have > to reduce the value of N and distribute less tokens. If we distributed > too less tokens and device is fast and finished in less time than T, > then we can start next cycle sooner and distribute more tokens for next > cycle. So based on device throughput in a certain time interval, number > of tokens issued for next cycle will vary.
Note, we don't know if we dispatch too many/too less tokens. A device with large queue depth can accept all requests. If queue depth is 1, things would be easy. > Initially I guess cost could be fixed also. That is say, 5 tokens for each > IO plus 1 token for each 4KB of IO size. If we underestimate the cost of > IO, then N tokens will not be consumed in time T and next time we will > distribute less tokens. If we overestimate the cost of IO, then N tokens > will finish fast and next time we will give more. So exact cost of IO > might not be a huge factor. we still need know the R. any idea for this? Thanks, SHaohua

