The specific question in this thread is just one instance of the more general question of how I/O performance is determined, how you can understand it, and how you can plan for and achieve it deterministically. I will attempt to answer this general question as simply as possible.
I/O throughput is a function of load level and response time. Response time, in turn, is a function of load level for each I/O size and access type combination in the workload, in its relative proportion, against a given resource. This is much simpler than it sounds. For example, "50 threads of 80% 8 KB Random Read, 20% 64 KB Sequential Write" is a complete specification of an I/O workload and its composition (complete because the proportions sum to 100%). If, when applied to a specific resource, the response time for 50 threads of 8 KB random read is 30 ms and the response time for 50 threads of 64 KB sequential write is 15 ms, then the combined workload on that resource will have a response time of 0.80 * 30 + 0.20 * 15 = 27 ms.

Note that the workload description is independent of response time until the resource has been specified, and that the total load level is divided across the number of resource units. So while the composition in terms of I/O size and access type is a given, the load level is managed by how many resource units you configure. If 50 threads are spread over 5 units of resource, the response time for the given composition is based on 10 threads per unit instead of 50, and will be lower in direct proportion to the reduced load level at each unit.

The response time of each I/O size and access type combination, weighted by its relative portion at a given load level, determines the overall response time. That in turn determines the instantaneous throughput, which by Little's Law is load level divided by response time: in this example, 50 / 0.027 = 1851 IOPS @ 27 ms. In all cases, you can decompose the I/O workload into proportions of I/O size and access type combinations that sum to 100%; the composite response time is the weighted average of their response times, and the instantaneous throughput is the total load level divided by that composite response time.
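To make the arithmetic concrete, here is a minimal Python sketch of the weighted-composite calculation and Little's Law. The 30 ms and 15 ms component response times are the measured values from the example above, not derived by the code:

```python
def composite_response_time(components):
    """Weighted average response time over (fraction, response_time_s) pairs.
    Fractions must sum to 1.0 (i.e., 100% of the workload composition)."""
    assert abs(sum(f for f, _ in components) - 1.0) < 1e-9
    return sum(f * rt for f, rt in components)

def instantaneous_throughput(load_level, response_time_s):
    """Little's Law: throughput (IOPS) = concurrency / response time."""
    return load_level / response_time_s

# 50 threads of 80% 8 KB Random Read / 20% 64 KB Sequential Write.
mix = [(0.80, 0.030),   # 8 KB Random Read at 30 ms (measured)
       (0.20, 0.015)]   # 64 KB Sequential Write at 15 ms (measured)

rt = composite_response_time(mix)        # 0.027 s, i.e. 27 ms
iops = instantaneous_throughput(50, rt)  # 50 / 0.027, about 1851 IOPS

print(f"{rt * 1000:.0f} ms, {int(iops)} IOPS")
```

Spreading the same composition over 5 resource units just means calling `instantaneous_throughput` with the per-unit load of 10 threads and the (lower) per-unit response times measured at that load.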
The instantaneous throughput per load level is then weighted by the probability density function of load level and integrated over the range of load level. You can obtain the probability density function empirically, or estimate it with a Poisson distribution. For most cases, however, a simple calculation of the mean is all you need to see whether you are where you need to be relative to the target SLA; the distribution is only needed if you want to estimate variation and other modes.

As a side note, this analysis is based on the gradient field of response time. It therefore represents a conservative field, and the above-mentioned integral, which defines Work in the formal sense, depends only on the start and end positions of load level; it is path independent. Intuitively, this means the ebb and flow of the workload within the boundaries defined by the proportions of I/O size and access type combinations need not concern us. All we need are the relative portions of the composition and the range of load level, and we can define the expected throughput with great accuracy, at least for any one system state.

To design a configuration that delivers a specific response time for a given load level and composition, divide the load level by the number of resources needed to bring the per-resource load level into the desired range for that composition. The total aggregate load level divided by the resulting response time determines the instantaneous throughput of the system. The total throughput is then determined by how sustained each load level is, which is where the distribution comes in. But again, you do not need the distribution to set an expectation for the mean: just use the average load level divided by the average response time, and that is the mean expected capability, in IOPS, of the configuration. How much of that capability is used depends on the arrival rate and defines capability utilization.
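A small sketch of the mean-based estimate described above. The sampled load levels, the response-time curve, and the observed arrival rate are all hypothetical placeholders standing in for measured data, not values from this thread:

```python
from statistics import mean

def response_time_at(load_level):
    """Hypothetical measured response-time curve (seconds) vs. load level.
    In practice this comes from measurement, not a formula."""
    return 0.005 + 0.0004 * load_level  # placeholder: grows with load

# Hypothetical sampled concurrency (load level) over an observation window.
load_samples = [10, 20, 30, 40, 50]

# Mean expected capability: average load level / average response time.
avg_load = mean(load_samples)                                  # 30
avg_rt = mean(response_time_at(n) for n in load_samples)       # 0.017 s
capability_iops = avg_load / avg_rt

# Capability utilization: observed arrival rate relative to capability.
arrival_rate = 1200.0  # hypothetical observed IOPS
utilization = arrival_rate / capability_iops

print(f"capability {capability_iops:.0f} IOPS, utilization {utilization:.0%}")
```

Only when you want variation and other modes do you replace the simple means with the full load-level distribution (empirical, or a Poisson estimate) and weight the per-load throughput by it.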
A Lamborghini going 1 MPH is 100% busy but far from 100% capability. To understand I/O performance is to know how many threads of what I/O size and access type combination are being serviced by what kind, and how many, of a given resource. Filesystems, volume managers, and HW RAID devices all transform the workload, so the load level and composition issued by the application are very different from the load level and composition issued to the resource. This is the general answer to any I/O performance question, of which the current thread is one specific example.

As for ZFS: ZFS changes everything ;-) I think of ZFS first as a masterpiece of functionality and ease of use. From a performance-predictability standpoint, however, it is extremely challenging, as the workload issued to the pool is not a proper transformation of the workload issued to the filesystems. There is a ton of speculative prefetch going on; in fact, the actual requested I/O contained in any given read at the pool level is very small compared to the amount that is speculative. Writes are always converted to full stripes of the filesystem recsize divided by the number of drives in a vdev, then coalesced vertically into 128 KB to 32 KB per disk, depending on the recsize, and written sequentially.

The same rules for I/O throughput discussed above still apply to ZFS, but with ZFS it is not clear how much of any given I/O is the requested I/O and how much comes from copy-on-write on the write side or speculative prefetch on the read side. It will take some time to fully understand it, but I think it will be well worth the effort; it is a remarkable advancement in filesystem technology.

Regards,
The ORtera man ;-)

This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org