> I am not sure how zfs would know the rate of the
> underlying disk storage
Easy: is the buffer growing? :-) If the amount of data in the buffer is growing, you need to throttle back a bit until the disks catch up. Don't stop writes until the buffer is empty; just slow them down to match the rate at which you're clearing data from the buffer.

In your case I'd expect ZFS to buffer the early part of the write (so you'd see a very quick initial burst), but from then on you'd want a continual stream of data to disk at a steady rate. To the client it should respond just like storing to disk; the only difference is a small delay before the data actually hits the disk, proportional to the buffer size. ZFS would have less opportunity to optimize writes, but you wouldn't get such stuttering performance.

However, reading through the other messages: if it's a known bug, with ZFS blocking reads while writing, there may be no need for this idea. But then, that bug has been open since 2006, is flagged as "fix in progress", and was planned for snv_51... o_0. So it probably is worth having this discussion.

I may be completely wrong here, but reading that bug, it sounds like ZFS issues a whole batch of writes at once as it clears the buffer, which ties in with the reports of stalls actually being caused by blocked reads. I'm guessing that, given ZFS's aims, it made sense to code it that way: if you're going to queue a batch of transactions to make them efficient on disk, you don't want to interrupt that batch with a bunch of other (less efficient) reads. But the unintended side effect is that ZFS's attempt to optimize writes will cause jerky read and write behaviour whenever a large amount of writing is going on. Just when you should be pushing the disks to 100% usage, you'll never reach it, because you always get 5s of inactivity followed by 5s of running the disks flat out.
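To make the "is the buffer growing?" idea concrete, here's a rough sketch in Python of that kind of feedback throttle. To be clear: this is purely illustrative, all the names are made up, and it bears no relation to how ZFS's write code is actually implemented.

```python
class WriteThrottle:
    """Toy model: admit client writes at roughly the rate the disks
    drain them, instead of buffering everything and flushing at once."""

    def __init__(self):
        self.buffered = 0        # bytes currently sitting in the buffer
        self.prev_buffered = 0   # buffer level at the previous interval

    def admit(self, incoming, drained):
        """One control interval: `incoming` bytes arrive from clients,
        `drained` bytes were flushed to disk. Returns how many of the
        incoming bytes to accept this interval."""
        self.buffered = max(0, self.buffered - drained)
        if self.buffered > 0 and self.buffered >= self.prev_buffered:
            # Buffer is non-empty and not shrinking: the disks aren't
            # keeping up, so only admit roughly what we drained.
            accepted = min(incoming, drained)
        else:
            # Buffer empty or shrinking: accept everything offered.
            accepted = incoming
        self.prev_buffered = self.buffered
        self.buffered += accepted
        return accepted


t = WriteThrottle()
print(t.admit(100, 10))  # initial burst is absorbed by the buffer: 100
print(t.admit(100, 10))  # buffer grew, so clients are slowed to 10
print(t.admit(100, 10))  # steady state: admit at the drain rate, 10
```

The point is that clients see a brief fast burst while the buffer fills, then settle at the disks' real throughput, rather than alternating between full speed and a stall.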
In fact, I wonder if it's as simple as the disks ending up doing 5s of reads, a delay for processing, 5s of writes, 5s of reads, etc. It's probably efficient, but it's going to *feel* horrible: a 5s delay is easily noticeable by the end user and is a deal breaker for many applications. In situations like that, 5s is a *huge* amount of time, especially if you're writing to a disk or storage device which has its own caching!

Might it be possible to keep the 5s buffer for ordering transactions, but then commit it as a larger number of small transactions instead of one huge one? The number of transactions could even be based on how busy the system is; if there are a lot of reads coming in, I'd be quite happy to split it into 50 transactions. On 10GbE, 5s is potentially 6.25GB of data. Even split into 50 transactions you're writing 128MB at a time, and that sounds plenty big enough to me!

Either way, something needs to be done. If we move to ZFS, our users are not going to be impressed with 5s delays on the storage system.

Finally, I do have one question for the ZFS guys: how does the L2ARC interact with this? Are reads from the L2ARC blocked, or will they happen in parallel with the writes to the main storage? I suspect that a large L2ARC (potentially made up of SSDs) would eliminate this problem the majority of the time.
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss