Hi everyone, This is my first post to zfs-discuss, so be gentle with me :-)
I've been doing some testing with ZFS - in particular, checkpointing the large, proprietary in-memory database which is a key part of the application I work on. In doing this I've found what seems to be some fairly unhelpful write throttling behaviour from ZFS.

In summary, the environment is:

* An x4600 with 8 CPUs and 128GBytes of memory
* A 50GByte in-memory database
* A big, fast disk array (a 6140 with a LUN comprised of 4 SATA drives)
* Solaris 10 update 4 (problems initially seen on U3, so I got it patched)

The problems happen when I checkpoint the database, which involves putting the database on disk as quickly as possible using the write(2) system call. The first time the checkpoint is run, it's quick - about 160MBytes/sec - even though the disk array is only sustaining 80MBytes/sec. So we're dirtying stuff in the ARC (and growing the ARC) at a pretty impressive rate.

After letting the IO subside, running the checkpoint again results in very different behaviour. It starts running very quickly, again at 160MBytes/sec (with the underlying device doing 80MBytes/sec), but after a while - presumably once the ARC is full - things go badly wrong. In particular, a write(2) system call hangs for 6-7 minutes, apparently until all the outstanding IO is done. Any reads from that device also take a huge amount of time, making the box very unresponsive.

Obviously this isn't good behaviour, but it's particularly unfortunate given that this checkpoint data is stuff I don't want to retain in any kind of cache anyway - in fact, preferably I wouldn't pollute the ARC with it in the first place. But it seems directio(3C) doesn't work with ZFS (unsurprisingly, as I guess it's implemented in segmap), and madvise(..., MADV_DONTNEED) doesn't drop data from the ARC (again, I guess, because it works on segmap/segvn).

Of course, limiting the ARC size to something fairly small makes it behave much better. But this isn't really the answer.
I also tried using O_DSYNC, which stops the pathological behaviour but makes things pretty slow - I only get a maximum of about 20MBytes/sec, which is obviously much less than the hardware can sustain.

It sounds like we could do with different write throttling behaviour to head this sort of thing off. Ideally there would also be some way of telling ZFS not to bother keeping these pages in the ARC at all; that appears to be bug 6429855. But the underlying throttling behaviour doesn't really seem desirable either - are there plans afoot to do any work on ZFS write throttling to address this kind of thing?

Regards,

-- 
Philip Beevers
Fidessa Infrastructure Development
mailto:[EMAIL PROTECTED]
phone: +44 1483 206571

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss