On Aug 30, 2007, at 12:33 PM, Jeffrey W. Baker wrote:

> On Thu, 2007-08-30 at 12:07 -0700, eric kustarz wrote:
>> Hey jwb,
>>
>> Thanks for taking up the task, it's benchmarking so i've got some
>> questions...
>>
>> What does it mean to have an external vs. internal journal for ZFS?
>
> This is my first use of ZFS, so be gentle. External == ZIL on a
> separate device, e.g.
>
> zpool create tank c2t0d0 log c2t1d0

Ok, cool, that's the way to do it. I'm always curious to see if people
know about some of the new features in ZFS (and then there's the game of
matching lingo - "separate intent log" <-> "external journal").

So the ZIL is responsible for handling "synchronous" operations (O_DSYNC
writes, file creates over NFS, fsync, etc.). I actually don't see
anything in the tests you ran that would stress this aspect (it looks
like randomio is doing 1% fsyncs). If you did, then you'd want to have
more log devices (ie: a stripe of them).
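For example, something like this should get you a striped pair of log
devices (c2t7d0 here is just a stand-in for whatever extra device you
have free):

# zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 \
      mirror c2t4d0 c2t5d0 log c2t6d0 c2t7d0

or, on an existing pool:

# zpool add tank log c2t7d0

Every device listed after 'log' (without a 'mirror' in front of it) gets
dynamically striped, just like the regular data vdevs.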
>> Can you show the output of 'zpool status' when using software RAID
>> vs. hardware RAID for ZFS?
>
> I blew away the hardware RAID but here's the one for software:

Ok, for the hardware RAID config, to do a fair comparison you'd just
want to do a RAID-0 in ZFS, so something like:

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

We call this "dynamic striping" in ZFS.

> # zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t0d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t2d0  ONLINE       0     0     0
>             c2t3d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t4d0  ONLINE       0     0     0
>             c2t5d0  ONLINE       0     0     0
>         logs        ONLINE       0     0     0
>           c2t6d0    ONLINE       0     0     0
>
> errors: No known data errors
>
> iostat shows balanced reads and writes across t[0-5], so I assume this
> is working.

Cool, makes sense.

>> The hardware RAID has a cache on the controller. ZFS will flush the
>> "cache" when pushing out a txg (essentially before writing out the
>> uberblock and after writing out the uberblock). When you have a
>> non-volatile cache with battery backing (such as your setup), it's
>> safe to disable that by putting 'set zfs:zfs_nocacheflush = 1' in
>> /etc/system and rebooting.
>
> Do you think this would matter? There's no reason to believe that the
> RAID controller respects the flush commands, is there? As far as the
> operating system is concerned, the flush means that data is in
> non-volatile storage, and the RAID controller's cache/disk
> configuration is opaque.

From my experience dealing with some Hitachi and LSI devices, it makes a
big difference (depending, of course, on the workload). ZFS needs to
flush the cache for every transaction group (aka "txg") and for ZIL
operations. The txg happens about every 5 seconds. The ZIL operations
are of course dependent on the workload, so a workload that does lots of
synchronous writes will trigger lots of ZIL operations, which will
trigger lots of cache flushes.

For ZFS, we can safely enable the write cache on a disk - and part of
that requires that we flush the write cache at specific times. However,
syncing the non-volatile cache on a controller (with battery backup)
doesn't make sense (and some devices will actually flush their cache),
and it can really hurt performance for workloads that flush a lot.
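To be concrete, the change amounts to adding this one line to
/etc/system and rebooting:

set zfs:zfs_nocacheflush = 1

Afterwards, mdb is one way to check the live value - the output should
look roughly like this, and a value of 1 means ZFS has stopped sending
the synchronize-cache commands:

# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               1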
>> What parameters did you give bonnie++? compiled 64bit, right?
>
> Uh, whoops. As I freely admit this is my first encounter with
> opensolaris, I just built the software on the assumption that it would
> be 64-bit by default. But it looks like all my benchmarks were built
> 32-bit. Yow. I'd better redo them with -m64, eh?
>
> [time passes]
>
> Well, results are _substantially_ worse with bonnie++ recompiled at
> 64-bit. Way, way worse. 54MB/s linear reads, 23MB/s linear writes,
> 33MB/s mixed.

Hmm, what are your parameters?

>> For the randomio test, it looks like you used an io_size of 4KB. Are
>> those aligned? random? How big is the '/dev/sdb' file?
>
> Randomio does aligned reads and writes. I'm not sure what you mean
> by /dev/sdb? The file upon which randomio operates is 4GiB.

Sorry, i was grabbing "/dev/sdb" from the "http://arctic.org/~dean/randomio/"
link (that was kinda silly). Ok cool, just making sure the file wasn't
completely cacheable.

Another thing to know about ZFS is that it has a variable block size
(that maxes out at 128KB). And since ZFS is COW, we can grow the block
size on demand. For instance, if you just create a small file, say 1B,
your block size is 512B. If you grow it to 513B, we double you to 1KB,
etc.

Why it matters here (and you see this especially with databases) is that
this particular benchmark is doing aligned random 2KB reads/writes. If
the file is big, then all of its blocks will max out at the biggest
allowable block size for that file system (which by default is 128KB).
Which means, if you need to read in 2KB and have to go to disk, then
you're really reading in 128KB. Most other filesystems have a blocksize
of 8KB.

We added a special property (recordsize) to accommodate workloads/apps
like this benchmark. By setting the recordsize property to 2K, that will
make the maximum blocksize 2KB (instead of 128KB) for that file system.
You'll see a nice win. To set it, try:

fsh-hake# zfs set recordsize=2k tank
fsh-hake# zfs get recordsize tank
NAME  PROPERTY    VALUE  SOURCE
tank  recordsize  2K     local
fsh-hake#

>> Do you have the parameters given to FFSB?
>
> The parameters are linked on my page.

Whoops, my bad. Let me go take a look.

eric
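P.S. One more note on recordsize while I'm thinking about it: it only
applies to files written after the property is changed, so you'd want to
recreate the randomio test file once you've set it. And rather than
changing the top-level 'tank' filesystem, you could give the benchmark
its own filesystem and tune just that (the 'tank/randomio' name here is
only an example):

fsh-hake# zfs create tank/randomio
fsh-hake# zfs set recordsize=2k tank/randomio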