On Aug 30, 2007, at 12:33 PM, Jeffrey W. Baker wrote:

> On Thu, 2007-08-30 at 12:07 -0700, eric kustarz wrote:
>> Hey jwb,
>>
>> Thanks for taking up the task, it's benchmarking so i've got some
>> questions...
>>
>> What does it mean to have an external vs. internal journal for ZFS?
>
> This is my first use of ZFS, so be gentle. External == ZIL on a
> separate device, e.g.
>
> zpool create tank c2t0d0 log c2t1d0

Ok, cool, that's the way to do it. I'm always curious to see if people
know about some of the new features in ZFS (and then there's the game of
matching lingo - "separate intent log" <-> "external journal").

So the ZIL is responsible for handling "synchronous" operations (O_DSYNC
writes, file creates over NFS, fsync, etc.). I actually don't see
anything in the tests you ran that would stress this aspect (it looks
like randomio is doing 1% fsyncs). If you did, then you'd want to have
more log devices (ie: a stripe of them).
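For example, something like this should get you a striped pair of log
devices (c2t7d0 here is just a stand-in for whatever extra device you
have free):

# zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 \
      mirror c2t4d0 c2t5d0 log c2t6d0 c2t7d0

or, on an existing pool:

# zpool add tank log c2t7d0

Every device listed after 'log' (without a 'mirror' in front of it) gets
dynamically striped, just like the regular data vdevs.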
>> Can you show the output of 'zpool status' when using software RAID
>> vs. hardware RAID for ZFS?
>
> I blew away the hardware RAID but here's the one for software:

Ok, for the hardware RAID config, to do a fair comparison you'd just
want to do a RAID-0 in ZFS, so something like:

# zpool create tank c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

We call this "dynamic striping" in ZFS.

> # zpool status
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t0d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t2d0  ONLINE       0     0     0
>             c2t3d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c2t4d0  ONLINE       0     0     0
>             c2t5d0  ONLINE       0     0     0
>         logs        ONLINE       0     0     0
>           c2t6d0    ONLINE       0     0     0
>
> errors: No known data errors
>
> iostat shows balanced reads and writes across t[0-5], so I assume this
> is working.

Cool, makes sense.

>> The hardware RAID has a cache on the controller. ZFS will flush the
>> "cache" when pushing out a txg (essentially before writing out the
>> uberblock and after writing out the uberblock). When you have a
>> non-volatile cache with battery backing (such as your setup), it's
>> safe to disable that by putting 'set zfs:zfs_nocacheflush = 1' in
>> /etc/system and rebooting.
>
> Do you think this would matter? There's no reason to believe that the
> RAID controller respects the flush commands, is there? As far as the
> operating system is concerned, the flush means that data is in
> non-volatile storage, and the RAID controller's cache/disk
> configuration is opaque.

From my experience dealing with some Hitachi and LSI devices, it makes a
big difference (depending, of course, on the workload). ZFS needs to
flush the cache for every transaction group (aka "txg") and for ZIL
operations. The txg happens about every 5 seconds. The ZIL operations
are of course dependent on the workload, so a workload that does lots of
synchronous writes will trigger lots of ZIL operations, which will
trigger lots of cache flushes.

For ZFS, we can safely enable the write cache on a disk - and part of
that requires that we flush the write cache at specific times. However,
syncing the non-volatile cache on a controller (with battery backup)
doesn't make sense (and some devices will actually flush their cache),
and it can really hurt performance for workloads that flush a lot.
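To be concrete, the change amounts to adding this one line to
/etc/system and rebooting:

set zfs:zfs_nocacheflush = 1

Afterwards, mdb is one way to check the live value - the output should
look roughly like this, and a value of 1 means ZFS has stopped sending
the synchronize-cache commands:

# echo zfs_nocacheflush/D | mdb -k
zfs_nocacheflush:
zfs_nocacheflush:               1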
>> What parameters did you give bonnie++? compiled 64bit, right?
>
> Uh, whoops. As I freely admit this is my first encounter with
> opensolaris, I just built the software on the assumption that it would
> be 64-bit by default. But it looks like all my benchmarks were built
> 32-bit. Yow. I'd better redo them with -m64, eh?
>
> [time passes]
>
> Well, results are _substantially_ worse with bonnie++ recompiled at
> 64-bit. Way, way worse. 54MB/s linear reads, 23MB/s linear writes,
> 33MB/s mixed.

Hmm, what are your parameters?

>> For the randomio test, it looks like you used an io_size of 4KB. Are
>> those aligned? random? How big is the '/dev/sdb' file?
>
> Randomio does aligned reads and writes. I'm not sure what you mean
> by /dev/sdb? The file upon which randomio operates is 4GiB.

Sorry, i was grabbing "/dev/sdb" from the "http://arctic.org/~dean/randomio/"
link (that was kinda silly). Ok cool, just making sure the file wasn't
completely cacheable.

Another thing to know about ZFS is that it has a variable block size
(that maxes out at 128KB). And since ZFS is COW, we can grow the block
size on demand. For instance, if you just create a small file, say 1B,
your block size is 512B. If you grow it to 513B, we double you to 1KB,
etc.

Why it matters here (and you see this especially with databases) is that
this particular benchmark is doing aligned random 2KB reads/writes. If
the file is big, then all of its blocks will max out at the biggest
allowable block size for that file system (which by default is 128KB).
Which means, if you need to read in 2KB and have to go to disk, then
you're really reading in 128KB. Most other filesystems have a blocksize
of 8KB.

We added a special property (recordsize) to accommodate workloads/apps
like this benchmark. By setting the recordsize property to 2K, that will
make the maximum blocksize 2KB (instead of 128KB) for that file system.
You'll see a nice win. To set it, try:

fsh-hake# zfs set recordsize=2k tank
fsh-hake# zfs get recordsize tank
NAME  PROPERTY    VALUE  SOURCE
tank  recordsize  2K     local
fsh-hake#

>> Do you have the parameters given to FFSB?
>
> The parameters are linked on my page.

Whoops, my bad. Let me go take a look.

eric
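P.S. One more note on recordsize while I'm thinking about it: it only
applies to files written after the property is changed, so you'd want to
recreate the randomio test file once you've set it. And rather than
changing the top-level 'tank' filesystem, you could give the benchmark
its own filesystem and tune just that (the 'tank/randomio' name here is
only an example):

fsh-hake# zfs create tank/randomio
fsh-hake# zfs set recordsize=2k tank/randomio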