I was going through this posting and it seems that there is some "personal 
tension" :). 

However, going back to the technical problem of scrubbing a 200 TB pool, I 
think this issue needs to be addressed. 

One warning up front: this writing is rather long, so if you would like to 
skip ahead to the part dealing with scrub, jump to "Scrub implementation" below.

From my perspective: 

  - ZFS is great for huge amounts of data 

That's what it was made for, with 128-bit addressing and a JBOD design in mind. 
So ZFS is perfect for internet multimedia in terms of scalability. 

  - ZFS is great for commodity hardware

OK, you should use 24x7-rated drives, but 2 TB 7200 RPM disks are fine for 
internet media mass storage. We want huge amounts of data stored, and in the 
internet age nobody pays for this. So you must use low-cost hardware (well, it 
must be compatible) - but you should not need enterprise components - that's 
what we have ZFS as clever software for. For mass storage internet services, 
the alternative is NOT EMC or NetApp (remember, nobody pays a lot for the 
service because you can get it for free at Google) - the alternative is 
Linux-based HW RAID (with its well-known limitations) and home-grown solutions. 
Those do not have the nice ZFS features mentioned below.

  - ZFS guarantees data integrity by self-healing silent data corruption 
(that's what the checksums are for) - but only if you have redundancy. 

There are a lot of posts on the net about when people notice bad blocks - it 
happens when a disk in a RAID5 fails and they have to resilver everything. 
That is when they discover the missing redundancy. So people use RAID6 and 
"hope" that everything works, or they run scrubs on their advanced RAID 
controllers (if those provide internal checksumming). 

The same problem exists for huge, passive, raidz1 data sets in ZFS. If you do 
not scrub the array regularly, the chances are higher that you will hit a bad 
block during resilvering, and then ZFS cannot help. For active data sets the 
problem is not as critical, because the checksum is verified on every read - 
but the problem still exists, because once a block sits in the ARC nobody 
checks it again. So we need scrub! 
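
To put a rough number on that risk, here is a quick back-of-envelope 
calculation in Python (purely illustrative - it assumes a 7-disk raidz1 vdev 
of 2 TB drives and the commonly quoted consumer-drive spec of one 
unrecoverable read error per 1e14 bits read):

URE_RATE = 1e-14          # assumed: unrecoverable read errors per bit read
DRIVE_TB = 2              # 2 TB drives, as in the post
SURVIVING_DRIVES = 6      # assumed: 7-disk raidz1 vdev with one disk failed

bits_to_read = SURVIVING_DRIVES * DRIVE_TB * 1e12 * 8   # bits read during a full resilver
p_clean = (1 - URE_RATE) ** bits_to_read                 # chance of no URE at all
print(f"P(at least one URE during resilver) ~= {1 - p_clean:.0%}")
# ~62% with these assumptions: the latent bad block you never scrubbed away
# is exactly the one that bites you while rebuilding the failed disk.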

  - ZFS can do many other nice things 

There's compression, dedup etc. - however, I look at them as "nice to have".

  - ZFS needs proper pool design 

Using ZFS right is not easy, and sizing the system is even more complicated. 
There are a lot of threads regarding pool design - the easiest answer is "do a 
lot of mirrors", because then the read performance really scales. However, in 
internet mass media services you can't do that - it is too expensive, because 
mirrored ZFS costs more than HW RAID6 with Linux. How many members per vdev? 
Multiple pools or a single pool? 
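
To illustrate the cost argument, here is a small capacity sketch (the numbers 
are only illustrative: 120 x 2 TB drives, arranged either as 2-way mirrors or 
as 10-wide raidz2 vdevs):

drives, size_tb = 120, 2                      # illustrative shelf of disks
raw_tb = drives * size_tb

mirror_tb = raw_tb * 0.5                      # 2-way mirrors: 50% efficiency
raidz2_tb = raw_tb * (8 / 10)                 # 10-wide raidz2 (8 data + 2 parity): 80%

print(f"raw {raw_tb} TB -> mirrors {mirror_tb:.0f} TB, raidz2 {raidz2_tb:.0f} TB")
# raw 240 TB -> mirrors 120 TB, raidz2 192 TB: wider raidz vdevs buy effective
# capacity, at the price of longer resilvers - which is why scrubbing matters.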

  - ZFS is open and community based 

... well, let's see how this goes with Oracle "financing" the whole thing :)

And some of those points make ZFS a hit for internet service providers and 
mass media requirements (VOD etc.)!

So what's my point, you may ask? 

My experience with ZFS is that some points are simply not addressed well 
enough yet - BUT - ZFS is a living piece of software, and thanks to the many 
great people developing it, it evolves faster than all the other storage 
solutions. So in the longer term I believe ZFS will (hopefully) gain all the 
"enterprise-ness" it needs and it will revolutionize the storage industry 
(like Cisco did with networking). I really believe that. 

From my perspective, some of the points not addressed well in ZFS are:

  - pool defragmentation - you need this for a COW filesystem 

I think the ZFS developers are working on this with the background rewriter, 
so I hope it will arrive in 2010. With the rewriter, the on-disk layout can be 
optimized for read performance with sequential workloads - also for raidz1 and 
raidz2 - meaning ZFS can compete with RAID5 and RAID6, even with wider vdevs. 
And wider vdevs mean more effective capacity. If the vdev read-ahead cache 
works nicely with a sequentially aligned on-disk layout, then (from-disk) read 
performance will be great.

  - IO priorization for zvols / zfs filesystems (aka Storage QoS)

Unfortunately, you cannot prioritize I/O to ZFS filesystems and zvols right 
now. I think this is one of the features whose absence makes ZFS unsuitable 
for 1st tier storage (like EMC Symmetrix or the NetApp FAS6000 series). You 
need prioritization here, because your SAP system really is more important 
than my MP3 web server :)

  - Deduplication not ready for production

Currently dedup is nice, but the DDT handling and memory sizing are tricky and 
hardly usable for larger pools (my perspective). The DDT is handled like any 
other component, meaning user I/O can push the DDT out of the ARC (and the 
L2ARC) - even with primarycache=secondarycache=metadata. For typical mass 
media storage applications the working set is much larger than the memory (and 
L2ARC), meaning your DDT will come from disk - causing real performance 
degradation.

This is especially true for COMSTAR environments with small block sizes (8k 
etc.). Here the DDT becomes really huge. Once it no longer fits into memory - 
bummer.

There is an open bug for this: 
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566 and I hope 
it will be addressed soon.
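
To show how quickly the DDT outgrows memory, here is a rough sizing sketch 
(the ~320 bytes per in-core DDT entry is only a commonly quoted ballpark 
figure, not an exact number, and the worst case of all-unique blocks is 
assumed):

BYTES_PER_ENTRY = 320          # assumed in-core cost per dedup table entry (ballpark)
pool_bytes = 200e12            # 200 TB pool, as discussed in this thread

for block_size in (128 * 1024, 8 * 1024):    # 128k filesystem vs 8k COMSTAR zvol
    entries = pool_bytes / block_size        # worst case: every block unique
    ddt_gb = entries * BYTES_PER_ENTRY / 1e9
    print(f"{block_size // 1024:>4}k blocks -> ~{entries:.1e} entries, ~{ddt_gb:,.0f} GB DDT")
# With these assumptions: ~0.5 TB of DDT at 128k blocks, but ~8 TB at 8k
# blocks - far beyond any realistic ARC/L2ARC, so every dedup lookup becomes
# random disk I/O.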

  - Scrub implementation 

And that's the point mentioned in this thread. Currently you cannot manage 
scrub very well. Pre snv_133, scrub was very aggressive: when a scrub was 
running, you really had a dramatic performance penalty (more full strokes 
etc., latency goes up). Post snv_133, scrub is less aggressive - sounds nice - 
but this makes scrubbing take longer, which is bad if your disk has failed. So 
either way it is not optimal, and I believe a system cannot automate this 
process very well. To schedule scrub I/O right, the storage system would have 
to "predict" the future I/O access pattern, and I believe this is not 
possible. So scrubbing must be manageable by the storage user. 

For very large pools another problem comes up. You simply cannot scrub a 
200TB++ pool over the weekend - it WILL take longer. However, your users will 
come in on Monday and they want to work. 
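
A quick duration estimate makes the point (the aggregate scrub throughput 
figures are only assumptions):

POOL_TB = 200
for mb_per_s in (500, 1000, 2000):                       # assumed scrub throughput
    days = POOL_TB * 1e12 / (mb_per_s * 1e6) / 86400
    print(f"{mb_per_s:>5} MB/s -> {days:.1f} days")
# 500 MB/s -> 4.6 days, 1000 MB/s -> 2.3 days, 2000 MB/s -> 1.2 days,
# and that is before competing user I/O slows the scrub down even further.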

Currently scrub cannot be prioritized, paused, aborted or resumed. This makes 
it very difficult to manage. If scrub could be prioritized, paused, resumed 
and aborted, the admin (or a management software like NexentaStor) could 
schedule the scrub according to a user policy (OK, we scrub weekdays 18:00 - 
20:00 at low priority, 20:00 - 04:00 at high priority, and 04:00 - 18:00 we do 
not scrub at all - BUT if a device is degraded, we resilver at maximum 
priority).
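
As a sketch of what such a policy loop could look like, see below. The 
pause/resume/priority controls are hypothetical placeholders - exactly the 
knobs that do not exist today - and the pool name "tank" is made up:

import datetime
import time

def set_scrub_state(pool, state, priority=None):
    """Placeholder for the wished-for interface - think of something like a
    hypothetical 'zpool scrub --pause tank' or 'zpool scrub --priority=low tank'."""
    print(f"{datetime.datetime.now():%H:%M} {pool}: scrub -> {state} ({priority})")

def desired_state(now, degraded):
    if degraded:                        # a device failed: repair at full speed
        return ("running", "max")
    if 18 <= now.hour < 20:             # 18:00-20:00: scrub at low priority
        return ("running", "low")
    if now.hour >= 20 or now.hour < 4:  # 20:00-04:00: scrub at high priority
        return ("running", "high")
    return ("paused", None)             # 04:00-18:00: do not scrub at all

def policy_loop(pool="tank", poll_seconds=300):
    current = None
    while True:
        # degraded detection is left out of the sketch
        target = desired_state(datetime.datetime.now(), degraded=False)
        if target != current:           # only act when the policy window changes
            set_scrub_state(pool, *target)
            current = target
        time.sleep(poll_seconds)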

So I think this feature is VERY important, because it makes the nice features 
of ZFS (checksumming, data integrity) really usable for enterprise use cases 
AND large data sets. Otherwise those features are "nice - but you can't really 
use them with more than 24TB". 

Conclusion (in the context of scrub): 

  - People want 200TB++ pools for mass media applications
  - People will use huge low-cost drives to store huge amounts of data
  - People don't use mirrors because 50% effective capacity is not enough 
  - People NEED to scrub to repair bad blocks, because the drives are cheap 
and high capacity
    --> Scrub / resilver management needs to be improved.

There are a lot of open bugs for this: 

  - http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992
  - http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473 
...
  - http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6888481

So I hope this will be addressed for scrub and resilver. 

Regards, 
Robert 

P.S. Sorry for the rather long write-up :)