I was going through this posting and it seems there is some "personal tension" :).
However, going back to the technical problem of scrubbing a 200 TB pool, I think this issue needs to be addressed. One warning up front: this writing is rather long, and if you would like to jump straight to the part dealing with scrub, jump to "Scrub implementation" below.

From my perspective:

- ZFS is great for huge amounts of data
That's what it was made for, with 128 bit addressing and a JBOD design in mind. So ZFS is perfect for internet multimedia in terms of scalability.

- ZFS is great for commodity hardware
OK, you should use 24x7 drives, but 2 TB 7200 rpm disks are fine for internet media mass storage. We want huge amounts of data stored, and in the internet age nobody pays for this. So you must use low cost hardware (well, it must be compatible) - but you should not need enterprise components - that's what we have ZFS as clever software for. For mass storage internet services the alternative is NOT EMC or NetApp (remember, nobody pays a lot for the service because you can get it for free at Google) - the alternative is Linux based HW RAID (with its well known limitations) and home grown solutions. Those do not have the nice ZFS features mentioned below.

- ZFS guarantees data integrity by self-healing silent data corruption (that's what the checksums are for)
But only if you have redundancy. There are a lot of posts on the net about when people notice their bad blocks - it happens when a disk in a RAID5 fails and they have to resilver everything. Then you detect the missing redundancy. So people use RAID6 and "hope" that everything works. Or people do scrubs on their advanced RAID controllers (if they provide internal checksumming). The same problem exists for huge, passive, raidz1 data sets in ZFS. If you do not scrub the array regularly, the chances are higher that you hit a bad block during resilvering, and then ZFS cannot help. For active data sets the problem is not as critical, because the checksum is verified on every read - but still - because once a block is in the ARC cache nobody checks it anymore - the problem exists. So we need scrub!

- ZFS can do many other nice things
There's compression, dedup etc. - however, I look at them as "nice to have".

- ZFS needs proper pool design
Using ZFS right is not easy; sizing the system is even more complicated. There are a lot of threads regarding pool design - the easiest answer is "do a lot of mirrors", because then the read performance really scales. However, for internet mass media services you can't - too expensive - because mirrored ZFS is more expensive than HW RAID6 with Linux. How many members per vdev? Multiple pools or a single pool?

- ZFS is open and community based
... well, let's see how this goes with Oracle "financing" the whole stuff :)

And some of those points make ZFS a hit for internet service provider and mass media requirements (VOD etc.)!

So what's my point, you may ask? My experience with ZFS is that some points are simply not addressed well enough yet - BUT - ZFS is a living piece of software, and thanks to the many great people developing it, it evolves faster than all the other storage solutions. So for the longer term I believe ZFS will (hopefully) get all the "enterprise-ness" it needs, and it will revolutionize the storage industry (like Cisco did for networking). I really believe that.

From my perspective, some of the points not addressed well in ZFS yet are:

- Pool defragmentation - you need this for a COW filesystem
I think the ZFS developers are working on this with the background rewriter, so I hope it will come in 2010. With the rewriter, the on-disk layout can be optimized for read performance for sequential workloads - also for raidz1 and raidz2 - meaning ZFS can compete with RAID5 and RAID6, even with wider vdevs. And wider vdevs mean more effective capacity. If the vdev read-ahead cache works nicely with a sequentially aligned on-disk layout, then (from-disk) read performance will be great.

- I/O prioritization for zvols / ZFS filesystems (aka storage QoS)
Unfortunately you cannot prioritize I/O to ZFS filesystems and zvols right now. I think this is one of the features that makes ZFS not suitable for 1st tier storage (like EMC Symmetrix or the NetApp FAS6000 series). You need prioritization here - because your SAP system really is more important than my MP3 web server :)

- Deduplication is not ready for production
Currently dedup is nice, but the DDT handling and memory sizing are tricky and hardly usable for larger pools (my perspective). The DDT is handled like any other component - meaning user I/O can push the DDT out of the ARC (and the L2ARC) - even with primarycache=metadata and secondarycache=metadata. For typical mass media storage applications the working set is much larger than the memory (and L2ARC), meaning your DDT will come from disk - causing real performance degradation. This is especially true for COMSTAR environments with small block sizes (8k etc.); there the DDT becomes really huge. Once it no longer fits into memory - bummer. There is an open bug for this: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566 and I hope it will be addressed soon.
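Just to put rough numbers on why the DDT gets out of hand, here is a back-of-the-envelope sketch. It assumes every block is unique (worst case) and uses the roughly 320 bytes per DDT entry that is commonly quoted on this list - an approximation, not an exact figure:

    # Back-of-the-envelope DDT memory estimate. Assumptions: every block
    # is unique (worst case for dedup) and one DDT entry costs ~320 bytes
    # in core - a commonly quoted approximation, not an exact number.

    def ddt_size_gb(pool_bytes, block_size, bytes_per_entry=320):
        blocks = pool_bytes // block_size            # number of unique blocks
        return blocks * bytes_per_entry / float(1024 ** 3)

    TB = 1024 ** 4
    for bs_kb in (8, 128):
        gb = ddt_size_gb(200 * TB, bs_kb * 1024)
        print("200 TB pool, %3d KB blocks -> DDT needs roughly %.0f GB of core" % (bs_kb, gb))

    # prints ~8000 GB for 8 KB blocks and ~500 GB for 128 KB blocks

So with 8k COMSTAR volumes on a 200 TB pool the table alone would need terabytes of core - no wonder it ends up being read from disk.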
- Scrub implementation
And that's the point mentioned in this thread. Currently you cannot manage scrub very well. In pre-snv_133 builds scrub was very aggressive: when a scrub was running you really had a dramatic performance penalty (more full strokes etc., latency goes up). With post-snv_133 builds scrub is less aggressive - sounds nice - but this makes scrubbing take longer, which is bad if a disk has failed. So either way it is not optimal, and I believe a system cannot automate this process very well. To schedule scrub I/O right, the storage system would have to "predict" the future I/O access pattern, and I believe that is not possible. So scrub must be manageable by the storage user.

For very large pools another problem comes up: you simply cannot scrub a 200 TB++ pool over the weekend - it WILL take longer. However, your users will come in on Monday and they want to work. Currently scrub cannot be prioritized, paused, aborted or resumed. This makes it very difficult to make scrub manageable. If scrub could be prioritized, paused, resumed and aborted, the admin (or management software like NexentaStor) could schedule the scrub according to the user's policy (OK, we scrub weekdays 18:00 - 20:00 at low prio, 20:00 - 04:00 at high prio, and 04:00 - 18:00 we do not scrub at all - BUT if a device is degraded, we resilver with maximum priority).

So I think this feature would be VERY essential, because it makes the nice features of ZFS (checksumming, data integrity) really usable for enterprise use cases AND large data sets. Otherwise the features are "nice - but you can't really use them with more than 24 TB".
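To illustrate what such a policy could look like with only the commands that exist today (zpool scrub, zpool scrub -s, zpool status), here is a minimal sketch. Pool name, window times and the status-text matching are assumptions, and since there is no real pause/resume, "stopping" the scrub outside the window throws away all its progress - which is exactly the problem described above:

    #!/usr/bin/env python
    # Sketch of a policy-driven scrub window for one pool, meant to run
    # from cron every few minutes. Uses only existing commands
    # (zpool scrub, zpool scrub -s, zpool status); because there is no
    # pause/resume, stopping a scrub outside the window loses all progress.

    import subprocess
    from datetime import datetime

    POOL = "tank"                        # hypothetical pool name
    WINDOW_START, WINDOW_END = 18, 4     # scrub allowed from 18:00 to 04:00

    def in_window(hour, start, end):
        if start <= end:
            return start <= hour < end
        return hour >= start or hour < end   # window wraps past midnight

    def main():
        status = subprocess.check_output(["zpool", "status", POOL]).decode()
        scrubbing = "scrub in progress" in status   # wording differs per build
        degraded = "DEGRADED" in status

        if degraded:
            return                       # a resilver is more important - hands off

        if in_window(datetime.now().hour, WINDOW_START, WINDOW_END):
            if not scrubbing:
                subprocess.call(["zpool", "scrub", POOL])
        elif scrubbing:
            # no pause available: -s aborts the scrub and discards its progress
            subprocess.call(["zpool", "scrub", "-s", POOL])

    if __name__ == "__main__":
        main()

A real implementation would also need the priority knob (low prio during office hours, high prio at night), which simply cannot be expressed with the current tools - that is the missing piece.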
Conclusion (in the context of scrub):
- People want 200 TB++ pools for mass media applications
- People will use huge, low cost drives to store huge amounts of data
- People don't use mirrors because 50% effective capacity is not enough
- People NEED to scrub to repair bad blocks, because of the cheap drives with high capacity
--> Scrub / resilver management needs to be improved.

There are (a lot of) open bugs for this:
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6743992
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
...
- http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6888481

So I hope this will be addressed for scrub and resilver.

Regards,
Robert

P.S. Sorry for the rather long writing :)