Jim Dunham wrote:
> This is just one scenario for deploying the 48 disks of the X4500. The
> blog listed below offers another option: by mirroring the bitmaps
> across all available disks, it brings the total disk count back up to
> 46 (or 44, if 2x HSP), leaving the other two for a mirrored root disk.
>
> http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless
>

I know your blog entry, Jim. And I still admire your skill at doing the calculations within shell scripts (I just gave each soft partition 100 megabytes of space, finished ;-) ). But after some thinking, I decided against using a slice on the same disk for the bitmaps. Not because of performance issues, that's not a valid reason. Again, it's the disaster scenarios that make me think - in this case, the complexity of administration.
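For reference, the per-disk bitmap dance looks roughly like this. The device names are invented and the 100 MB is just my generous round number from above, so check the dsbitmap output instead of trusting my figures:

  # How large does the SNDR bitmap for this data slice have to be?
  dsbitmap -r /dev/rdsk/c1t0d0s0

  # Carve a 100 MB soft partition for the bitmap out of a dedicated slice
  metainit d101 -p c0t1d0s7 100m

  # Enable the replication set: data volume plus bitmap on both hosts
  sndradm -n -e thumper1 /dev/rdsk/c1t0d0s0 /dev/md/rdsk/d101 \
                thumper2 /dev/rdsk/c1t0d0s0 /dev/md/rdsk/d101 ip async

Now multiply that by 46 disks and imagine an operator having to redo the metainit part at 3 a.m. after a disk replacement - that's the administrative complexity I mean.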
You know, the x64 Solaris boxes are basically competing against Linux boxes all day. The X4500 is a very attractive replacement for the typical Linux file server, which consists of a server, a hardware RAID controller and several cheap and stupid fibre-channeled SATA JBODs for less than $5,000 each. Double that to have a cluster. In our case, the X4500 is competing against more than 60 of those clusters with a total of 360 JBODs.

The X4500's main advantage isn't the price per gigabyte (the price is exactly the same!), as most members of the sales department may expect; the real advantage is the gigabytes per rack unit. But there are several disadvantages, for instance: not being able to access the hard drives from the front and needing a ladder and a screwdriver instead. Or, most important for the typical data center, the *operator* is not able to replace a disk the way he's used to: pull the old disk out, put the new disk in, resync starts, finished. You'll always have to wait until the next morning, when a Solaris administrator is available again (which may impact your high-availability concepts), or keep a Solaris administrator in the company 24/7 (which raises the TCO of the Solaris boxes).

Well, and what I want to say: if you place the bitmap volume on the same disk, this situation gets even worse. The problem is the involvement of SVM. Having to build the soft partition again makes the handling even more complex and the case harder for operators to deal with. It's the best way to make sure that the disk will be replaced, but not added to the zpool during the night - and replacing it during regular working hours isn't an option either, because syncing 500 GB over a 1 GBit/s interface during daytime just isn't possible without putting the guaranteed service times at risk. Having to take care of soft partitions just isn't idiot-proof enough. And *poof*, there's a good chance the TCO of an X4500 is considered too high.

>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB
>> will be rebuilt. These 500 GB are synced over a single 1 GBit/s
>> crossover cable. This takes a bit of time and is 100% unnecessary
>
> But it is necessary! As soon as the HSP disk kicks in, not only is the
> disk being rebuilt by ZFS, but newly allocated ZFS data will also
> be written to this HSP disk. So although it may appear that there
> is wasted replication cost (of which there is), the instant that ZFS
> writes new data to this HSP disk, the old replicated disk is instantly
> inconsistent, and there is no means to fix it.

It's necessary from your point of view, Jim. But not in the minds of the customers. Even worse, it could be considered a design flaw - not in AVS, but in ZFS.

Just have a look at how the usual Linux dude works. He doesn't use AVS, he uses a kernel module called DRBD. It does basically the same thing: it replicates one raw device to another over a network interface, like AVS does. But the Linux dude has one advantage: he doesn't have ZFS. Yes, as impossible as it may sound, it is an advantage. Why? Because he never has to mirror 40 or 46 devices, because his lame file systems depend on a hardware RAID controller! Same goes for UFS, of course. There's only ONE replicated device, no matter how many disks are involved. And so it's definitely NOT necessary to sync a disk when a HSP kicks in, because the disk failure will never be reported to the host; it's handled by the RAID controller. As a result, no replication will take place, because AVS simply isn't involved.
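To illustrate what I mean, a typical DRBD setup on such a Linux box has exactly one resource, sitting on top of the single LUN exported by the RAID controller. From memory, and with invented hostnames, addresses and devices, it looks something like this:

  resource r0 {
    protocol C;
    on linuxbox1 {
      device    /dev/drbd0;
      disk      /dev/sda1;       # the one LUN the RAID controller exports
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on linuxbox2 {
      device    /dev/drbd0;
      disk      /dev/sda1;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }

One resource, one device, one thing to resync. A dead disk behind /dev/sda1 stays the RAID controller's problem, and DRBD never even notices it.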
We even tried to deploy ZFS on top of SVM RAID 5 stripes to get rid of this problem, only to learn how badly the RAID 5 performance of SVM sucks ... a cluster of six USB sticks was faster than the Thumpers. I consider this a big design flaw of ZFS. I'm not very familiar with the code, but I still have hope that there'll be a parameter which allows getting rid of the cache flushes.

ZFS, and the X4500, are typical examples of different departments not really working together: they have a wonderful file system, but there's no storage that supports it. Or a great X4500, an 11-24 TB file server for $40,000, but no options to make it highly available like the $1,000 boxes. AVS is, in my opinion, clearly one of the components which suffers from this. The Sun marketing and Jonathan still have a long way to go. But, on the other hand, difficult customers like me and my company are always happy to point out some difficulties and to help resolve them :-)

> For all that is good (or bad) about AVS, the fact that it works by
> simply interposing itself on the Solaris I/O data path is great, as it
> works with any Solaris block storage. Of course this also means that
> it has no filesystem, database or hot-spare knowledge, which means
> that at times AVS will be inefficient at what it does.
>

I don't think that there's a problem with AVS and its concepts. In my opinion, ZFS has to do the homework. At least it should be aware of the fact that AVS is involved. Or has been, when it comes to recovering data from a zpool - simply saying "the disks belong exclusively to the local ZFS, and no other mechanisms can write onto the disks, so let's panic and lose all the terabytes of important data" just isn't valid. It may be easy and comfortable for the ZFS development department, but it doesn't reflect the real world - and not even Sun's software portfolio. The AVS integration into Nevada makes this even worse, and I hope there'll be something like fsck in the future, something which allows me to recover the files with correct checksums from a zpool, instead of simply hearing the sales droids repeat "There can't be any errors, NEVER!" over and over again :-)

>
>> - and
>> it will become much worse in the future, because the disk capacities
>> rocket up into the sky, while the performance isn't improved as much.
>
> Larger disk capacities are no worse in this scenario than they are
> with controller-based replication, ZFS send / receive, etc. Actually
> it is quite efficient. If the disk that failed was only 5% full, when
> the HSP disk is switched in and being rebuilt, only 5% of the entire
> disk will have to be replicated. If, at the time ZFS and AVS were
> deployed on this server, the HSP disks (containing uninitialized data)
> were also configured as equal with "sndradm -E ...", then there would
> be no initial replication cost, and when swapped into use, only the
> cost of replicating the actual in-use ZFS data.

That's interesting. Because, together with your "data and bitmap volume on the same disk" scenario, the bitmap volume would be lost. A full sync of the disk would be necessary then, even if only 5% are in use. Am I correct?

>
>> During this time, your service misses redundancy.
>
> Absolutely not. If all of the ZFS in-use and ZFS HSP disks are
> configured under AVS, there is never a time of lost redundancy.
>

I'm sure there is, as soon as a disk crashes in the secondary and the primary disk is in logging mode for several hours.
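By the way, for anyone following along: the "sndradm -E ..." Jim mentions enables a set without an initial full sync, because both sides are declared as already identical. For a hot spare that has never held data, that would look something like this (device names invented again, so it's a sketch, not a recipe):

  # HSP disk: both sides contain only uninitialized data, so enable the
  # set with -E and declare them equal. Only blocks that ZFS writes to
  # the spare later on will ever be replicated.
  sndradm -n -E thumper1 /dev/rdsk/c5t4d0s0 /dev/md/rdsk/d120 \
                thumper2 /dev/rdsk/c5t4d0s0 /dev/md/rdsk/d120 ip async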
And back to the redundancy question: I bet you'll lose your HA as soon as the primary crashes before the secondary is in sync again, because the global ZFS metadata wasn't logged, but updated. I think to avoid this, the primary would have to send the entire replication group into logging mode - but then it would get even worse, because you'll lose your redundancy for days until the secondary is 100% in sync again and the regular replicating state becomes active (a full sync of an X4500 takes at least 5 days, and that's only if you don't have Sun Cluster with exclusive interconnect interfaces up and running).

Linux/DRBD: some data will be missing and you'll have fun fsck'ing for two hours. ZFS: the secondary is not consistent, the zpool is FAULTED, all data is lost, you have a downtime while recovering from backup tapes, plus a week with reduced redundancy because of the time needed for resyncing the restored data. You want three cluster nodes in most deployment scenarios, not just two, believe me ;-)

It doesn't matter much if you only host some easy-to-restore videos. But I'm talking about file servers which host several billion inodes, like the file servers which host the mail headers, bodies and attachments of a million Yahoo users, a terabyte of moving data each day which cannot be backed up to tape.

>> And we're not talking
>> about some minutes during this time. Well, and now try to imagine what
>> will happen if another disk fails during this rebuild, this time in the
>> secondary ...
>
> If I was truly counting on AVS, I would be glad this happened! Getting
> replication configured right, be it AVS or some other option, means
> that when disks, systems, networks, etc., fail, there is always a
> period of degraded system performance, but it is better than no system
> performance.
>

That's correct. But don't forget that it's always a very small step from "degraded" to "faulted". In particular when it comes to high-availability scenarios in data centers, because in such scenarios you'll always have to rely on other people with less know-how and motivation. It's easy to accept a degraded state as long as you're in your office. But with an X4500, your degraded state may potentially last longer than a weekend, and when you're directly responsible for the mail of millions of users and you know that any non-availability will place your name on Slashdot (or the name of your CEO, which equals placing your head on a scaffold), I'm sure you'll think twice about using ZFS with AVS or letting the Linux dudes continue to play with their inefficient boxes :-)

> But if a disaster happened on the primary node, and a decision was
> made to ZFS import the storage pool on the secondary, ZFS will detect
> the inconsistency, mark the drive as failed, and swap in the secondary
> HSP disk. Later, when the primary site comes back, and a reverse
> synchronization is done to restore writes that happened on the
> secondary, the primary ZFS file system will become aware that a HSP
> swap occurred, and continue on right where the secondary node left off.

I'll try that as soon as I have a chance again (which means: as soon as Sun gets the Sun Cluster working on an X4500).

>> c) You *must* force every single `zpool import <zpool>` on the secondary
>> host. Always.
>
> Correct, but this is the case even without AVS! If one configured ZFS
> on SAN based storage and the primary node crashed, one would need to
> force every single `zpool import <zpool>`. This is not an AVS issue, but
> a ZFS protection.

Right. Too bad ZFS reacts this way.
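For completeness, the takeover on the secondary boils down to something like this. The pool name is invented, and whether a bare "sndradm -n -l" covers all of your sets depends on your configuration, so treat it as a sketch:

  # 1. Put the replication sets on the secondary into logging mode first -
  #    the step that is so easy to forget.
  sndradm -n -l

  # 2. The import always has to be forced, because the pool was last
  #    active on the (now dead or demoted) primary host.
  zpool import -f tank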
I have to admit that you made me nervous once, when you wrote that forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

> Correct, but this is the case even without AVS! Take the same SAN
> based storage scenario above, go to a secondary system on your SAN,
> and force every single `zpool import <zpool>`.
>

Yes, but on a SAN I don't have to worry about zpool inconsistency, because the zpool always resides on the same devices.

> In the case of a SAN, where the same physical disk would be written to
> by both hosts, you would likely get complete data loss, but with AVS,
> where ZFS is actually on two physical disks, and AVS is tracking
> writes, even if they are inconsistent writes, AVS can and will recover
> if an update sync is done.

My problem is that there's no ZFS mechanism which allows me to verify the zpool's consistency before I actually try to import it. Like I said before: AVS does it right, it's just ZFS that doesn't (otherwise it wouldn't make sense to discuss it on this mailing list anyway :-) ). It would really help me with AVS if there were something like "zpool check <zpool>", something for checking a zpool before an import. I could set up a cronjob which puts the secondary host into logging mode, runs a "zpool check" and continues with the replication a few hours afterwards. It would let me sleep better and I wouldn't have to pray to the IT gods before an import.

You know, I saw literally *hundreds* of kernel panics during my tests, and that made me nervous. I have scripts which do the job now, but I saw the risks and the things which can go wrong if someone else without my experience does it (like the infamous "forgetting to manually place the secondary in logging mode before trying to import a zpool").

> You are quite correct in that although ZFS is intuitively easy to
> use, AVS is painfully complex. Of course the mindsets of AVS and ZFS
> are as distant from each other as they are in the alphabet. :-O
>

AVS was easy to learn and isn't very difficult to work with. All you need is 1 or 2 months of testing experience. Very easy with UFS.

> With AVS in Nevada, there is now an opportunity for leveraging the
> ease of use of ZFS, with AVS. Being also the iSCSI Target project
> lead, I see a lot of value in the ZFS option "set shareiscsi=on", to
> get end users into using iSCSI.
>

Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. The two existing slots are already needed for the Sun Cluster interconnect. I think iSCSI won't be a real option unless the servers are shipped with it onboard, like it has been done in the past with the SCSI or Ethernet interfaces.

> I would like to see "set replication=AVS:<secondary host>",
> configuring a locally named ZFS storage pool to the same named pair on
> some remote host. Starting down this path would afford things like ZFS
> replication monitoring, similar to what ZFS does with each of its own
> vdevs.

Yes! Jim, I think we'll become friends :-)

Who do I have to send the bribe money to?

-- 
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss