Jim Dunham wrote:
> This is just one scenario for deploying the 48 disks of x4500. The 
> blog listed below offers another option, by mirroring the bitmaps 
> across all available disks, bringing the total disk count back up to 46
> (or 44, if 2x HSP), leaving the other two for a mirrored root disk.  
> http://blogs.sun.com/AVS/entry/avs_and_zfs_seamless
>
I know your blog entry, Jim. And I still admire your skills in doing 
calculations within shell scripts (I just gave each soft partition 100 
megabytes of space, finished ;-) ). But after some thinking, I decided 
against using a slice on the same disk for the bitmaps. Not because of 
performance issues, that alone wouldn't be a valid reason. Again, it's 
the disaster scenarios that make me think twice, in this case the 
complexity of administration.

You know, the x64 Solaris boxes are basically competing against Linux 
boxes all day. The X4500 is a very attractive replacement for the typical 
Linux file server, consisting of a server, a hardware RAID controller 
and several cheap and stupid fibre-channeled SATA JBODs for less than 
$5,000 each. Double this to have a cluster. In our case, the X4500 is 
competing against more than 60 of those clusters with a total of 360 
JBODs. The X4500's main advantage isn't the price per gigabyte (the 
price is exactly the same!), as most members of the sales department 
might expect; the real advantage is the gigabytes per rack unit.

But there are several disadvantages, for instance: not being able to 
access the hard drives from the front and needing a ladder and a 
screwdriver instead. Or, most importantly for the typical data center, 
the *operator* is not able to replace a disk the way he's used to: pull 
the old disk out, put the new disk in, resync starts, finished. You'll 
always have to wait until the next morning, until a Solaris 
administrator is available again (which may impact your high 
availability concepts), or keep a Solaris administrator on call 24/7 
(which raises the TCO of the Solaris boxes).
Well, and what I want to say: if you place the bitmap volume on the same 
disk, this situation gets even worse. The problem is the involvement of 
SVM. Having to rebuild the soft partition makes the handling even more 
complex and the case harder for operators to deal with. It's the best 
way to make sure that the disk will be replaced, but not added to the 
zpool, during the night - and replacing it during regular working hours 
isn't an option either, because syncing 500 GB over a 1 GBit/s interface 
during daytime just isn't possible without putting the guaranteed 
service times at risk. Having to take care of soft partitions just isn't 
idiot-proof enough. And *poof*, there's a good chance the TCO of an 
X4500 is considered too high.
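
Just to illustrate what an operator would be facing, here's roughly the
per-disk setup that has to be repeated after every disk replacement when
the bitmap lives in a soft partition on the data disk itself (the host
names, device names and metadevice numbers below are made up):

  metainit d101 -p c1t0d0s7 100m    # 100 MB soft partition for the AVS bitmap

  # re-enable the SNDR set: primary host/device/bitmap, then the same
  # triple for the secondary
  sndradm -e thumper1 /dev/rdsk/c1t0d0s0 /dev/md/rdsk/d101 \
             thumper2 /dev/rdsk/c1t0d0s0 /dev/md/rdsk/d101 ip async

  # and only then: zpool replace tank c1t0d0

Multiply that by 46 data disks and try explaining it to a night-shift
operator who is used to "pull disk, push disk, done".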

>> a) A disk in the primary fails. What happens? A HSP jumps in and 500 GB
>> will be rebuilt. These 500 GB are synced over a single 1 GBit/s
>> crossover cable. This takes a bit of time and is 100% unnecessary
>
>
> But it is necessary! As soon as the HSP disk kicks in, not only is the 
> disk being rebuilt by ZFS, but newly allocated ZFS data will also be 
> written to this HSP disk. So although it may appear that there is 
> wasted replication cost (of which there is), the instant that ZFS 
> writes new data to this HSP disk, the old replicated disk is instantly 
> inconsistent, and there is no means to fix it.
It's necessary from your point of view, Jim. But not in the minds of the 
customers. Even worse, it could be considered a design flaw - not in 
AVS, but in ZFS.

Just have a look at how the usual Linux dude works. He doesn't use AVS, 
he uses a kernel module called DRBD. It does basically the same thing: 
it replicates one raw device to another over a network interface, like 
AVS does. But the Linux dude has one advantage: he doesn't have ZFS. 
Yes, as impossible as it may sound, it is an advantage. Why? Because he 
never has to mirror 40 or 46 devices, because his lame file systems 
depend on a hardware RAID controller! Same goes for UFS, of course. 
There's only ONE replicated device, no matter how many disks are 
involved.
And so, it's definitely NOT necessary to sync a disk when an HSP kicks 
in, because the disk failure will never be reported to the host; it's 
handled by the RAID controller. As a result, no replication will take 
place, because AVS simply isn't involved. We even tried to deploy ZFS 
on top of SVM RAID-5 stripes to get rid of this problem, just to learn 
how much the RAID-5 performance of SVM sucks ... a cluster of six USB 
sticks was faster than the Thumpers.
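
For reference, what we tried looked roughly like this (hypothetical
device and pool names); a single SVM RAID-5 metadevice underneath ZFS
means AVS only has to replicate one device instead of one per disk, but
the write performance was unusable:

  metadb -a -f -c 3 c0t0d0s7                            # SVM state database replicas
  metainit d50 -r c1t0d0s0 c2t0d0s0 c3t0d0s0 c4t0d0s0   # one RAID-5 metadevice across the disks
  zpool create tank /dev/md/dsk/d50                     # ZFS on top of the single metadevice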


I consider this a big design flaw of ZFS. I'm not very familiar with the 
code, but I still have hope that there'll be a parameter which allows me 
to get rid of the cache flushes. ZFS, and the X4500, are typical 
examples of different departments not really working together, e.g. they 
have a wonderful file system, but there is no storage which supports it. 
Or a great X4500, an 11-24 TB file server for $40,000, but no options to 
make it highly available like the $1,000 boxes. AVS is, in my opinion, 
clearly one of the components which suffers from this. The Sun marketing 
department and Jonathan still have a long way to go. But, on the other 
hand, difficult customers like me and my company are always happy to 
point out some difficulties and to help resolve them :-)

> For all that is good (or bad) about AVS, the fact that it works by 
> simply interposing itself on the Solaris I/O data path is great, as it 
> works with any Solaris block storage. Of course this also means that 
> it has no filesystem, database or hot-spare knowledge, which means 
> that at times AVS will be inefficient at what it does.
>
I don't think that there's a problem with AVS and its concepts. In my 
opinion, ZFS has to do its homework. At least it should be aware of the 
fact that AVS is involved. Or has been involved, when it comes to 
recovering data from a zpool - simply saying "the disks belong 
exclusively to the local ZFS, and no other mechanisms can write onto the 
disks, so let's panic and lose all the terabytes of important data" just 
isn't valid. It may be easy and comfortable for the ZFS development 
department, but it doesn't reflect the real world - and not even Sun's 
own software portfolio. The AVS integration into Nevada makes this even 
worse, and I hope there'll be something like fsck in the future, 
something which allows me to recover the files with correct checksums 
from a zpool, instead of simply hearing the sales droids repeat "There 
can't be any errors, NEVER!" over and over again :-)

>
>> - and
>> it will become much worse in the future, because the disk capacities
>> rocket up into the sky, while the performance isn't improved as much.
>
> Larger disk capacities are no worse in this scenario than they are 
> with controller-based replication, ZFS send / receive, etc. Actually 
> it is quite efficient. If the disk that failed was only 5% full, when 
> the HSP disk is switched in and being rebuilt, only 5% of the entire 
> disk will have to be replicated. If, at the time ZFS and AVS were 
> deployed on this server, the HSP disks (containing uninitialized data) 
> were also configured as equal with "sndradm -E ...", then there would 
> be no initial replication cost, and when swapped into use, only the 
> cost of replicating the ZFS data actually in use.
That's interesting. Because, together with your "data and bitmap volume 
on the same disk" scenario, the bitmap volume would be lost as well. A 
full sync of the disk would be necessary then, even if only 5% is in 
use. Am I correct?
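
For anyone following along, the difference Jim describes is just the
enable flag (the host names and devices below are made up): "-E" declares
the two halves of a set as already equal, so no initial sync is charged
for the still-unused hot spares, while "-e" marks the whole volume for a
full sync.

  # enable a hot-spare disk as "already equal" (no initial full sync):
  sndradm -E thumper1 /dev/rdsk/c5t7d0s0 /dev/md/rdsk/d147 \
             thumper2 /dev/rdsk/c5t7d0s0 /dev/md/rdsk/d147 ip async

  # a set whose bitmap died together with the disk would presumably have
  # to be re-enabled with -e and fully synchronized with "sndradm -m",
  # which is exactly the question above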

>
>> During this time, your service misses redundancy.
>
> Absolutely not. If all of the ZFS in-use and ZFS HSP disks are 
> configured under AVS, there is never a time of lost redundancy.
>
I'm sure there is, as soon as a disk crashes in the secondary and the 
corresponding primary disk is in logging mode for several hours. I bet 
you'll lose your HA as soon as the primary crashes before the secondary 
is in sync again, because the global ZFS metadata weren't logged, but 
updated. I think to avoid this, the primary would have to send the 
entire replication group into logging mode - but then it would get even 
worse, because you'll lose your redundancy for days until the secondary 
is 100% in sync again and the regular replicating state becomes active 
(a full sync of an X4500 takes at least 5 days, and that's only if you 
don't have Sun Cluster with exclusive interconnect interfaces up and 
running).
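
For what it's worth, these are the AVS commands I use to see where each
set stands during such a window (the group name is made up, and I believe
dsstat takes an iostat-like interval argument, but check the man page):

  sndradm -g zpoolgrp -P      # brief state of each set: logging, syncing or replicating
  dsstat -m sndr 5            # sync progress per set, refreshed every 5 seconds
  sndradm -n -g zpoolgrp -l   # drop the whole group into logging mode, no confirmation prompt
  sndradm -n -g zpoolgrp -u   # update sync: copy only the blocks marked dirty in the bitmaps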

Linux/DRBD: Some data will be missing and you'll have fun fsck'ing for 
two hours.
ZFS: The secondary is not consistent, the zpool is FAULTED, all data is 
lost, you have a downtime while recovering from backup tapes, plus a 
week with reduced redundancy because of the time needed for resyncing 
the restored data. You want three cluster nodes in most deployment 
scenarios, not just two, believe me ;-) It doesn't matter much if you 
only host a few easy-to-restore videos. But I'm talking about file 
servers which host several billion inodes, like the file servers which 
host the mail headers, bodies and attachments for a million Yahoo users, 
a terabyte of changing data each day which cannot be backed up to tape.

>> And we're not talking
>> about some minutes during this time. Well, and now try to imagine what
>> will happen if another disks fails during this rebuild, this time in the
>> secondary ...
>
> If I was truly counting on AVS, I would be glad this happened! Getting 
> replication configured right, be it AVS or some other option, means 
> that when disks, systems, networks, etc., fail, there is always a 
> period of degraded system performance, but it is better than no system 
> performance.
>
That's correct. But don't forget that it's always a very small step from 
"degraded" to "faulted", in particular when it comes to high 
availability scenarios in data centers, because in such scenarios you'll 
always have to rely on other people with less know-how and motivation. 
It's easy to accept a degraded state as long as you're in your office. 
But with an X4500, your degraded state may potentially last longer than 
a weekend, and when you're directly responsible for the mail of millions 
of users and you know that any outage will place your name on Slashdot 
(or the name of your CEO, which equals placing your head on a scaffold), 
I'm sure you'll think twice about using ZFS with AVS or just letting the 
Linux dudes continue to play with their inefficient boxes :-)

> But if a disaster happened on the primary node, and a decision was 
> made to ZFS import the storage pool on the secondary, ZFS will detect 
> the inconsistency, mark the drive as failed, swap in the secondary HSP 
> disk. Later, when the primary site comes back, and a reverse 
> synchronization is done to restore writes that happened on the 
> secondary, the primary ZFS file system will become aware that a HSP 
> swap occurred, and continue on right where the secondary node left off.
I'll try that as soon as I have a chance again (which means: as soon as 
Sun gets the Sun Cluster working on an X4500).

>> c) You *must* force every single `zpool import <zpool>` on the secondary
>> host. Always.
>
> Correct, but this is the case even without AVS! If one configured ZFS 
> on SAN-based storage and the primary node crashed, one would need to 
> force every single `zpool import <zpool>`. This is not an AVS issue, 
> but a ZFS protection.
Right. Too bad ZFS reacts this way.
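
For the record, a takeover on the secondary always looks something like
this here (the group and pool names are made up):

  sndradm -n -g zpoolgrp -l   # make sure the replicator is no longer writing to the volumes
  zpool import -f tank        # the -f is mandatory, the pool was last in use on the primary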

I have to admit that you made me nervous once, when you wrote that 
forcing zpool imports would be a bad idea ...

[X] Zfsck now! Let's organize a petition. :-)

> Correct, but this is the case even without AVS! Take the same 
> SAN-based storage scenario above, go to a secondary system on your 
> SAN, and force every single `zpool import <zpool>`.
>
Yes, but on a SAN, I don't have to worry about zpool inconsistency, 
because the zpool always resides on the same devices.

> In the case of a SAN, where the same physical disk would be written to 
> by both hosts, you would likely get complete data loss, but with AVS, 
> where ZFS is actually on two physical disks, and AVS is tracking 
> writes, even if they are inconsistent writes, AVS can and will recover 
> if an update sync is done.
My problem is that there's no ZFS mechanism which allows me to verify 
the zpool's consistency before I actually try to import it. Like I said 
before: AVS does it right, just ZFS doesn't (and otherwise it wouldn't 
make sense to discuss it on this mailing list anyway :-) ).

It could really help me with AVS if there was something like "zpool 
check <zpool>", something for checking a zpool before an import. I could 
set up a cron job which puts the secondary host into logging mode, runs 
a "zpool check" and continues with the replication a few hours 
afterwards. It would let me sleep better and I wouldn't have to pray to 
the IT gods before an import. You know, I saw literally *hundreds* of 
kernel panics during my tests, and that made me nervous. I have scripts 
which do the job now, but I saw the risks and the things which can go 
wrong if someone else without my experience does it (like the infamous 
"forgetting to manually place the secondary in logging mode before 
trying to import a zpool").

> You are quite correct in that although ZFS is intuitively easy to 
> use, AVS is painfully complex. Of course the mindsets of AVS and ZFS 
> are as far apart as they are in the alphabet. :-O
>
AVS was easy to learn and isn't very difficult to work with. All you 
need is 1 or 2 months of testing experience. Very easy with UFS.

> With AVS in Nevada, there is now an opportunity for leveraging the 
> ease of use of ZFS, with AVS. Being also the iSCSI Target project 
> lead, I see a lot of value in the ZFS option "set shareiscsi=on", to 
> get end users in using iSCSI.
>
Too bad the X4500 has too few PCI slots to consider buying iSCSI cards. 
The two existing slots are already needed for the Sun Cluster 
interconnect. I think iSCSI won't be a real option unless the servers 
are shipped with it onboard, as has been done in the past with the SCSI 
and Ethernet interfaces.
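
(For completeness, the shareiscsi option Jim mentions really is just a
one-liner on a zvol; the names below are made up:)

  zfs create -V 100g tank/iscsivol      # a zvol to export
  zfs set shareiscsi=on tank/iscsivol   # offer the zvol as an iSCSI target
  iscsitadm list target                 # verify with the iSCSI target admin CLI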

> I would like to see "set replication=AVS:<secondary host>", 
> configuring a locally named ZFS storage pool to the same named pair on 
> some remote host. Starting down this path would afford things like ZFS 
> replication monitoring, similar to what ZFS does with each of its own 
> vdevs.
Yes! Jim, I think we'll become friends :-) Who do I have to send the 
bribe money to?

-- 

Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963 
[EMAIL PROTECTED] - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
Aufsichtsratsvorsitzender: Michael Scheeren
