On 01/26/2016 02:59 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
>>> Hi,
>>>   I've been looking at what's needed to add a new secondary after
>>> a primary failed; from the block side it doesn't look as hard
>>> as I'd expected, perhaps you can tell me if I'm missing something!
>>>
>>> The normal primary setup is:
>>>
>>>    quorum
>>>      Real disk
>>>      nbd client
>>
>> quorum
>>   real disk
>>   replication
>>     nbd client
>>
>>> The normal secondary setup is:
>>>
>>>    replication
>>>      active-disk
>>>        hidden-disk
>>>          Real-disk
>>
>> IIRC, we can do it like this:
>>
>> quorum
>>   replication
>>     active-disk
>>       hidden-disk
>>         real-disk
>
> Yes.
>
>>> With a couple of minor code hacks; I changed the secondary to be:
>>>
>>>    quorum
>>>      replication
>>>        active-disk
>>>          hidden-disk
>>>            Real-disk
>>>      dummy-disk
>>
>> After failover:
>>
>> quorum
>>   replication (old, mode is secondary)
>>     active-disk
>>       hidden-disk*
>>         real-disk*
>>   replication (new, mode is primary)
>>     nbd-client
>
> Do you need to keep the old secondary-replication?
> Does that just pass straight through?
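For readers following along, the "normal primary" tree above can be written out as a command-line/HMP fragment. This is a hedged sketch based on the drive syntax used by the in-progress COLO block-replication series at the time; the file name, host, port, and IDs (1.raw, ibpair, 8889, colo-disk0) are illustrative assumptions, not taken from this thread.

```shell
# Hedged sketch, not a definitive invocation: the primary starts with a
# quorum whose only child is the real disk; the replication/NBD child is
# attached later from the monitor, once the secondary's NBD server is up.
#
#   qemu-system-x86_64 ... \
#     -drive if=virtio,id=colo-disk0,driver=quorum,read-pattern=fifo,\
#       vote-threshold=1,\
#       children.0.file.filename=1.raw,children.0.driver=raw
#
# Then, in the HMP monitor (same command shape as used later in this
# thread):
#
#   drive_add buddy driver=replication,mode=primary,file.driver=nbd,\
#     file.host=ibpair,file.port=8889,file.export=colo-disk0,\
#     node-name=nbd-client,if=none,cache=none
#   x_block_change colo-disk0 -a nbd-client
```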
Yes, the old secondary replication can still work in the new mode. For
example, if we don't start COLO again after failover, we do nothing.

>
>> In the newest version, we do an active commit from active-disk to
>> real-disk. So it will be:
>>
>> quorum
>>   replication (old, mode is secondary)
>>     active-disk (it is the real disk now)
>>   replication (new, mode is primary)
>>     nbd-client
>
> How does that active-commit work? I didn't think you
> could change the real disk until you had the full checkpoint,
> since you don't know whether the primary's or secondary's
> changes need to be written?

I start the active commit when doing failover. After failover, the
primary's changes since the last checkpoint should be dropped (how do we
cancel the in-progress write ops?).

>
>>> and then after the primary fails, I start a new secondary
>>> on another host and then on the old secondary do:
>>>
>>> nbd_server_stop
>>> stop
>>> x_block_change top-quorum -d children.0   # deletes use of real disk, leaves dummy
>>> drive_del active-disk0
>>> x_block_change top-quorum -a node-real-disk
>>> x_block_change top-quorum -d children.1   # Seems to have deleted the dummy?! the disk is now child 0
>>> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
>>> x_block_change top-quorum -a nbd-client
>>> c
>>> migrate_set_capability x-colo on
>>> migrate -d -b tcp:ibpair:8888
>>>
>>> and I think that means what was the secondary has the same disk
>>> structure as a normal primary.
>>> That's not quite happy yet, and I've not figured out why - but the
>>> order/structure of the block devices looks right?
>>>
>>> Notes:
>>>   a) The dummy serves two purposes: 1) it works around the segfault
>>>      I reported in the other mail; 2) when I delete the real disk in the
>>>      first x_block_change it means the quorum still has 1 disk so doesn't
>>>      get upset.
>>
>> I don't understand purpose 2.
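As a sketch of the secondary-side failover path described above (the active commit starting when failover is declared): the command names here are taken from the in-progress COLO series of that era and are an assumption; they may differ in the merged version.

```shell
# Hedged sketch: secondary-side failover as described above.
# (Command names nbd_server_stop / x_colo_lost_heartbeat are assumptions
# from the in-progress COLO series, not confirmed by this thread.)
#
# In the secondary's HMP monitor:
#
#   nbd_server_stop          # stop exporting the real disk to the failed primary
#   x_colo_lost_heartbeat    # declare failover; the replication driver then
#                            # starts the active commit of active-disk down
#                            # into the real disk
```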
>
> quorum won't allow you to delete all its members ('The number of children
> cannot be lower than the vote threshold 1'),
> and it's very tricky getting the order correct with add/delete; for example
> I tried:
>
> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
> # gets children.1
> x_block_change top-quorum -a nbd-client
> # deletes the secondary replication
> x_block_change top-quorum -d children.0
> drive_del active-disk0

active-disk0 contains some data, so you should not delete it. If we do
the active commit after failover, active-disk0 becomes the real disk.

> # ends up as children.0 but in the 2nd slot
> x_block_change top-quorum -a node-real-disk
>
> info block shows me:
> top-quorum (#block615): json:{"children": [
>     {"driver": "replication", "mode": "primary", "file": {"port": "8889",
>      "host": "ibpair", "driver": "nbd", "export": "colo-disk0"}},
>     {"driver": "raw", "file": {"driver": "file", "filename":
>      "/home/localvms/bugzilla.raw"}}],
>   "driver": "quorum", "blkverify": false, "rewrite-corrupted": false,
>   "vote-threshold": 1} (quorum)
>     Cache mode: writeback
>
> that has the replication first and the file second; that's the opposite
> from the normal primary startup - does it matter?

It is OK: reading from children.0 (the replication driver) always fails,
so quorum will fall back and read the data from children.1.

>
> I can't add node-real-disk until I drive_del active-disk0 (which
> previously used it); and I can't drive_del until I remove
> it from the quorum; but I can't remove that from the quorum first,
> because that leaves an empty quorum.
>
>>> b) I had to remove the restriction in quorum_start_replication
>>>    on which mode it would run in.
>>
>> IIRC, this check will be removed.
>>
>>> c) I'm not really sure everything knows it's in secondary mode yet, and
>>>    I'm not convinced whether the replication is doing the right thing.
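The child-naming behaviour observed above (a new child reuses the lowest free index but is appended at the end of the list, so the numeric names and the list order can disagree) can be modelled with a few lines of shell. This is an illustration of the behaviour as described in this thread, not QEMU's actual code.

```shell
# Model of the observed quorum child naming: each child keeps its
# "children.N" name for life; deleting a child frees its index; a newly
# added child takes the lowest free index but goes to the end of the list.
children=("children.0" "children.1")

delete_child() {          # remove by name, preserving the order of the rest
  local keep=() c
  for c in "${children[@]}"; do
    [ "$c" = "$1" ] || keep+=("$c")
  done
  children=("${keep[@]}")
}

add_child() {             # reuse the lowest free index, append at the end
  local i=0
  while printf '%s\n' "${children[@]}" | grep -qx "children.$i"; do
    i=$((i + 1))
  done
  children+=("children.$i")
}

delete_child children.0   # leaves [children.1]
add_child                 # new child is named children.0 but lands in slot 2
echo "${children[@]}"     # prints: children.1 children.0
```

This matches the sequence in the commands above: after deleting children.0 and re-adding a child, the list is [children.1, children.0], which matters because quorum's fifo read pattern tries children in list order.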
>>> d) The migrate -d -b eventually fails on the destination, not worked
>>>    out why yet.
>>
>> Can you give me the error message?
>
> I need to repeat it to check; it was something like a bad flag from the
> block migration code; it happened after the block migration hit 100%.

IIRC, we found some bugs in block migration and fixed them. This may be
a new bug.

>
>>> e) Adding/deleting children on quorum is hard, having to use the
>>>    children.0/1 notation when you've added children using node names -
>>>    it's worrying which number is which; is there a way to give them a name?
>>
>> No. I think we can improve the 'info block' output.
>
> Yes, that would be good; I thought it was the order in the list; but after
> debugging it today I'm not convinced it is; I think it always keeps the same
> name - so for example if you start off with [children.0, children.1]; then
> delete children.0 you now have [children.1]; if you then add a new
> child I *think* that becomes children.0 but you end up with
> [children.1, children.0]

Note that quorum's fifo mode cares about this order. I think it is
better to read from the older child first.

Thanks
Wen Congyang

>
>>> f) I've not thought about the colo-proxy that much yet - I guess that
>>>    existing connections need to keep their sequence number offset but
>>>    new connections made by what is now the primary don't need to do
>>>    anything special.
>>
>> Hailiang or Zhijian can answer this question.
>
> Thanks,
>
>> Thanks
>> Wen Congyang
>>
>>> Dave
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK