On 01/26/2016 02:59 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
>>> Hi,
>>>   I've been looking at what's needed to add a new secondary after
>>> a primary failed; from the block side it doesn't look as hard
>>> as I'd expected, perhaps you can tell me if I'm missing something!
>>>
>>> The normal primary setup is:
>>>
>>>    quorum
>>>      Real disk
>>>      nbd client
>>
>> quorum
>>   real disk
>>   replication
>>     nbd client
>>
>>> The normal secondary setup is:
>>>
>>>    replication
>>>      active-disk
>>>        hidden-disk
>>>          Real-disk
>>
>> IIRC, we can do it like this:
>>
>> quorum
>>   replication
>>     active-disk
>>       hidden-disk
>>         real-disk
>
> Yes.
>
>>> With a couple of minor code hacks; I changed the secondary to be:
>>>
>>>    quorum
>>>      replication
>>>        active-disk
>>>          hidden-disk
>>>            Real-disk
>>>      dummy-disk
>>
>> After failover:
>>
>> quorum
>>   replication (old, mode is secondary)
>>     active-disk
>>       hidden-disk*
>>         real-disk*
>>   replication (new, mode is primary)
>>     nbd-client
>
> Do you need to keep the old secondary-replication?
> Does that just pass straight through?
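For readers following along, the "normal primary" tree above can be written out as a command-line/HMP fragment. This is a hedged sketch based on the drive syntax used by the in-progress COLO block-replication series at the time; the file name, host, port, and IDs (1.raw, ibpair, 8889, colo-disk0) are illustrative assumptions, not taken from this thread.

```shell
# Hedged sketch, not a definitive invocation: the primary starts with a
# quorum whose only child is the real disk; the replication/NBD child is
# attached later from the monitor, once the secondary's NBD server is up.
#
#   qemu-system-x86_64 ... \
#     -drive if=virtio,id=colo-disk0,driver=quorum,read-pattern=fifo,\
#       vote-threshold=1,\
#       children.0.file.filename=1.raw,children.0.driver=raw
#
# Then, in the HMP monitor (same command shape as used later in this
# thread):
#
#   drive_add buddy driver=replication,mode=primary,file.driver=nbd,\
#     file.host=ibpair,file.port=8889,file.export=colo-disk0,\
#     node-name=nbd-client,if=none,cache=none
#   x_block_change colo-disk0 -a nbd-client
```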
Yes, the old secondary replication can still work in the new mode. For
example, if we don't start COLO again after failover, we do nothing.

>
>> In the newest version, we do an active commit from active-disk to
>> real-disk. So it will be:
>>
>> quorum
>>   replication (old, mode is secondary)
>>     active-disk (it is the real disk now)
>>   replication (new, mode is primary)
>>     nbd-client
>
> How does that active-commit work? I didn't think you
> could change the real disk until you had the full checkpoint,
> since you don't know whether the primary's or secondary's
> changes need to be written?

I start the active commit when doing failover. After failover, the
primary's changes since the last checkpoint should be dropped (how do we
cancel the in-progress write ops?).

>
>>> and then after the primary fails, I start a new secondary
>>> on another host and then on the old secondary do:
>>>
>>> nbd_server_stop
>>> stop
>>> x_block_change top-quorum -d children.0   # deletes use of real disk, leaves dummy
>>> drive_del active-disk0
>>> x_block_change top-quorum -a node-real-disk
>>> x_block_change top-quorum -d children.1   # Seems to have deleted the dummy?! the disk is now child 0
>>> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
>>> x_block_change top-quorum -a nbd-client
>>> c
>>> migrate_set_capability x-colo on
>>> migrate -d -b tcp:ibpair:8888
>>>
>>> and I think that means what was the secondary has the same disk
>>> structure as a normal primary.
>>> That's not quite happy yet, and I've not figured out why - but the
>>> order/structure of the block devices looks right?
>>>
>>> Notes:
>>>   a) The dummy serves two purposes: 1) it works around the segfault
>>>      I reported in the other mail; 2) when I delete the real disk in the
>>>      first x_block_change it means the quorum still has 1 disk so doesn't
>>>      get upset.
>>
>> I don't understand purpose 2.
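As a sketch of the secondary-side failover path described above (the active commit starting when failover is declared): the command names here are taken from the in-progress COLO series of that era and are an assumption; they may differ in the merged version.

```shell
# Hedged sketch: secondary-side failover as described above.
# (Command names nbd_server_stop / x_colo_lost_heartbeat are assumptions
# from the in-progress COLO series, not confirmed by this thread.)
#
# In the secondary's HMP monitor:
#
#   nbd_server_stop          # stop exporting the real disk to the failed primary
#   x_colo_lost_heartbeat    # declare failover; the replication driver then
#                            # starts the active commit of active-disk down
#                            # into the real disk
```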
>
> quorum won't allow you to delete all its members ('The number of children
> cannot be lower than the vote threshold 1'),
> and it's very tricky getting the order correct with add/delete; for example
> I tried:
>
> drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
> # gets children.1
> x_block_change top-quorum -a nbd-client
> # deletes the secondary replication
> x_block_change top-quorum -d children.0
> drive_del active-disk0

active-disk0 contains some data, so you should not delete it. If we do
the active commit after failover, active-disk0 becomes the real disk.

> # ends up as children.0 but in the 2nd slot
> x_block_change top-quorum -a node-real-disk
>
> info block shows me:
> top-quorum (#block615): json:{"children": [
>     {"driver": "replication", "mode": "primary", "file": {"port": "8889",
>      "host": "ibpair", "driver": "nbd", "export": "colo-disk0"}},
>     {"driver": "raw", "file": {"driver": "file", "filename":
>      "/home/localvms/bugzilla.raw"}}],
>   "driver": "quorum", "blkverify": false, "rewrite-corrupted": false,
>   "vote-threshold": 1} (quorum)
>     Cache mode: writeback
>
> that has the replication first and the file second; that's the opposite
> from the normal primary startup - does it matter?

It is OK: reading from children.0 (the replication driver) always fails,
so quorum will fall back and read the data from children.1.

>
> I can't add node-real-disk until I drive_del active-disk0 (which
> previously used it); and I can't drive_del until I remove
> it from the quorum; but I can't remove that from the quorum first,
> because that leaves an empty quorum.
>
>>> b) I had to remove the restriction in quorum_start_replication
>>>    on which mode it would run in.
>>
>> IIRC, this check will be removed.
>>
>>> c) I'm not really sure everything knows it's in secondary mode yet, and
>>>    I'm not convinced whether the replication is doing the right thing.
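The child-naming behaviour observed above (a new child reuses the lowest free index but is appended at the end of the list, so the numeric names and the list order can disagree) can be modelled with a few lines of shell. This is an illustration of the behaviour as described in this thread, not QEMU's actual code.

```shell
# Model of the observed quorum child naming: each child keeps its
# "children.N" name for life; deleting a child frees its index; a newly
# added child takes the lowest free index but goes to the end of the list.
children=("children.0" "children.1")

delete_child() {          # remove by name, preserving the order of the rest
  local keep=() c
  for c in "${children[@]}"; do
    [ "$c" = "$1" ] || keep+=("$c")
  done
  children=("${keep[@]}")
}

add_child() {             # reuse the lowest free index, append at the end
  local i=0
  while printf '%s\n' "${children[@]}" | grep -qx "children.$i"; do
    i=$((i + 1))
  done
  children+=("children.$i")
}

delete_child children.0   # leaves [children.1]
add_child                 # new child is named children.0 but lands in slot 2
echo "${children[@]}"     # prints: children.1 children.0
```

This matches the sequence in the commands above: after deleting children.0 and re-adding a child, the list is [children.1, children.0], which matters because quorum's fifo read pattern tries children in list order.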
>>> d) The migrate -d -b eventually fails on the destination, not worked
>>>    out why yet.
>>
>> Can you give me the error message?
>
> I need to repeat it to check; it was something like a bad flag from the
> block migration code; it happened after the block migration hit 100%.

IIRC, we found some bugs in block migration and fixed them. This may be
a new bug.

>
>>> e) Adding/deleting children on quorum is hard, having to use the
>>>    children.0/1 notation when you've added children using node names -
>>>    it's worrying which number is which; is there a way to give them a name?
>>
>> No. I think we can improve the 'info block' output.
>
> Yes, that would be good; I thought it was the order in the list; but after
> debugging it today I'm not convinced it is; I think it always keeps the same
> name - so for example if you start off with [children.0, children.1]; then
> delete children.0 you now have [children.1]; if you then add a new
> child I *think* that becomes children.0 but you end up with
> [children.1, children.0]

Note that quorum's fifo mode cares about this order. I think it is
better to read from the older child first.

Thanks
Wen Congyang

>
>>> f) I've not thought about the colo-proxy that much yet - I guess that
>>>    existing connections need to keep their sequence number offset but
>>>    new connections made by what is now the primary don't need to do
>>>    anything special.
>>
>> Hailiang or Zhijian can answer this question.
>
> Thanks,
>
>> Thanks
>> Wen Congyang
>>
>>> Dave
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK