* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
> > +== Failure Handling ==
> > +There are 6 internal errors when block replication is running:
> > +1. I/O error on primary disk
> > +2. Forwarding primary write requests failed
> > +3. Backup failed
> > +4. I/O error on secondary disk
> > +5. I/O error on active disk
> > +6. Making active disk or hidden disk empty failed
> > +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
> > +4 and 6, we just report block replication's error to FT/HA manager (which
> > +decides when to do a new checkpoint, when to do failover).
> > +There is no internal error when doing failover.
>
> Not sure this is true.
>
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication". Flushing the disk
> buffer can result in I/O errors. This means that failover operations
> are not guaranteed to succeed.
>
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk. It means that
> right after failover there is another failure and the system may not be
> able to continue.

Yes, I think that's true.

> So this really only matters in the case where there is a new Secondary
> ready after failover. In that case the user might expect failover to
> continue to the new Secondary (Host 3):
>
>    [X]        [X]
>   Host 1 <-> Host 2 <-> Host 3

Since COLO is just doing a 1+1 redundancy, I think it's not expecting to
cope with a double host failure; it's going to take some time (seconds?)
to sync Host 3 back in when you add it after a failover, and the aim would
be not to have disturbed the application for that long, so it should
already be running on Host 2 during that resync.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK