On Mon, Nov 24, 2025 at 09:38:57AM +0100, Lukas Straub wrote:
> On Thu, 6 Nov 2025 11:21:56 +0800
> Zhang Chen <[email protected]> wrote:
>
> > On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
> > <[email protected]> wrote:
> > >
> > > On 06/11/2025 04:58, Peter Xu wrote:
> > > > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> > > >> Commit 4881411136 ("migration: Always set DEVICE state") set a new
> > > >> DEVICE state before completed during migration, which broke the
> > > >> original transition to COLO. The migration flow for precopy has
> > > >> changed to:
> > > >> active -> pre-switchover -> device -> completed.
> > > >>
> > > >> This patch updates the transition state to ensure that the Pre-COLO
> > > >> state corresponds to DEVICE state correctly.
> > > >>
> > > >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> > > >> Signed-off-by: Li Zhijian <[email protected]>
> > > >> ---
> > > >> [...]
> > > >
> > > > Thanks a lot for fixing it, Zhijian. It means I broke COLO already for
> > > > 10.0/10.1..
> > > >
> > > > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > > > enterprise? I don't expect any individual using it.. It definitely
> > > > complicates migration logics all over the places. Fabiano and I
> > > > discussed a few times on removing legacy code and COLO was always in
> > > > the list.
> > > >
> > > > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > > > least tried to re-implement the whole RDMA using rsocket, that didn't
> > > > land only because of a perf regression. Meanwhile, Zhijian also
> > > > provided an unit test, which we rely on recently to not break RDMA at
> > > > the minimum.
> > > >
> > > > If we do not have known users, I sincerely want to discuss with you on
> > > > obsoletion and removal of COLO from qemu codebase. Do you see feasible?
> > > >
> > > > Zhijian, do you have any input here?
> > >
> > > If we don't have any known users, I personally have no objection to
> > > removing COLO.
> > >
> > > From my previous understanding, its use cases are rather limited, and
> > > the checkpointing overhead is significant.
> > > Moreover, with the continuous development of Cloud Native over the past
> > > decade, service-based FT/HA solutions have become very mature, which
> > > shrinks the use cases for VM-based FT solutions even further.
> > >
> > > I think it's worth keeping if we have:
> > >
> > > - Active users who depend on it.
> > > - A unit test for the COLO framework.
> > >
> > > Thanks
> > > Zhijian
> > >
> >
> > Add CC Lukas.
> >
> > [...]
>
> Hello Everyone,
>
> Thanks for bringing this to my attention.
>
> I will write a migration unit-test and take maintainership of the colo
> components.

Thanks. It'd be great to also double check that the colo docs (e.g.
COLO-FT.txt) are still up to date. Another bonus if you could rewrite some
of them into .rst and put them under docs/devel/migration/, if that makes
sense to you.

>
> Peter, what is your plan with refactoring the migration code and where
> is the colo code blocking you?

No real blocker yet. It was a problem of extra complexity, and I wasn't
aware of anybody using COLO in production, or even in an experimental
environment, until I learned that you are using it.

I'm actually curious what the use case is in your setup, feel free to share
more if possible. I'd love to learn about it.

If you plan to maintain COLO, please feel free to review this series:

https://lore.kernel.org/r/[email protected]

>
> I have quite a few cleanup patches lying around. Are you open to take
> these in advance before the next merge window opens?

Normally cleanup patches do not qualify for the -rc category, but you can
send them out first and we can discuss the target merge window in the
thread.

Thanks,

--
Peter Xu
