On 05/04/2017 13:01, Kevin Wolf wrote:
> On 04.04.2017 at 17:09, Paolo Bonzini wrote:
>> On 04/04/2017 16:53, Kevin Wolf wrote:
>>>> The big question is how this fits into release management. We have
>>>> another important regression from the op blocker work and only a week
>>>> to go before the last rc. Are we going to delay 2.9 arbitrarily? Are
>>>> we going to shorten the 2.10 development period correspondingly? (I
>>>> vote yes and yes, FWIW).
>>> Which is the other regression?
>>
>> The assertion failure for snapshot_blkdev with iothreads.
>
> Ah, right, I keep forgetting that this started appearing with the op
> blocker series because the failure mode is completely different, so it
> seems to have been a latent bug somewhere else that was uncovered by it.
>
> If we're sure that the change of the order in bdrv_append() is what
> caused the bug to appear, we can just undo that for 2.9, at the cost of
> a messed up graph in the error case when bdrv_set_backing_hd() fails
> (because we have no way to undo bdrv_replace_node()).

I don't know if that is enough to fix all of the issues, but the bug is
easy to reproduce.

The issue is the lack of understanding of what node movement does to
quiesce_counter. The invariant is that children cannot have a lower
quiesce_counter than parents, I think (paths in the graph can only join
in the children direction, right?). Is it checked, and are there
violations already?

Maybe we need a get_quiesce_counter method in BdrvChildRole, to cover
BlockBackend's quiesce_counter? Then we can use that information to
adjust the quiesce_counter when nodes move in the graph.

The block layer has good tests, but as the internal logic grows more
complex we should probably have more C level tests. I'm constantly
impressed by the number of tricky cases that test-replication.c catches
in the block job code.

Paolo
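
For illustration only, the quiesce_counter invariant mentioned above could be checked along the lines of the sketch below. It uses a toy graph, not QEMU's real BlockDriverState or BdrvChild structures: ToyNode, TOY_MAX_CHILDREN and toy_quiesce_invariant_holds() are hypothetical names made up for this example, and the graph is assumed to be acyclic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a block graph node; NOT QEMU's BlockDriverState. */
#define TOY_MAX_CHILDREN 4

typedef struct ToyNode {
    int quiesce_counter;                        /* pending quiesce requests */
    size_t n_children;                          /* entries used in children[] */
    struct ToyNode *children[TOY_MAX_CHILDREN];
} ToyNode;

/*
 * Returns true if no node reachable from 'node' has a child whose
 * quiesce_counter is lower than its parent's. Assumes an acyclic graph.
 */
static bool toy_quiesce_invariant_holds(const ToyNode *node)
{
    for (size_t i = 0; i < node->n_children; i++) {
        const ToyNode *child = node->children[i];

        if (child->quiesce_counter < node->quiesce_counter) {
            return false;   /* child is less quiesced than its parent */
        }
        if (!toy_quiesce_invariant_holds(child)) {
            return false;   /* violation somewhere further down */
        }
    }
    return true;
}
```

A check of this shape could in principle run as an assertion after each graph change in a C level test, which would show whether node movement already produces violations.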