On 2018-03-30 17:32, Vladimir Sementsov-Ogievskiy wrote:
> 30.03.2018 16:31, Vladimir Sementsov-Ogievskiy wrote:
>> 29.03.2018 18:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.03.2018 17:03, Max Reitz wrote:
>>>> On 2018-03-29 10:08, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 28.03.2018 17:53, Max Reitz wrote:
>>>>>> On 2018-03-27 12:11, Vladimir Sementsov-Ogievskiy wrote:
>>>> [...]
>>>>
>>>>>>> isn't it because of a lot of cat processes? will check, will update the loop to: i=0; while check -qcow2 169; do ((i++)); echo $i OK; killall -9 cat; done
>>>>>> Hmm... I know I tried to kill all of the cats, but for some reason that didn't really help yesterday. Seems to help now, for 2.12.0-rc0 at least (that is, before this series).
>>>>> reproduced with killing... (without this series, just on master)
>>>>>
>>>>>> After the whole series, I still get a lot of failures in 169 (mismatching bitmap hash, mostly).
>>>>>>
>>>>>> And interestingly, if I add an abort():
>>>>>>
>>>>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>>>>> index 486f3e83b7..9204c1c0ac 100644
>>>>>> --- a/block/qcow2.c
>>>>>> +++ b/block/qcow2.c
>>>>>> @@ -1481,6 +1481,7 @@ static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
>>>>>>      }
>>>>>>
>>>>>>      if (bdrv_dirty_bitmap_next(bs, NULL)) {
>>>>>> +        abort();
>>>>>>          /* It's some kind of reopen with already existing dirty bitmaps. There
>>>>>>           * are no known cases where we need loading bitmaps in such situation,
>>>>>>           * so it's safer don't load them.
>>>>>>
>>>>>> Then this fires for a couple of test cases of 169 even without the third patch of this series.
>>>>>>
>>>>>> I guess bdrv_dirty_bitmap_next() reacts to some bitmaps that migration adds or something? Then this would be the wrong condition, because I guess we still want to load the bitmaps that are in the qcow2 file.
>>>>>>
>>>>>> I'm not sure whether bdrv_has_readonly_bitmaps() is the correct condition then, either, though. Maybe let's take a step back: We want to load all the bitmaps from the file exactly once, and that is when it is opened the first time. Or that's what I would have thought... Is that even correct?
>>>>>>
>>>>>> Why do we load the bitmaps when the device is inactive anyway? Shouldn't we load them only once the device is activated?
>>>>> Hmm, not sure. Maybe we don't need to. But we need to load them anyway when opening read-only, and we should reopen them accordingly in that case.
>>>> Yeah, well, yeah, but the current state seems just wrong. Apparently there are cases where a qcow2 node may have bitmaps before we try to load them from the file, so the current condition doesn't work.
>>>>
>>>> Furthermore, it seems like the current "state machine" is too complex, so we no longer know which cases are possible and what to do when.
>>>>
>>>> So the first thing we could do is add a field to the BDRVQCow2State that tells us whether the bitmaps have been loaded already or not. If not, we invoke qcow2_load_dirty_bitmaps() and set the value to true. If the value was true already and the BDS is active and R/W now, we call qcow2_reopen_bitmaps_rw_hint(). That should solve one problem.
>>> good idea, will do.
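
(Just to be explicit about what I mean, here is a toy model of that logic; it is not QEMU code, and the field name, the stub functions and the active/read-only parameters are all made up:)

/* Toy model, not QEMU code; all names here are made up. */
#include <stdbool.h>
#include <stdio.h>

typedef struct ToyQcow2State {
    /* would live in BDRVQCow2State; set after the one and only load */
    bool dirty_bitmaps_loaded;
} ToyQcow2State;

/* stand-ins for qcow2_load_dirty_bitmaps() and qcow2_reopen_bitmaps_rw_hint() */
static void load_dirty_bitmaps(void) { printf("load bitmaps from the file\n"); }
static void reopen_bitmaps_rw(void)  { printf("flip loaded bitmaps to R/W\n"); }

static void toy_do_open(ToyQcow2State *s, bool active, bool read_only)
{
    if (!s->dirty_bitmaps_loaded) {
        load_dirty_bitmaps();
        s->dirty_bitmaps_loaded = true;
    } else if (active && !read_only) {
        reopen_bitmaps_rw();
    }
}

int main(void)
{
    ToyQcow2State s = { .dirty_bitmaps_loaded = false };

    toy_do_open(&s, false, false); /* first open, e.g. on the incoming side */
    toy_do_open(&s, true, false);  /* invalidate_cache: now active and R/W */
    return 0;
}

That is, the second call (the invalidate_cache one) only flips the already-loaded bitmaps to R/W instead of loading them from the file a second time.
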
>>>> The other problem of course is the question whether we should call qcow2_load_dirty_bitmaps() at all while the drive is still inactive. You know the migration model better than me, so I'm asking this question to you. We can phrase it differently: Do we need to load the bitmaps before the drive is activated?
>>> Now I think that we don't. At least, we don't have such cases in Virtuozzo (I hope :).
>>>
>>> Why I had doubts:
>>>
>>> 1. We have cases where we want to start the VM as inactive, so that we can export its drive over NBD, push some data to it and then start the VM (which involves activating).
>>> 2. We have cases where we want to start the VM stopped and operate on dirty bitmaps.
>>>
>>> If we just open all images in inactive mode until the VM starts, it looks like we need bitmaps in inactive mode (for 2.). But that looks like the wrong approach anyway.
>>> At first I tried to solve (1.) by simply calling inactivate_all() when the VM is started in paused mode, but that breaks at least (2.), so in the end I solved (1.) with an approach similar to "-incoming defer". So we have inactive mode in two cases:
>>> - incoming migration
>>> - pushing data to the VM before start
>>>
>>> and in these cases we don't need to load dirty bitmaps.
>>>
>>> There is also an inconsistency: we currently remove persistent bitmaps on inactivation, so it is inconsistent to load them in inactive mode.
>>>
>>> Ok, I'll try to respin.
>>
>> finally, what cases do we actually have for qcow2_do_open?
>>
>> 1. INACTIVE -> ACTIVE (through invalidate_cache; we obviously should load bitmaps, if we decide that we have no persistent bitmaps in INACTIVE mode)
>> 2. creating a new bdrv state (first open of the image) in INACTIVE mode (will not load bitmaps)
>> 3. creating a new bdrv state (first open of the image) in ACTIVE mode (will load bitmaps, maybe read-only if the disk is RO)
>>
>> If these are the only three cases, it would be enough to just load bitmaps if !INACTIVE and do nothing otherwise.

Maybe, but I'd prefer just adding a flag for now. We can think about whether we need bitmaps in inactive mode later, and even then managing the state through a flag doesn't hurt.

>> Or do we have some of the following cases, too?
>>
>> 1?. ACTIVE -> ACTIVE (through invalidate_cache, some kind of no-op; we should not reload bitmaps)
>> 2?. RO -> RW (we should reopen_bitmaps_rw) (or is that possible only through bdrv_reopen, which will not touch qcow2_do_open?)
>> 3?. RW -> RO (reopen_bitmaps_ro ?)
>> ? something else??

I don't really know, but I just hope having a flag saves us from having to think about all of this too much. :-)
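
(For what it's worth, this is how the flag would sort the cases you list; again just a toy, and the comments mark the spots where I'm assuming things, e.g. that calling the rw hint a second time is harmless:)

/* Toy walk over the cases above; not QEMU code, purely illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { LOAD_BITMAPS, REOPEN_BITMAPS_RW, DO_NOTHING } Action;

/* "load once; afterwards only react to becoming active and writable" */
static Action decide(bool loaded_before, bool active, bool read_only)
{
    if (!loaded_before) {
        return LOAD_BITMAPS;
    }
    return (active && !read_only) ? REOPEN_BITMAPS_RW : DO_NOTHING;
}

int main(void)
{
    static const struct {
        const char *name;
        bool loaded_before, active, read_only;
    } cases[] = {
        /* whether we want to load anything at all here is the open
         * question from above: */
        { "2.  first open, INACTIVE", false, false, false },
        { "3.  first open, ACTIVE",   false, true,  false },
        { "1.  INACTIVE -> ACTIVE",   true,  true,  false },
        /* calls the rw hint again; I'm assuming that is effectively
         * a no-op for bitmaps that are already writable: */
        { "1?. ACTIVE -> ACTIVE",     true,  true,  false },
        { "2?. RO -> RW",             true,  true,  false },
        /* the flag alone says "nothing" here; whether we would need a
         * reopen_bitmaps_ro is a separate question: */
        { "3?. RW -> RO",             true,  true,  true  },
    };
    static const char *action_names[] = {
        "load bitmaps", "reopen bitmaps R/W", "do nothing",
    };

    for (size_t i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
        printf("%-26s -> %s\n", cases[i].name,
               action_names[decide(cases[i].loaded_before,
                                   cases[i].active, cases[i].read_only)]);
    }
    return 0;
}

So the "?" cases would not need code paths of their own, as long as the assumptions in the comments actually hold.
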
>>> about 169, how often is it reproducible for you?

More errors than passes, I guess. But it always leaves cats behind, and that's another sign that something isn't right.

>> it becomes very interesting.
>>
>> The persistent-migbitmap-online case failed on line 136 with a mismatched bitmap sha. This check is not related to migration; at this point the bitmap has already been migrated successfully. But for some reason it is corrupted after stopping/starting the VM. How that is possible, I can't imagine. But it doesn't look really related to migration... Could it be related to the case where postcopy was not finished, or something like that? Maybe changing wait(RESUME) to something more appropriate will help. But it is strange anyway.
>>
>> The persistent-notmigbitmap-offline case failed on line 128, which is "self.vm_b.event_wait("RESUME", timeout=10.0)", with a timeout. Can we skip RESUME for some reason? Maybe just moving to the MIGRATION event will help, will check.

It would still be interesting to know where the RESUME ended up...

> looks like moving from "RESUME" to "MIGRATION" events fixes the whole issue (I already have 740 successful iterations, which is good enough, but I will not stop the loop until at least 2000). I'll submit a patch. But I don't understand why it fixes the sha256 mismatch, and why the sha256 check fails with "RESUME" (the check succeeds after migration and fails after stop/start, o_O)...

Well, if you have the time, it might be worth investigating.

Max