24.10.2019 14:17, Kevin Wolf wrote: > Am 24.10.2019 um 12:46 hat Vladimir Sementsov-Ogievskiy geschrieben: >> 23.10.2019 18:26, Kevin Wolf wrote: >>> qcow2_detect_metadata_preallocation() calls qcow2_get_refcount() which >>> requires s->lock to be taken to protect its accesses to the refcount >>> table and refcount blocks. However, nothing in this code path actually >>> took the lock. This could cause the same cache entry to be used by two >>> requests at the same time, for different tables at different offsets, >>> resulting in image corruption. >>> >>> As it would be preferable to base the detection on consistent data (even >>> though it's just heuristics), let's take the lock not only around the >>> qcow2_get_refcount() calls, but around the whole function. >>> >>> This patch takes the lock in qcow2_co_block_status() earlier and asserts >>> in qcow2_detect_metadata_preallocation() that we hold the lock. >>> >>> Fixes: 69f47505ee66afaa513305de0c1895a224e52c45 >>> Cc: qemu-sta...@nongnu.org >>> Reported-by: Michael Weiser <mich...@weiser.dinsnail.net> >>> Signed-off-by: Kevin Wolf <kw...@redhat.com> >> >> Oh, I'm very sorry. I was going to backport this patch, and found that it's >> fixed in our downstream long ago, even before last upstream version patch >> sent :( > > Seriously? Is your downstream QEMU so divergent from upstream that you > wouldn't notice something like this? I think you really have to improve > something there.
How to improve it? Months and years are passed to bring patches into upstream, of course we have to live with a bunch (now it's nearer to 300 and this is a lot better then 500+ in the recent past) of patches above Rhel release. > > I mean, we have a serious data corruptor in the 4.1 release and I wasted > days to debug this, and you've been sitting on a fix for months without > telling anyone? Of course, it was not my goal to hide the fix, I'll do my best to avoid this in future. Very sorry for your time wasted. > This isn't a memory leak or something, this kills > people's images. It's eating their data. Possibly, I didn't know about data corruption. For some reason I decided that lock should be taken here, I don't remember. Still I should have applied it to upstream version too, of course. > > Modifying an image format driver requires utmost care and I think I'll > try to make sure to allow only few qcow2 changes per release in the > future. Too many changes were made in too short time, and with too > little care, and there are even more qcow2 patches pending. Please check > whether these series actually match your downstream code. The difference is that I didn't move the lock() but add separate lock()/unlock() pair around qcow2_detect_metadata_preallocation(). I think there is no serious difference. > Anyway, we'll > tread very carefully now with the pending patches, even if it means > delaying them for another release or two. What do you mean? What to do with sent patches during several months? > Stability is way more > important for an image format driver than new features and > optimisations. Agree. But I think, the main source of stability is testing. > > Do whatever you need to fix your downstream process, but seriously, this > must not ever happen again. > I don't see how to improve the process to exclude such problems. But I'll of course will do my best to avoid them. -- Best regards, Vladimir