On 01.04.19 09:21, Vladimir Sementsov-Ogievskiy wrote: > 29.03.2019 22:32, Kevin Wolf wrote: >> Am 29.03.2019 um 19:00 hat Vladimir Sementsov-Ogievskiy geschrieben: >>> 29.03.2019 20:58, Vladimir Sementsov-Ogievskiy wrote: >>>> 29.03.2019 20:44, Max Reitz wrote: >>>>> On 29.03.19 18:40, Kevin Wolf wrote: >>>>>> Am 29.03.2019 um 18:30 hat Max Reitz geschrieben: >>>>>>> On 29.03.19 18:24, Kevin Wolf wrote: >>>>>>>> Am 29.03.2019 um 18:15 hat Max Reitz geschrieben: >>>>>>>>> On 29.03.19 12:04, Vladimir Sementsov-Ogievskiy wrote: >>>>>>>>>> bdrv_replace_child() calls bdrv_check_perm() with error_abort on >>>>>>>>>> loosening permissions. However file-locking operations may fail even >>>>>>>>>> in this case, for example on NFS. And this leads to Qemu crash. >>>>>>>>>> >>>>>>>>>> Let's avoid such errors. Note, that we ignore such things anyway on >>>>>>>>>> permission update commit and abort. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy >>>>>>>>>> <vsement...@virtuozzo.com> >>>>>>>>>> --- >>>>>>>>>> block/file-posix.c | 12 ++++++++++++ >>>>>>>>>> 1 file changed, 12 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/block/file-posix.c b/block/file-posix.c >>>>>>>>>> index db4cccbe51..1cf4ee49eb 100644 >>>>>>>>>> --- a/block/file-posix.c >>>>>>>>>> +++ b/block/file-posix.c >>>>>>>>>> @@ -815,6 +815,18 @@ static int >>>>>>>>>> raw_handle_perm_lock(BlockDriverState *bs, >>>>>>>>>> switch (op) { >>>>>>>>>> case RAW_PL_PREPARE: >>>>>>>>>> + if ((s->perm | new_perm) == s->perm && >>>>>>>>>> + (s->shared_perm & new_shared) == s->shared_perm) >>>>>>>>>> + { >>>>>>>>>> + /* >>>>>>>>>> + * We are going to unlock bytes, it should not fail. If >>>>>>>>>> it fail due >>>>>>>>>> + * to some fs-dependent permission-unrelated reasons >>>>>>>>>> (which occurs >>>>>>>>>> + * sometimes on NFS and leads to abort in >>>>>>>>>> bdrv_replace_child) we >>>>>>>>>> + * can't prevent such errors by any check here. And we >>>>>>>>>> ignore them >>>>>>>>>> + * anyway in ABORT and COMMIT. >>>>>>>>>> + */ >>>>>>>>>> + return 0; >>>>>>>>>> + } >>>>>>>>>> ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm, >>>>>>>>>> ~s->shared_perm | ~new_shared, >>>>>>>>>> false, errp); >>>>>>>>> >>>>>>>>> Help me understand the exact issue, please. I understand that there >>>>>>>>> are >>>>>>>>> operations like bdrv_replace_child() that pass &error_abort to >>>>>>>>> bdrv_check_perm() because they just loosen the permissions, so it >>>>>>>>> should >>>>>>>>> not fail. >>>>>>>>> >>>>>>>>> However, if the whole effect really would be to loosen permissions, >>>>>>>>> raw_apply_lock_bytes() wouldn't have failed here in PREPARE anyway: >>>>>>>>> @unlock is passed as false, so no bytes will be unlocked. And if >>>>>>>>> permissions are just loosened (as your condition checks), it should >>>>>>>>> not >>>>>>>>> lock any bytes. >>>>>>>>> >>>>>>>>> So why does it attempt lock any bytes in the first place? There must >>>>>>>>> be >>>>>>>>> some discrepancy between s->perm and s->locked_perm, or >>>>>>>>> ~s->shared_perm >>>>>>>>> and s->locked_shared_perm. How does that occur? >>>>>>>> >>>>>>>> I suppose raw_check_lock_bytes() is what is failing, not >>>>>>>> raw_apply_lock_bytes(). >>>>>>> >>>>>>> Hm, maybe in Vladimir's case, but not in e.g. >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1652572 . >>>>>> >>>>>> This is reported against 3.0, which didn't avoid re-locking permissions >>>>>> that we already hold, so there raw_apply_lock_bytes() can still fail. >>>>> >>>>> That makes sense. Which leaves the question why Vladimir still seems to >>>>> see the error there...? >>>>> >>>> >>>> I'm sorry :(. I'm trying to fix bug based on 2.10, and now I see that is >>>> already fixed >>>> upstream. I don't have a reproducer, only old coredumps. >>>> >>>> So, now it looks like we don't need this patch, as on permission loosening >>>> file-posix >>>> don't call any FS apis, yes? >>>> >>> >>> >>> Ah, you mentioned, that raw_check_lock_bytes is still buggy. >> >> I haven't tried it out, but from looking at the code it seems so. Maybe >> you can reproduce on master just to be sure? >> > > I don't have a reproducer :(
I have one, but it only breaks before 2996ffad3acabe890fbb4f84a069cdc325a68108: First, setup on an NFS mount on /mnt/nfs. Second: $ qemu-img create -f qcow2 /mnt/nfs/foo.qcow2 64M Formatting '/mnt/nfs/foo.qcow2', fmt=qcow2 size=67108864 cluster_size=65536 lazy_refcounts=off refcount_bits=16 $ (sleep 5; echo "{'execute':'qmp_capabilities'}"; \ echo "{'execute':'blockdev-del','arguments':{'node-name':'fmt'}}"; echo "{'execute':'quit'}") \ | x86_64-softmmu/qemu-system-x86_64 -qmp stdio \ -blockdev node-name=proto,driver=file,filename=/mnt/nfs/foo.qcow2 \ -blockdev node-name=fmt,driver=qcow2,file=proto {"QMP": {"version": {"qemu": {"micro": 90, "minor": 0, "major": 3}, "package": "v3.1.0-rc0-71-ga883d6a0bc"}, "capabilities": []}} Before the sleep is done, stop the service on the NFS host: $ systemctl stop nfs-service Once the sleep has run out (you get a {"return": {}} over QMP), start the service again: $ systemctl start nfs-service And then this happens: Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:705: Failed to lock byte 100 [1] 30486 done ( sleep 5; echo "{'execute':'qmp_capabilities'}"; echo ; echo ; ) | 30487 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp stdio -blockdev -blockdev It works fine after 2996ffad3acabe890fbb4f84a069cdc325a68108. Max
signature.asc
Description: OpenPGP digital signature