On Thu, Sep 22, 2022 at 05:58:25PM +0100, Daniel P. Berrangé wrote: > On Tue, Sep 20, 2022 at 06:37:57PM -0400, Peter Xu wrote: > > In qemu_file_shutdown(), there's a possible race if with current order of > > operation. There're two major things to do: > > > > (1) Do real shutdown() (e.g. shutdown() syscall on socket) > > (2) Update qemufile's last_error > > > > We must do (2) before (1) otherwise there can be a race condition like: > > > > page receiver other thread > > ------------- ------------ > > qemu_get_buffer() > > do shutdown() > > returns 0 (buffer all zero) > > (meanwhile we didn't check this retcode) > > try to detect IO error > > last_error==NULL, IO okay > > install ALL-ZERO page > > set last_error > > --> guest crash! > > > > To fix this, we can also check retval of qemu_get_buffer(), but not all > > APIs can be properly checked and ultimately we still need to go back to > > qemu_file_get_error(). E.g. qemu_get_byte() doesn't return error. > > > > Maybe some day a rework of qemufile API is really needed, but for now keep > > using qemu_file_get_error() and fix it by not allowing that race condition > > to happen. Here shutdown() is indeed special because the last_error was > > emulated. For real -EIO errors it'll always be set when e.g. sendmsg() > > error triggers so we won't miss those ones, only shutdown() is a bit tricky > > here. > > The ultimate answer really is to stop using QEMUFile entirely and just > do migration with the QIOChannel objects directly. The work I did in the > last cycle to remove the QEMUFileOps callbacks was another piece of the > puzzle in moving in that direction, by ensuring that every QEMUFile is > now backed by a QIOChannel. All QEMUFile is doing now is adding the I/O > caching layer and some convenience APIs for I/O operations. > > So the next step would be to add a QIOChannelCached class that can wrap > over another QIOChannel, to add the I/O buffering functionality that > currently exists in QEMUFile. Once that's done, it'd be at the stage > where we could look at how to use QIOChannel APIs for VMstate. It would > likely involve wiring up an Error **errp parameter into a great many > places so we get synchronous error propagation instead of out-of-band > error checking, so a phased transition would need to be figured out.
Yes, Error** sounds very good to have. So far I'm not satisfied with qemufile api majorly because of that error handling, especially on *get() interfaces. Besides that, do you have anything else in mind that would make QIOChannelCached better than qemufile (e.g. on how we do caching)? Thanks, -- Peter Xu