Am 10.11.2015 um 10:01 hat Paolo Bonzini geschrieben: > > > On 10/11/2015 07:14, Fam Zheng wrote: > > On Mon, 11/09 17:29, Kevin Wolf wrote: > >> Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben: > >>> > >>> > >>> On 09/11/2015 17:04, Kevin Wolf wrote: > >>>> Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben: > >>>>> The "pnum < nb_sectors" condition in deciding whether to actually copy > >>>>> data is unnecessarily strict, and the qiov initialization is > >>>>> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard > >>>>> branches. > >>>>> > >>>>> Reorganize mirror_iteration flow so that we: > >>>>> > >>>>> 1) Find the contiguous zero/discarded sectors with > >>>>> bdrv_get_block_status_above() before deciding what to do. We query > >>>>> s->buf_size sized blocks at a time. > >>>>> > >>>>> 2) If the sectors in question are zeroed/discarded and aligned to > >>>>> target cluster, issue zero write or discard accordingly. It's done > >>>>> in mirror_do_zero_or_discard, where we don't add buffer to qiov. > >>>>> > >>>>> 3) Otherwise, do the same loop as before in mirror_do_read. > >>>>> > >>>>> Signed-off-by: Fam Zheng <f...@redhat.com> > >>>> > >>>> I'm not sure where in the patch to comment on this, so I'll just do it > >>>> here right in the beginning. > >>>> > >>>> I'm concerned that we need to be more careful about races in this patch, > >>>> in particular regarding the bitmaps. I think the conditions for the two > >>>> bitmaps are: > >>>> > >>>> * Dirty bitmap: We must clear the bit after finding the next piece of > >>>> data to be mirrored, but before we yield after getting information > >>>> that we use for the decision which kind of operation we need. > >>>> > >>>> In other words, we need to clear the dirty bitmap bit before calling > >>>> bdrv_get_block_status_above(), because that's both the function that > >>>> retrieves information about the next chunk and also a function that > >>>> can yield. > >>>> > >>>> If after this point the data is written to, we need to mirror it > >>>> again. > >>> > >>> With Fam's patch, that's not trivial for two reasons: > >>> > >>> 1) bdrv_get_block_status_above() can return a smaller amount than what > >>> is asked. > >>> > >>> 2) the "read and write" case can handle s->granularity sectors per > >>> iteration (many of them can be coalesced, but still that's how the > >>> iteration works). > >>> > >>> The simplest solution is to perform the query with s->granularity size > >>> rather than s->buf_size. > >> > >> Then we end up with many small operations, that's not what we want. > >> > >> Why can't we mark up to s->buf_size dirty clusters as clean first, then > >> query the status, and mark all of those that we can't handle dirty > >> again? > > > > Then we may end up marking more clusters as dirty than it should be. > > You're both right. > > > Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are > > coroutine, > > we can introduce a CoMutex to let bitmap reader block bdrv_set_dirty and > > bdrv_set_dirty_bitmap. > > I think this is not necessary. > > I think the following is safe: > > 1) before calling bdrv_get_block_status_above(), find out how many > consecutive bits in the dirty bitmap are 1 > > 2) zero all those bits in the dirty bitmap > > 3) call bdrv_get_block_status_above() with a size equivalent to the > number of dirty bits > > 4) if bdrv_get_block_status_above() only returns a partial result, loop > step (3) until all the dirty bits are processed
Right, you can always implement one iteration with more than one I/O request. And maybe that would be the time to start a coroutine for the requests already in the mirror code instead of complicating the AIO state machine and letting block.c start coroutines. > For full mirroring, this strategy will probably make the first > incremental iteration more expensive. You mean because we issue smaller, interleaved write and write_zeroes requests now instead of only large writes? That's probably right, but getting the right result should be more important than speed. :-) Kevin