On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> On 11/12/20 6:40 AM, Peter Lieven wrote:
>
> > >          /*
> > > -         * Avoid that s->sector_next_status becomes unaligned to the source
> > > -         * request alignment and/or cluster size to avoid unnecessary read
> > > -         * cycles.
> > > +         * Avoid that s->sector_next_status becomes unaligned to the
> > > +         * source/destination request alignment and/or cluster size to avoid
> > > +         * unnecessary read/write cycles.
> > >          */
> > > -        tail = (sector_num - src_cur_offset + n) % s->src_alignment[src_cur];
> > > +        alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > +        assert(is_power_of_2(alignment));
> > > +
> > > +        tail = (sector_num - src_cur_offset + n) % alignment;
> > >          if (n > tail) {
> > >              n -= tail;
> > >          }
> >
> > I was also considering including the s->alignment when adding this change.
> > However, you need the least common multiple of both alignments, not the
> > maximum, otherwise you might get misaligned to either source or destination.
> >
> > Why exactly do you need the power of two requirement?
>
> The power of two requirement ensures that you have the least common
> multiple of both alignments ;)

Yes, and in addition to that, both alignments are already a power of two
because we align them to it. The assert I added is just in case.
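To make that argument explicit, here is a small standalone test program (not
QEMU code; MAX, is_power_of_2 and lcm are reimplemented just for the example):
for any pair of power-of-two alignments the larger one is a multiple of the
smaller one, so MAX() is already their least common multiple.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    static bool is_power_of_2(uint64_t x)
    {
        return x && !(x & (x - 1));
    }

    static uint64_t gcd(uint64_t a, uint64_t b)
    {
        while (b) {             /* Euclid's algorithm */
            uint64_t t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static uint64_t lcm(uint64_t a, uint64_t b)
    {
        return a / gcd(a, b) * b;
    }

    int main(void)
    {
        /* for power-of-two alignments, MAX() equals the least common
         * multiple, because the smaller one always divides the larger one */
        for (uint64_t src = 1; src <= 1 << 20; src <<= 1) {
            for (uint64_t dst = 1; dst <= 1 << 20; dst <<= 1) {
                assert(is_power_of_2(src) && is_power_of_2(dst));
                assert(lcm(src, dst) == MAX(src, dst));
            }
        }
        return 0;
    }

So as long as both s->alignment and s->src_alignment[] stay powers of two,
MAX() plus the assert should be enough.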
This is how we calculate the destination alignment:

    s.alignment = MAX(pow2floor(s.min_sparse),
                      DIV_ROUND_UP(out_bs->bl.request_alignment,
                                   BDRV_SECTOR_SIZE));

And this is how we calculate the source alignments (it is possible to have
several source files):

    s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
                                         BDRV_SECTOR_SIZE);
    if (!bdrv_get_info(src_bs, &bdi)) {
        s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i],
                                    bdi.cluster_size / BDRV_SECTOR_SIZE);
    }

The comment on bl.request_alignment says that it must be a power of 2, and
cluster sizes are also very likely to be a power of 2 in all drivers we
support. An assert for s.src_alignment won't hurt, though.

Note, though, that we don't really read the discard alignment. We have
max_pdiscard and pdiscard_alignment, but max_pdiscard is only used to split
'too large' discard requests, and pdiscard_alignment is advisory and used
only in a couple of places. Neither is set by file-posix.

We do have request_alignment, which file-posix tries to align to the logical
block size, and that is still often 512 for backward compatibility on many
devices (even NVMe).

So both the source and destination alignments in qemu-img convert are based
on request_alignment and on min_sparse (which is 4K by default and controlled
by the qemu-img -S parameter, which otherwise controls the minimum number of
blocks that must be zero for convert to consider discarding them).

This means that there is no guarantee of 4K alignment, and besides, with a
sufficiently fragmented source file bdrv_block_status_above can return
arbitrarily short and unaligned extents which can't be aligned. Thus, as I
said, this patch alone can't guarantee that we won't have a write and a
discard sharing the same page.

Another thing that can be discussed is why is_allocated_sectors was patched
to convert short discards to writes. Probably because the underlying hardware
ignores them or something? In theory we should detect this and fall back to
regular zeroing in that case. Again, though, while in the case of converting
an empty file this is the only source of writes, so eliminating it also
'fixes' this case, with a sufficiently fragmented source file we can get a
write and a discard sharing a page even without this code.

Best regards,
	Maxim Levitsky

> However, there ARE devices in practice that have advertised a
> non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> 63188c24). Which means you probably don't want to assert power-of-2,
> and in turn need to worry about least common multiple.
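Side note on that last point: if one of the granularities is not a power of
two, MAX() stops being a least common multiple and the combined alignment
would have to be computed as a real LCM. A tiny standalone illustration (not
QEMU code; the 15 MiB value mirrors the example from the quoted commits, the
2 MiB cluster size is just an invented counterpart):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define MAX(a, b) ((a) > (b) ? (a) : (b))
    #define MiB (1024 * 1024ULL)

    static uint64_t gcd(uint64_t a, uint64_t b)
    {
        while (b) {             /* Euclid's algorithm */
            uint64_t t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static uint64_t lcm(uint64_t a, uint64_t b)
    {
        return a / gcd(a, b) * b;
    }

    int main(void)
    {
        uint64_t discard_granularity = 15 * MiB; /* non-power-of-two, as in the quoted commits */
        uint64_t cluster_size = 2 * MiB;         /* hypothetical destination cluster size */

        /* MAX() would round only to 15 MiB boundaries, which are not
         * 2 MiB aligned; the LCM (30 MiB) satisfies both constraints. */
        printf("MAX: %" PRIu64 ", LCM: %" PRIu64 "\n",
               MAX(discard_granularity, cluster_size),
               lcm(discard_granularity, cluster_size));
        assert(lcm(discard_granularity, cluster_size) !=
               MAX(discard_granularity, cluster_size));
        return 0;
    }

Here MAX() gives 15 MiB, which is not a multiple of 2 MiB, while the least
common multiple is 30 MiB, which is aligned to both.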