On Wed, Jun 26, 2019 at 11:10:17AM -0500, Goldwyn Rodrigues wrote:
> On 8:39 26/06, Christoph Hellwig wrote:
> > On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > > > I can't say I'm a huge fan of this two iomaps in one method call
> > > > approach.  I always thought two separate iomap iterations would be nicer,
> > > > but compared to that even the older hack with just the additional
> > > > src_addr seems a little better.
> > >
> > > I am just expanding on your idea of using multiple iterations for the COW
> > > case in the hope we can come up with a good design:
> > >
> > > 1. iomap_file_buffered_write calls iomap_apply with the IOMAP_WRITE flag,
> > > which calls iomap_begin() for the respective filesystem.
> > > 2. btrfs_iomap_begin() sets iomap->type to IOMAP_COW and fills the iomap
> > > struct with the read address information.
> > > 3. For IOMAP_COW, iomap_apply() calls do_cow() (a new function) and then
> > > calls ops->iomap_begin() again with a new flag, IOMAP_COW_READ_DONE.
> > > 4. btrfs_iomap_begin() fills the iomap struct with the write information.
> > >
> > > Step 3 seems out of place because iomap_apply should be iomap->type
> > > agnostic, right?
> > > Should we be adding another flag, IOMAP_COW_READ_DONE, just so that
> > > iomap_begin can tell this is the "real" write and fill the iomap?
> > >
> > > If this is not how you imagined, could you elaborate on the dual iteration
> > > sequence?
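If I sketch those four steps out (do_cow(), IOMAP_COW, and
IOMAP_COW_READ_DONE are the new pieces proposed above, not existing API;
the scaffolding loosely mirrors today's iomap_apply), the generic side
comes out roughly like this:

	loff_t
	iomap_apply(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, const struct iomap_ops *ops,
			void *data, iomap_actor_t actor)
	{
		struct iomap iomap = { 0 };
		loff_t written = 0;
		int ret;

		/* steps 1-2: fs returns an IOMAP_COW mapping with read info */
		ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
		if (ret)
			return ret;

		if (iomap.type == IOMAP_COW) {
			/* step 3: read the old blocks in */
			ret = do_cow(inode, pos, length, &iomap);
			if (ret)
				goto out;

			/* step 4: call back for the real write mapping */
			ret = ops->iomap_begin(inode, pos, length,
					flags | IOMAP_COW_READ_DONE, &iomap);
			if (ret)
				goto out;
		}

		written = actor(inode, pos, length, data, &iomap);
	out:
		if (ops->iomap_end)
			ret = ops->iomap_end(inode, pos, length,
					written > 0 ? written : 0, flags,
					&iomap);
		return written ? written : ret;
	}

which shows the layering problem you're pointing at below: the generic
loop has to learn what IOMAP_COW means.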
> >
> > Here are my thoughts from dealing with this a while ago, all
> > XFS based of course.
> >
> > If iomap_file_buffered_write is called on a page that is inside a COW
> > extent we have the following options:
> >
> > a) the page is uptodate or entirely overwritten.  We can just allocate
> > new COW blocks and return them, and we are done.
> > b) the page is not/partially uptodate and not entirely overwritten.
> >
> > The latter case is the interesting one.  My thought was that iff the
> > IOMAP_F_SHARED flag is set, __iomap_write_begin / iomap_read_page_sync
> > will then have to retrieve the source information in some form.
> >
> > My original plan was to just do a nested iomap_apply call, which would
> > need a special nested flag to not duplicate any locking the file
> > system might be holding between ->iomap_begin and ->iomap_end.
> >
> > The upside here is that there is no additional overhead for the non-COW
> > path and the architecture looks relatively clean. The downside is that
> > at least for XFS we usually have to look up the source information
> > anyway before allocating the COW destination extent, so we'd have to
> > cache that information somewhere or redo it, which would be rather
> > pointless. At that point the idea of a srcaddr in the iomap becomes
> > interesting again - while it looks a little ugly from the architectural
> > POV it actually ends up having very practical benefits.
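To make sure we're talking about the same thing, here is how I read that
nested scheme; IOMAP_NESTED and iomap_cow_read_actor are names I just
made up for the "don't retake locks" flag and for whatever pulls the old
data into the page:

	/* in __iomap_write_begin, for the partially-uptodate case (b) */
	if ((iomap->flags & IOMAP_F_SHARED) && !PageUptodate(page)) {
		/*
		 * Nested lookup for the read source.  IOMAP_NESTED tells
		 * ->iomap_begin not to retake locks the outer iteration
		 * already holds; ops would have to be plumbed down into
		 * __iomap_write_begin for this to work at all.
		 */
		ret = iomap_apply(inode, pos, len, IOMAP_NESTED, ops,
				page, iomap_cow_read_actor);
		if (ret < 0)
			return ret;
	}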
I think it's less complicated to pass both mappings out in a single
->iomap_begin call rather than having this dance where the fs tells iomap
that it needs a read mapping and iomap then calls back for that mapping
with a special "don't take locks" flag.
For XFS specifically this means we can serve both mappings with a single
ILOCK cycle.
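Concretely, ->iomap_begin would grow a second iomap argument; as a sketch
(not a posted patch), something like:

	struct iomap_ops {
		/*
		 * Return the mapping for the write in @iomap; for a COW
		 * write, also return the mapping to read the old data
		 * from in @srcmap.  Both come out of one call, so the fs
		 * can do both lookups under one round of locking.
		 */
		int (*iomap_begin)(struct inode *inode, loff_t pos,
				loff_t length, unsigned flags,
				struct iomap *iomap, struct iomap *srcmap);

		int (*iomap_end)(struct inode *inode, loff_t pos,
				loff_t length, ssize_t written,
				unsigned flags, struct iomap *iomap);
	};

Filesystems that never COW simply leave srcmap alone (e.g. as a hole).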
> So, do we move back to the design of adding an extra srcaddr field?
TLDR: Please no.
> Honestly, I find the design of using an extra srcaddr field in the iomap
> better and simpler than passing an additional srcmap iomap or doing
> multiple iterations.
Putting my long-range planning hat on, the usage story (from the fs'
perspective) here is:
"iomap wants to know how a file write should map to a disk write. If
we're doing a straight overwrite of disk blocks then I should send back
the relevant mapping. Sometimes I might need the write to go to a
totally different location than where the data is currently stored, so I
need to send back a second mapping."
Because iomap is now a general-purpose API, we need to think about the
read mapping for a moment:
- For all disk-based filesystems we'll want the source address for the
read mapping.
- For filesystems that support "inline data" (which really just means
the fs maintains its own buffers for file data) we'll also need the
inline_data pointer.
- For filesystems that support multiple devices (like btrfs) we'll also
need a pointer to a block_device because we could be writing to a
different device than the one that stores the data. The prime
example I can think of is reading data from disk A in some RAID
stripe and writing to disk B in a different RAID stripe to solve the
RAID5 hole... but you could just be lazy-migrating file data to less
full or newer drives or whatever.
- If we ever encounter a filesystem that supports multiple dax devices
then we'll need a pointer to the dax_device too. (That would be
btrfs, since I thought your larger goal was to enable dax there...)
- We're probably going to need the ability to pass flags for the read
mapping at some point or another, so we need that too.
From this, you can see that the read mapping needs about half the fields
in the existing struct iomap, and that's how I arrived at the idea of
passing pointers to two iomaps to ->iomap_begin instead of bolting these
five fields onto struct iomap.
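To put that in concrete terms, bolting the read fields onto the existing
structure would look roughly like this (the src_* names are illustrative,
not from any posted patch):

	struct iomap {
		u64			addr;	  /* disk offset of mapping */
		loff_t			offset;	  /* file offset of mapping */
		u64			length;	  /* length of mapping */
		u16			type;	  /* type of mapping */
		u16			flags;	  /* flags for mapping */
		struct block_device	*bdev;	  /* block device for I/O */
		struct dax_device	*dax_dev; /* dax_dev for dax operations */
		void			*inline_data;
		void			*private; /* filesystem private */

		/* the five bolt-on fields a COW read mapping would need */
		u64			src_addr;
		u16			src_flags;
		struct block_device	*src_bdev;
		struct dax_device	*src_dax_dev;
		void			*src_inline_data;
	};

whereas a second struct iomap gives us all of those fields (and anything
we add later) for free.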
In XFS parlance (where the data fork stores mappings for on-disk data
and the cow fork stores mappings for copy-on-write staging extents), this
means that our iomap_begin data paths remain fairly straightforward:
	xfs_bmapi_read(ip, offset, XFS_DATA_FORK, &imap...);	/* read source */
	xfs_bmapi_read(ip, offset, XFS_COW_FORK, &cmap...);	/* write destination */
	xfs_bmbt_to_iomap(ip, srcmap, &imap...);	/* data fork -> read mapping */
	xfs_bmbt_to_iomap(ip, iomap, &cmap...);		/* cow fork -> write mapping */
	iomap->type = IOMAP_COW;
(It's more complicated than that, but that's approximately what we do
now.)
> Also, should we add another iomap type IOMAP_COW, or (re)use the flag
> IOMAP_F_SHARED during writes? IOW iomap type vs iomap flag.
I think they need to remain distinct because the IOMAP_F_SHARED flag
means that the storage is shared amongst multiple owners (which
iomap_fiemap reports to userspace), whereas the IOMAP_COW type means
that RMW is required for unaligned writes.
btrfs has a strong tendency to require copy writes, but that doesn't
mean the extent is necessarily shared.
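Concretely, a consumer tests them independently; do_read_modify_write is
a stand-in name here, not a real helper:

	/* fiemap: tell userspace the blocks are shared */
	if (iomap->flags & IOMAP_F_SHARED)
		fiemap_flags |= FIEMAP_EXTENT_SHARED;

	/* write path: COW needs RMW for sub-block writes, shared or not */
	if (iomap->type == IOMAP_COW)
		ret = do_read_modify_write(inode, pos, len, iomap);

Conflating them would lose information in both directions.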
> Dave/Darrick, what are your thoughts?
I liked where the first 3 patches of this series were heading. :)
--D