On Mon, May 12, 2025 at 02:51:45PM -0500, John Groves wrote:
> On 25/05/06 06:56PM, Miklos Szeredi wrote:
> > On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djw...@kernel.org> wrote:
> >
> > > <nod> I don't know what Miklos' opinion is about having multiple
> > > fusecmds that do similar things -- on the one hand keeping yours and my
> > > efforts separate explodes the amount of userspace abi that everyone must
> > > maintain, but on the other hand it then doesn't couple our projects
> > > together, which might be a good thing if it turns out that our domain
> > > models are /really/ actually quite different.
> >
> > Sharing the interface at least would definitely be worthwhile, as
> > there does not seem to be a great deal of difference between the
> > generic one and the famfs specific one. Only implementing part of the
> > functionality that the generic one provides would be fine.
>
> Agreed. I'm coming around to thinking the most practical approach would be
> to share the GET_FMAP message/response, but to add a separate response
> format for Darrick's use case - when the time comes. In this patch set,
> that starts with 'struct fuse_famfs_fmap_header' and is followed by the
> appropriate extent structures, serialized in the message. Collectively
> that's an fmap in message format.
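
As a rough illustration of the fmap-in-message layout described above (a
header followed by serialized extent structures), here is a minimal sketch
in C. Every struct, field, and sketch_ helper below is hypothetical and is
not taken from the famfs patch set; the real 'struct fuse_famfs_fmap_header'
may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts for illustration only; NOT the patch set's structs. */
struct sketch_fmap_header {
	uint32_t ext_type;	/* which extent format follows the header */
	uint32_t nextents;	/* number of extent records after the header */
	uint64_t file_size;	/* size of the file this fmap describes */
};

struct sketch_simple_ext {
	uint64_t dev_offset;	/* start of the extent on the backing device */
	uint64_t len;		/* length of the extent in bytes */
};

/* Serialize header + extents; returns bytes written, 0 if buf is too small. */
static size_t sketch_fmap_pack(void *buf, size_t buflen, uint64_t file_size,
			       const struct sketch_simple_ext *ext, uint32_t n)
{
	struct sketch_fmap_header hdr = {
		.ext_type = 1, .nextents = n, .file_size = file_size,
	};
	size_t need = sizeof(hdr) + (size_t)n * sizeof(*ext);

	assert(buf != NULL);
	if (need > buflen)
		return 0;
	memcpy(buf, &hdr, sizeof(hdr));
	if (n > 0)
		memcpy((char *)buf + sizeof(hdr), ext, (size_t)n * sizeof(*ext));
	return need;
}

/* Parse a message; returns 0 on success, -1 if it is short or oversized. */
static int sketch_fmap_unpack(const void *buf, size_t len,
			      struct sketch_fmap_header *hdr,
			      struct sketch_simple_ext *ext, uint32_t max_ext)
{
	assert(buf != NULL && hdr != NULL && ext != NULL);
	if (len < sizeof(*hdr))
		return -1;
	memcpy(hdr, buf, sizeof(*hdr));
	if (hdr->nextents > max_ext)
		return -1;	/* caller's extent array is too small */
	if (len < sizeof(*hdr) + (size_t)hdr->nextents * sizeof(*ext))
		return -1;	/* message truncated */
	if (hdr->nextents > 0)
		memcpy(ext, (const char *)buf + sizeof(*hdr),
		       (size_t)hdr->nextents * sizeof(*ext));
	return 0;
}
```

The receiver checks the extent count against the message length before
copying anything, which is the part that matters once the message becomes
variable-sized.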

Well in that case I might as well just plumb in the pieces I need as
separate fuse commands. fuse_args::opcode is u32, there's plenty of
space left.

> Side note: the current patch set sends back the logically-variable-sized
> fmap in a fixed-size message, but V2 of the series will address that;
> I got some help from Bernd there, but haven't finished it yet.
>
> So the next version of the patch set would, say, add a more generic first
> 'struct fmap_header' that would indicate whether the next item would be
> 'struct fuse_famfs_fmap_header' (i.e. my/famfs metadata) or some other
> to-be-codified metadata format. I'm going here because I'm dubious that
> we even *can* do grand-unified-fmap-metadata (or that we should try).
>
> This will require versioning the affected structures, unless we think
> the fmap-in-message structure can be opaque to the rest of fuse. @miklos,
> is there an example to follow regarding struct versioning in
> already-existing fuse structures?

/me is a n00b, but isn't that a simple matter of making sure that new
revisions change the structure size, and then you can key off of that?

> > > (Especially because I suspect that interleaving is the norm for memory,
> > > whereas we try to avoid that for disk filesystems.)
> >
> > So interleaved extents are just like normal ones except they repeat,
> > right? What about adding a special "repeat last N extent
> > descriptions" type of extent?
>
> It's a bit more than that. The comment at [1] makes it possible to understand
> the scheme, but I'd be happy to talk through it with you on a call if that
> seems helpful.
>
> An interleaved extent stripes data spread across N memory devices in RAID 0
> format; the space from each device is described by a single simple extent
> (so it's contiguous), but it's not consumed contiguously - it's consumed in
> fixed-sized chunks that precess across the devices.
> Notwithstanding that I couldn't explain it very well when we talked about
> it at LPC, I think I could make it pretty clear in a pretty brief call now.
>
> In any case, you have my word that it's actually quite elegant :D
> (seriously, but also with a smile...)

Admittedly the more I think about the interleaving in famfs vs straight
block mappings for disk filesystems, the more I think they ought to be
separate interfaces for code that solves different problems. Then both
our codebases will remain relatively cohesive.

> > > > But the current implementation does not contemplate partially cached
> > > > fmaps.
> > > >
> > > > Adding notification could address revoking them post-haste (is that why
> > > > you're thinking about notifications? And if not can you elaborate on what
> > > > you're after there?).
> > >
> > > Yeah, invalidating the mapping cache at random places. If, say, you
> > > implement a clustered filesystem with iomap, the metadata server could
> > > inform the fuse server on the local node that a certain range of inode X
> > > has been written to, at which point you need to revoke any local leases,
> > > invalidate the pagecache, and invalidate the iomapping cache to force
> > > the client to requery the server.
> > >
> > > Or if your fuse server wants to implement its own weird operations (e.g.
> > > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> >
> > Wouldn't existing invalidation framework be sufficient?
> >
> > Thanks,
> > Miklos
>
> My current thinking is that Darrick's use case doesn't need GET_DAXDEV, but
> famfs does. I think Darrick's use case has one backing device, and that should
> be passed in at mount time. Correct me if you think that might be wrong.

Technically speaking iomap can operate on /any/ block or dax device as
long as you have a reference to them.
Once I get more of the plumbing sorted out I'll start thinking about how
to handle multi-device filesystems like XFS which can put file data on
more than 1 block device.

I was thinking that the fuse server could just send a REGISTER_DEVICE
notification to the fuse driver (I know, again with the notifications
:)), the kernel replies with a magic cookie, and that's what gets passed
in the {read,write,map}_dev field.

Right now I reconfigured fuse2fs to present itself as a "fuseblk" driver
so that at least we know that inode->i_sb->s_bdev is a valid pointer.
It turns out to be useful because the kernel sends FUSE_DESTROY commands
synchronously during unmount, which avoids the situation where umount
exits but the block device still can't be opened O_EXCL because the fuse
server program is still exiting. It may be useful for some day wiring
up some of the block device ops to fuse servers. Though I think it
might conflict with CONFIG_BLK_DEV_WRITE_MOUNTED=y.

I just barely got directio writes and pagecache read/write working
through iomap today, though I'm still getting used to the fuse inode
locking model and sorting through the bugs. :)

(I wonder how nasty it would be to pass fds to the fuse kernel driver
from fuseblk servers?)

> Famfs doesn't necessarily have just one backing dev, which means that famfs
> could pass in the *primary* backing dev at mount time, but it would still
> need GET_DAXDEV to get the rest. But if I just use GET_FMAP every time, I
> only need one way to do this.
>
> I'll add a few more responses to Darrick's reply...

Hehhe onto that message go I.

--D

> Thanks,
> John
>
> [1]
> https://github.com/cxl-micron-reskit/famfs-linux/blob/c57553c4ca91f0634f137285840ab25be8a87c30/fs/fuse/famfs_kfmap.h#L13
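
For anyone following along without reading [1]: the interleaved extent
described above (N strips, each a contiguous simple extent on one device,
consumed in fixed-size chunks that rotate across the devices) resolves a
file offset in plain RAID 0 fashion. The sketch below is only that
arithmetic under those stated assumptions; the struct and function names
are hypothetical and are not the famfs definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one contiguous strip of an interleaved extent. */
struct sketch_strip {
	uint32_t dev_index;	/* which backing device holds this strip */
	uint64_t dev_offset;	/* where the strip starts on that device */
	uint64_t len;		/* strip length in bytes */
};

/* Hypothetical: result of resolving a file offset. */
struct sketch_target {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/*
 * Map a file offset inside an interleaved extent to (device, offset).
 * Chunks are handed out round-robin: chunk 0 -> strip 0, chunk 1 ->
 * strip 1, ..., chunk N -> strip 0 again, one stripe further in.
 * Returns 0 on success, -1 if the offset falls beyond the strip.
 */
static int sketch_interleave_resolve(const struct sketch_strip *strips,
				     uint32_t nstrips, uint64_t chunk_size,
				     uint64_t file_off,
				     struct sketch_target *out)
{
	assert(strips != 0 && out != 0);
	assert(nstrips > 0 && chunk_size > 0);

	uint64_t chunk = file_off / chunk_size;		/* global chunk number */
	uint64_t within = file_off % chunk_size;	/* offset inside chunk */
	uint32_t strip = (uint32_t)(chunk % nstrips);	/* which device */
	uint64_t stripe = chunk / nstrips;		/* how many full rounds */
	uint64_t strip_off = stripe * chunk_size + within;

	if (strip_off >= strips[strip].len)
		return -1;

	out->dev_index = strips[strip].dev_index;
	out->dev_offset = strips[strip].dev_offset + strip_off;
	return 0;
}
```

With two strips and a 2 MiB chunk, file offsets 0 and 4 MiB both land on
the first strip (at strip offsets 0 and 2 MiB), while file offset 2 MiB
lands at the start of the second strip.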