On Thu, 2020-11-12 at 10:40 +0000, Luis Henriques wrote:
> Jeff Layton <jlay...@kernel.org> writes:
> 
> > On Wed, 2020-11-11 at 18:28 +0000, Luis Henriques wrote:
> > > Jeff Layton <jlay...@kernel.org> writes:
> > > 
> > > > On Wed, 2020-11-11 at 15:39 +0000, Luis Henriques wrote:
> > > > > When doing a rename across quota realms, there's a corner case that isn't
> > > > > handled correctly.  Here's a testcase:
> > > > > 
> > > > >   mkdir files limit
> > > > >   truncate files/file -s 10G
> > > > >   setfattr limit -n ceph.quota.max_bytes -v 1000000
> > > > >   mv files limit/
> > > > > 
> > > > > The above will succeed because ftruncate(2) won't result in an immediate
> > > > > notification of the MDSs with the new file size, and thus the quota realms
> > > > > stats won't be updated.
> > > > > 
> > > > > This patch forces a sync with the MDS every time there's an ATTR_SIZE that
> > > > > sets a new i_size, even if we have Fx caps.
> > > > > 
> > > > > Cc: sta...@vger.kernel.org
> > > > > Fixes: dffdcd71458e ("ceph: allow rename operation under different quota realms")
> > > > > URL: https://tracker.ceph.com/issues/36593
> > > > > Signed-off-by: Luis Henriques <lhenriq...@suse.de>
> > > > > ---
> > > > >  fs/ceph/inode.c | 11 ++---------
> > > > >  1 file changed, 2 insertions(+), 9 deletions(-)
> > > > > 
> > > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > > > index 526faf4778ce..30e3f240ac96 100644
> > > > > --- a/fs/ceph/inode.c
> > > > > +++ b/fs/ceph/inode.c
> > > > > @@ -2136,15 +2136,8 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr)
> > > > >       if (ia_valid & ATTR_SIZE) {
> > > > >               dout("setattr %p size %lld -> %lld\n", inode,
> > > > >                    inode->i_size, attr->ia_size);
> > > > > -             if ((issued & CEPH_CAP_FILE_EXCL) &&
> > > > > -                 attr->ia_size > inode->i_size) {
> > > > > -                     i_size_write(inode, attr->ia_size);
> > > > > -                     inode->i_blocks = calc_inode_blocks(attr->ia_size);
> > > > > -                     ci->i_reported_size = attr->ia_size;
> > > > > -                     dirtied |= CEPH_CAP_FILE_EXCL;
> > > > > -                     ia_valid |= ATTR_MTIME;
> > > > > -             } else if ((issued & CEPH_CAP_FILE_SHARED) == 0 ||
> > > > > -                        attr->ia_size != inode->i_size) {
> > > > > +             if ((issued & (CEPH_CAP_FILE_EXCL|CEPH_CAP_FILE_SHARED)) ||
> > > > > +                 (attr->ia_size != inode->i_size)) {
> > > > >                       req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
> > > > >                       req->r_args.setattr.old_size =
> > > > >                               cpu_to_le64(inode->i_size);
> > > > 
> > > > Hmm...this makes truncates more expensive when we have caps. I'd rather
> > > > not do that if we can help it.
> > > 
> > > Yeah, as I mentioned in the tracker, there's indeed a performance impact
> > > with this fix.  That's what made me add the RFC in the subject ;-)
> > > 
> > > > What about instead having the client mimic a fsync when there is a
> > > > rename across quota realms? If we can't tell that reliably then we could
> > > > also just do an effective fsync ahead of any cross-directory rename?
> > > 
> > > Ok, thanks for the suggestion.  That may actually work, although it will
> > > make the rename more expensive of course.  I'll test that tomorrow and
> > > eventually follow up with a patch.
> > > 
> > 
> > Patrick pointed out to me on IRC that since you're moving the parent
> > directory of the truncated file, flushing the caps on the directory
> > won't really help. You'd need to walk the entire subtree and try to
> > flush every dirty inode, or basically do a syncfs() prior to renaming
> > the directory across quotarealms.
> > 
> > I think we probably will need to revert the change to allow cross-
> > quotarealm renames of directories and make those return EXDEV again.
> > Anything else sounds like it's probably going to be too expensive.
> 
> Hmm... that sounds a bit drastic and it would make the kernel client
> behave differently from the fuse client -- from what I could understand
> the fuse client does the sync ATTR_SIZE and thus doesn't have this issue.
> 

True. I'll note that the fuse client is not exactly built for speed,
however.

> Obviously, I agree with you that the performance penalty is too high for
> such a common operation.  But maybe renames across quotarealms aren't that
> common and paying the penalty of doing a full ceph_flush_dirty_caps() is
> acceptable for such cases?
> 

I wouldn't even do that. If someone is renaming a directory across
quotarealms, just return EXDEV. Saying "sorry, you have to copy/unlink"
in this situation seems like it should be acceptable. Are you aware of
any specific use-cases where people are renaming large directories
across quotarealms?
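
For illustration, here is roughly what "copy/unlink" looks like for a
caller that gets EXDEV back from rename(2). This is only a minimal
userspace sketch under assumptions: the helper names copy_file() and
move_file() are hypothetical, it handles a single regular file rather
than a directory tree, it does not preserve mode/ownership/xattrs, and
it is not taken from mv(1) or from any patch in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the bytes of a regular file from src to dst.
 * dst must not exist yet. EINTR handling and metadata preservation are
 * left out to keep the sketch short.
 */
static int copy_file(const char *src, const char *dst)
{
	char buf[65536];
	ssize_t n;
	int ret = -1;
	int in, out;

	assert(src && dst);

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		char *p = buf;

		/* write(2) may be short, so loop until the chunk is out */
		while (n > 0) {
			ssize_t w = write(out, p, n);

			if (w <= 0)
				goto done;
			p += w;
			n -= w;
		}
	}
	/* n == 0 means clean EOF; make the data durable before unlinking src */
	if (n == 0 && fsync(out) == 0)
		ret = 0;
done:
	close(in);
	if (close(out) < 0)
		ret = -1;
	if (ret < 0)
		unlink(dst);	/* don't leave a partial copy behind */
	return ret;
}

/*
 * Hypothetical helper: try rename(2) first and only fall back to
 * copy + unlink when the kernel answers EXDEV.
 */
static int move_file(const char *src, const char *dst)
{
	assert(src && dst);

	if (rename(src, dst) == 0)
		return 0;
	if (errno != EXDEV)
		return -1;
	if (copy_file(src, dst) < 0)
		return -1;
	return unlink(src);
}
```

A caller would do something like move_file("files/file", "limit/file").
Moving a whole directory means repeating this for every entry, so the
cost lands in userspace, where the copy goes through the normal write
path.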
-- 
Jeff Layton <jlay...@kernel.org>
