Re: [RFC PATCH 6/6 6.6] libfs: fix infinite directory reads for offset dir
> On Nov 11, 2024, at 10:20 AM, yangerkun wrote:
>
> On 2024/11/11 22:39, Chuck Lever III wrote:
>>> On Nov 10, 2024, at 9:36 PM, Yu Kuai wrote:
>>>
>>> Hi,
>>>
>>> On 2024/11/11 8:52, c...@kernel.org wrote:
>>>> From: yangerkun
>>>>
>>>> [ Upstream commit 64a7ce76fb901bf9f9c36cf5d681328fc0fd4b5a ]
>>>>
>>>> After we switched tmpfs dir operations from simple_dir_operations to
>>>> simple_offset_dir_operations, every rename fills the dest dir's maple
>>>> tree (&SHMEM_I(inode)->dir_offsets->mt) with the new dentry at a free
>>>> key starting from octx->newx_offset, and then sets newx_offset to the
>>>> free key + 1. Combined with renames happening at the same time, this
>>>> leads to an infinite readdir, which fails generic/736 in xfstests
>>>> (details below):
>>>>
>>>> 1. create 5000 files (1 2 3 ...) under one dir
>>>> 2. call readdir (man 3 readdir) once, and get one entry
>>>> 3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
>>>> 4. loop 2~3 until readdir returns nothing or we loop too many
>>>>    times (tmpfs breaks the test with the second condition)
>>>>
>>>> We use the same logic as commit 9b378f6ad48cf ("btrfs: fix infinite
>>>> directory reads") to fix it: record last_index when we open the dir,
>>>> and do not emit entries whose index >= last_index. The file->private_data
>>>
>>> Please notice this requires that last_index never overflow; otherwise
>>> readdir will be messed up.
>>
>> It would help your cause if you could be more specific
>> than "messed up".
>>
>>>> now used in offset dir can be used directly for this, and we also
>>>> update last_index when we llseek the dir file.
>>>>
>>>> Fixes: a2e459555c5f ("shmem: stable directory offsets")
>>>> Signed-off-by: yangerkun
>>>> Link: https://lore.kernel.org/r/20240731043835.1828697-1-yanger...@huawei.com
>>>> Reviewed-by: Chuck Lever
>>>> [brauner: only update last_index after seek when offset is zero, as Jan
>>>>  suggested]
>>>> Signed-off-by: Christian Brauner
>>>> Link: https://nvd.nist.gov/vuln/detail/CVE-2024-46701
>>>> [ cel: adjusted to apply to origin/linux-6.6.y ]
>>>> Signed-off-by: Chuck Lever
>>>> ---
>>>>  fs/libfs.c | 37 +++++++++++++++++++++++++------------
>>>>  1 file changed, 25 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/libfs.c b/fs/libfs.c
>>>> index a87005c89534..b59ff0dfea1f 100644
>>>> --- a/fs/libfs.c
>>>> +++ b/fs/libfs.c
>>>> @@ -449,6 +449,14 @@ void simple_offset_destroy(struct offset_ctx *octx)
>>>>  	xa_destroy(&octx->xa);
>>>>  }
>>>>
>>>> +static int offset_dir_open(struct inode *inode, struct file *file)
>>>> +{
>>>> +	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
>>>> +
>>>> +	file->private_data = (void *)ctx->next_offset;
>>>> +	return 0;
>>>> +}
>>>
>>> Looks like xarray is still used.
>>
>> That's not going to change, as several folks have already
>> explained.
>>
>>> I'm in the cc list, so I assume you saw my set; I don't know why
>>> you're ignoring my concerns.
>>>
>>> 1) next_offset is 32-bit and can overflow on a long-running
>>>    machine.
>>> 2) Once next_offset overflows, readdir will skip the files whose
>>>    offset is bigger.
>
> I'm sorry, I'm a little busy these days, so I haven't responded to this
> series of emails.
>
>> In that case, that entry won't be visible via getdents(3)
>> until the directory is re-opened or the process does an
>> lseek(fd, 0, SEEK_SET).
>
> Yes.
>
>> That is the proper and expected behavior. I suspect you
>> will see exactly that behavior with ext4 and 32-bit
>> directory offsets, for example.
>
> Emm...
>
> For a case like this:
>
> 1. mkdir /tmp/dir and touch /tmp/dir/file1 /tmp/dir/file2
> 2. open /tmp/dir with fd1
> 3. readdir and get /tmp/dir/file1
> 4. rm /tmp/dir/file2
> 5. touch /tmp/dir/file2
> 6. loop 4~5 for 2^32 times
> 7. readdir /tmp/dir with fd1
>
> For tmpfs we may now see no /tmp/dir/file2, since the offset has
> overflowed; for ext4 it is OK... So we think this is a problem.

>> Does that not directly address your concern? Or
Re: [RFC PATCH 6/6 6.6] libfs: fix infinite directory reads for offset dir
> On Nov 10, 2024, at 9:36 PM, Yu Kuai wrote:
>
> Hi,
>
> On 2024/11/11 8:52, c...@kernel.org wrote:
>> From: yangerkun
>>
>> [ Upstream commit 64a7ce76fb901bf9f9c36cf5d681328fc0fd4b5a ]
>>
>> After we switched tmpfs dir operations from simple_dir_operations to
>> simple_offset_dir_operations, every rename fills the dest dir's maple
>> tree (&SHMEM_I(inode)->dir_offsets->mt) with the new dentry at a free
>> key starting from octx->newx_offset, and then sets newx_offset to the
>> free key + 1. Combined with renames happening at the same time, this
>> leads to an infinite readdir, which fails generic/736 in xfstests
>> (details below):
>>
>> 1. create 5000 files (1 2 3 ...) under one dir
>> 2. call readdir (man 3 readdir) once, and get one entry
>> 3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
>> 4. loop 2~3 until readdir returns nothing or we loop too many
>>    times (tmpfs breaks the test with the second condition)
>>
>> We use the same logic as commit 9b378f6ad48cf ("btrfs: fix infinite
>> directory reads") to fix it: record last_index when we open the dir,
>> and do not emit entries whose index >= last_index. The file->private_data
>
> Please notice this requires that last_index never overflow; otherwise
> readdir will be messed up.

It would help your cause if you could be more specific
than "messed up".

>> now used in offset dir can be used directly for this, and we also
>> update last_index when we llseek the dir file.
>>
>> Fixes: a2e459555c5f ("shmem: stable directory offsets")
>> Signed-off-by: yangerkun
>> Link: https://lore.kernel.org/r/20240731043835.1828697-1-yanger...@huawei.com
>> Reviewed-by: Chuck Lever
>> [brauner: only update last_index after seek when offset is zero, as Jan
>>  suggested]
>> Signed-off-by: Christian Brauner
>> Link: https://nvd.nist.gov/vuln/detail/CVE-2024-46701
>> [ cel: adjusted to apply to origin/linux-6.6.y ]
>> Signed-off-by: Chuck Lever
>> ---
>>  fs/libfs.c | 37 +++++++++++++++++++++++++------------
>>  1 file changed, 25 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/libfs.c b/fs/libfs.c
>> index a87005c89534..b59ff0dfea1f 100644
>> --- a/fs/libfs.c
>> +++ b/fs/libfs.c
>> @@ -449,6 +449,14 @@ void simple_offset_destroy(struct offset_ctx *octx)
>>  	xa_destroy(&octx->xa);
>>  }
>>
>> +static int offset_dir_open(struct inode *inode, struct file *file)
>> +{
>> +	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
>> +
>> +	file->private_data = (void *)ctx->next_offset;
>> +	return 0;
>> +}
>
> Looks like xarray is still used.

That's not going to change, as several folks have already
explained.

> I'm in the cc list, so I assume you saw my set; I don't know why
> you're ignoring my concerns.
>
> 1) next_offset is 32-bit and can overflow on a long-running
>    machine.
> 2) Once next_offset overflows, readdir will skip the files whose
>    offset is bigger.

In that case, that entry won't be visible via getdents(3)
until the directory is re-opened or the process does an
lseek(fd, 0, SEEK_SET).

That is the proper and expected behavior. I suspect you
will see exactly that behavior with ext4 and 32-bit
directory offsets, for example.

Does that not directly address your concern? Or do you mean
that Erkun's patch introduces a new issue?

If there is a problem here, please construct a reproducer
against this patch set and post it.

> Thanks,
> Kuai
>
>> +
>>  /**
>>   * offset_dir_llseek - Advance the read position of a directory descriptor
>>   * @file: an open directory whose position is to be updated
>> @@ -462,6 +470,9 @@ void simple_offset_destroy(struct offset_ctx *octx)
>>   */
>>  static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>>  {
>> +	struct inode *inode = file->f_inode;
>> +	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
>> +
>>  	switch (whence) {
>>  	case SEEK_CUR:
>>  		offset += file->f_pos;
>> @@ -475,8 +486,9 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>>  	}
>>
>>  	/* In this case, ->private_data is protected by f_pos_lock */
>> -	file->private_data = NULL;
>> -	return vfs_setpos(file, offset, U32_MAX);
>> +	if (!offset)
>> +		file->private_data = (void *)ctx->next_offset;
>> +	return vfs_setpos(file, offset, LONG_MAX);
>>  }
>>
>>  static struct dentry *offset_find_next(struct xa_state *xas)
>> @@ -505,7 +517,7 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>>  			inode->i_ino, fs_umode_to_dtype(inode->i_mode));
>>  }
>>
>> -static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>> +static void offset_iterate_dir(struct inode *inode, struct dir_context *ctx, long last_index)
>>  {
>>  	struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
>>  	XA_STATE(xas, &so_ctx->xa, ctx->pos);
>> @@ -514,17 +526,21 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>>  	while (true) {
>>  		dentry = offset_find_next(&xas);
>>  		if (!de
Re: [RFC PATCH 6/6 6.6] libfs: fix infinite directory reads for offset dir
> On Nov 11, 2024, at 10:43 PM, yangerkun wrote:
>
> On 2024/11/11 23:34, Chuck Lever III wrote:
>>> On Nov 11, 2024, at 10:20 AM, yangerkun wrote:
>>>
>>> On 2024/11/11 22:39, Chuck Lever III wrote:
>>>>> On Nov 10, 2024, at 9:36 PM, Yu Kuai wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2024/11/11 8:52, c...@kernel.org wrote:
>>>>>> From: yangerkun
>>>>>>
>>>>>> [ Upstream commit 64a7ce76fb901bf9f9c36cf5d681328fc0fd4b5a ]
>>>>>>
>>>>>> After we switched tmpfs dir operations from simple_dir_operations to
>>>>>> simple_offset_dir_operations, every rename fills the dest dir's maple
>>>>>> tree (&SHMEM_I(inode)->dir_offsets->mt) with the new dentry at a free
>>>>>> key starting from octx->newx_offset, and then sets newx_offset to the
>>>>>> free key + 1. Combined with renames happening at the same time, this
>>>>>> leads to an infinite readdir, which fails generic/736 in xfstests
>>>>>> (details below):
>>>>>>
>>>>>> 1. create 5000 files (1 2 3 ...) under one dir
>>>>>> 2. call readdir (man 3 readdir) once, and get one entry
>>>>>> 3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
>>>>>> 4. loop 2~3 until readdir returns nothing or we loop too many
>>>>>>    times (tmpfs breaks the test with the second condition)
>>>>>>
>>>>>> We use the same logic as commit 9b378f6ad48cf ("btrfs: fix infinite
>>>>>> directory reads") to fix it: record last_index when we open the dir,
>>>>>> and do not emit entries whose index >= last_index. The file->private_data
>>>>>
>>>>> Please notice this requires that last_index never overflow; otherwise
>>>>> readdir will be messed up.
>>>>
>>>> It would help your cause if you could be more specific
>>>> than "messed up".
>>>>
>>>>>> now used in offset dir can be used directly for this, and we also
>>>>>> update last_index when we llseek the dir file.
>>>>>>
>>>>>> Fixes: a2e459555c5f ("shmem: stable directory offsets")
>>>>>> Signed-off-by: yangerkun
>>>>>> Link: https://lore.kernel.org/r/20240731043835.1828697-1-yanger...@huawei.com
>>>>>> Reviewed-by: Chuck Lever
>>>>>> [brauner: only update last_index after seek when offset is zero, as Jan
>>>>>>  suggested]
>>>>>> Signed-off-by: Christian Brauner
>>>>>> Link: https://nvd.nist.gov/vuln/detail/CVE-2024-46701
>>>>>> [ cel: adjusted to apply to origin/linux-6.6.y ]
>>>>>> Signed-off-by: Chuck Lever
>>>>>> ---
>>>>>>  fs/libfs.c | 37 +++++++++++++++++++++++++------------
>>>>>>  1 file changed, 25 insertions(+), 12 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/libfs.c b/fs/libfs.c
>>>>>> index a87005c89534..b59ff0dfea1f 100644
>>>>>> --- a/fs/libfs.c
>>>>>> +++ b/fs/libfs.c
>>>>>> @@ -449,6 +449,14 @@ void simple_offset_destroy(struct offset_ctx *octx)
>>>>>>  	xa_destroy(&octx->xa);
>>>>>>  }
>>>>>>
>>>>>> +static int offset_dir_open(struct inode *inode, struct file *file)
>>>>>> +{
>>>>>> +	struct offset_ctx *ctx = inode->i_op->get_offset_ctx(inode);
>>>>>> +
>>>>>> +	file->private_data = (void *)ctx->next_offset;
>>>>>> +	return 0;
>>>>>> +}
>>>>>
>>>>> Looks like xarray is still used.
>>>>
>>>> That's not going to change, as several folks have already
>>>> explained.
>>>>
>>>>> I'm in the cc list, so I assume you saw my set; I don't know why
>>>>> you're ignoring my concerns.
>>>>>
>>>>> 1) next_offset is 32-bit and can overflow on a long-running
>>>>>    machine.
>>>>> 2) Once next_offset overflows, readdir will skip the files whose
>>>>>    offset is bigger.
>>>
>>> I'm sorry, I'm a little busy these days, so I haven't responded to this
>>> series of emails.
>>>
>>>> In that case, that entry won't be visible via getdents(3)
>>>> until the directory is re-opened or the process does an
>>>>
Re: [PATCH 6.6 00/28] fix CVE-2024-46701
> On Nov 8, 2024, at 8:30 PM, Yu Kuai wrote:
>
> Hi,
>
> On 2024/11/08 21:23, Chuck Lever III wrote:
>>> On Nov 7, 2024, at 8:19 PM, Yu Kuai wrote:
>>>
>>> Hi,
>>>
>>> On 2024/11/07 22:41, Chuck Lever wrote:
>>>> On Thu, Nov 07, 2024 at 08:57:23AM +0800, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>> On 2024/11/06 23:19, Chuck Lever III wrote:
>>>>>>> On Nov 6, 2024, at 1:16 AM, Greg KH wrote:
>>>>>>>
>>>>>>> On Thu, Oct 24, 2024 at 09:19:41PM +0800, Yu Kuai wrote:
>>>>>>>> From: Yu Kuai
>>>>>>>>
>>>>>>>> Fix patch is patch 27, relied patches are from:
>>>>>>
>>>>>> I assume patch 27 is:
>>>>>>
>>>>>>   libfs: fix infinite directory reads for offset dir
>>>>>>   https://lore.kernel.org/stable/20241024132225.2271667-12-yuku...@huaweicloud.com/
>>>>>>
>>>>>> I don't think the Maple tree patches are a hard
>>>>>> requirement for this fix. And note that libfs did
>>>>>> not use Maple tree originally because I was told
>>>>>> at that time that Maple tree was not yet mature.
>>>>>>
>>>>>> So, a better approach might be to fit the fix
>>>>>> onto linux-6.6.y while sticking with xarray.
>>>>>
>>>>> The painful part is that using xarray is not acceptable: the offset
>>>>> is just 32-bit, and if it overflows, readdir will read nothing. That's
>>>>> why maple_tree has to be used.
>>>>
>>>> A 32-bit range should be entirely adequate for this usage.
>>>>
>>>> - The offset allocator wraps when it reaches the maximum; it
>>>>   doesn't overflow unless there are actually billions of extant
>>>>   entries in the directory, which IMO is not likely.
>>>
>>> Yes, it's not likely, but it's possible, and not hard to trigger in a
>>> test.
>>
>> I question whether such a test reflects any real-world
>> workload.
>>
>> Besides, there are a number of other limits that will impact
>> the ability to create that many entries in one directory.
>> The number of inodes in one tmpfs instance is limited, for
>> instance.
>
>>> And please notice that the offset will increase for each new file,
>>> and a file can be removed while its offset stays the same.
>
> Did you see the above explanation? Files can be removed; you don't have
> to store that many files to trigger the offset overflow.
>
>>>> - The offset values are dense, so the directory can use all 2- or
>>>>   4-billion in the 32-bit integer range before wrapping.
>>>
>>> Some simple math: if a user creates and removes 1 file each second, it
>>> will take about 130 years to overflow. If a user creates and removes
>>> 1000 files each second, it will take about 1 month to overflow.
>
> The problem is that if next_offset overflows to 0, then after patch 27,
> offset_dir_open() will record the 0, and later offset_readdir() will
> return directly, while there can be many files.

Let me revisit this for a moment. The xa_alloc_cyclic() call in
simple_offset_add() has a range limit argument of 2 - U32_MAX. So I'm
not clear how an overflow (or, more precisely, the reuse of an offset
value) would result in a "0" offset being recorded. The range limit
prevents the use of 0 and 1.

A "0" offset value would be a bug, I agree, but I don't see how that
can happen.

>> The question is what happens when there are no more offset
>> values available. xa_alloc_cyclic should fail, and file
>> creation is supposed to fail at that point. If it doesn't,
>> that's a bug that is outside of the use of xarray or Maple.
>
> Can you show me the code where xa_alloc_cyclic should fail? At least
> according to the comments, it will return 1 if the allocation succeeded
> after wrapping.
>
>  * Context: Any context.  Takes and releases the xa_lock.  May sleep if
>  * the @gfp flags permit.
>  * Return: 0 if the allocation succeeded without wrapping.  1 if the
>  * allocation succeeded after wrapping, -ENOMEM if memory could not be
>  * allocated or -EBUSY if there are no free entries in @limit.
>  */
> static inline int xa_alloc_cyclic(struct xarray *xa, u32 *id, void *entry,
> 		struct xa_limit limit, u32 *next, gfp_t gfp)

I recall (dimly) that directory entry offset value re-use is acceptable
and preferred, so I think ignoring a "1" return value from
xa_alloc_cyclic() is OK. If there are no unused offset values available,
it will return -EBUSY, and file creation will fail.

Perhaps Christian or Al can chime in here on whether directory entry
offset value re-use is indeed expected to be acceptable.

Further, my understanding is that:

  https://lore.kernel.org/stable/20241024132225.2271667-12-yuku...@huaweicloud.com/

fixes a rename issue that results in an infinite loop, and that's the
(only) issue that underlies CVE-2024-46701.

You are suggesting that there are other overflow problems with the
xarray-based simple_offset implementation. If I can confirm them, then I
can get these fixed in v6.6. But so far, I'm not sure I completely
understand these other failure modes. Are you suggesting that the above
fix /introduces/ the 0 offset problem?

--
Chuck Lever
Re: [PATCH 6.6 00/28] fix CVE-2024-46701
> On Nov 6, 2024, at 1:16 AM, Greg KH wrote:
>
> On Thu, Oct 24, 2024 at 09:19:41PM +0800, Yu Kuai wrote:
>> From: Yu Kuai
>>
>> Fix patch is patch 27, relied patches are from:

I assume patch 27 is:

  libfs: fix infinite directory reads for offset dir
  https://lore.kernel.org/stable/20241024132225.2271667-12-yuku...@huaweicloud.com/

I don't think the Maple tree patches are a hard
requirement for this fix. And note that libfs did
not use Maple tree originally because I was told
at that time that Maple tree was not yet mature.

So, a better approach might be to fit the fix
onto linux-6.6.y while sticking with xarray.

This is the first I've heard of this CVE. It would help if the patch
authors got some notification when these are filed.

>> - patches from set [1] to add helpers to maple_tree; the last patch, to
>>   improve fork() performance, is not backported;
>
> So things slowed down?
>
>> - patches from set [2] to change maple_tree, and follow-up fixes;
>> - patches from set [3] to convert offset_ctx from xarray to maple_tree;
>>
>> Please notice that I'm not an expert in this area, and I'm afraid to
>> make manual changes. That's why patch 16 reverts the commit that is
>> different from mainline and would cause conflicts backporting new
>> patches; patch 28 picks the original mainline patch again.
>>
>> (And this is what we did to fix the CVE in downstream kernels.)
>>
>> [1] https://lore.kernel.org/all/20231027033845.90608-1-zhangpeng...@bytedance.com/
>> [2] https://lore.kernel.org/all/20231101171629.3612299-2-liam.howl...@oracle.com/T/
>> [3] https://lore.kernel.org/all/170820083431.6328.16233178852085891453.st...@91.116.238.104.host.secureserver.net/
>
> This series looks rough. I want to have the maintainers of these
> files/subsystems ack this before being able to take them.
>
> thanks,
>
> greg k-h

--
Chuck Lever
Re: [PATCH 6.6 00/28] fix CVE-2024-46701
> On Nov 7, 2024, at 8:19 PM, Yu Kuai wrote:
>
> Hi,
>
> On 2024/11/07 22:41, Chuck Lever wrote:
>> On Thu, Nov 07, 2024 at 08:57:23AM +0800, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2024/11/06 23:19, Chuck Lever III wrote:
>>>>> On Nov 6, 2024, at 1:16 AM, Greg KH wrote:
>>>>>
>>>>> On Thu, Oct 24, 2024 at 09:19:41PM +0800, Yu Kuai wrote:
>>>>>> From: Yu Kuai
>>>>>>
>>>>>> Fix patch is patch 27, relied patches are from:
>>>>
>>>> I assume patch 27 is:
>>>>
>>>>   libfs: fix infinite directory reads for offset dir
>>>>   https://lore.kernel.org/stable/20241024132225.2271667-12-yuku...@huaweicloud.com/
>>>>
>>>> I don't think the Maple tree patches are a hard
>>>> requirement for this fix. And note that libfs did
>>>> not use Maple tree originally because I was told
>>>> at that time that Maple tree was not yet mature.
>>>>
>>>> So, a better approach might be to fit the fix
>>>> onto linux-6.6.y while sticking with xarray.
>>>
>>> The painful part is that using xarray is not acceptable: the offset
>>> is just 32-bit, and if it overflows, readdir will read nothing. That's
>>> why maple_tree has to be used.
>>
>> A 32-bit range should be entirely adequate for this usage.
>>
>> - The offset allocator wraps when it reaches the maximum; it
>>   doesn't overflow unless there are actually billions of extant
>>   entries in the directory, which IMO is not likely.
>
> Yes, it's not likely, but it's possible, and not hard to trigger in a
> test.

I question whether such a test reflects any real-world
workload.

Besides, there are a number of other limits that will impact
the ability to create that many entries in one directory.
The number of inodes in one tmpfs instance is limited, for
instance.

> And please notice that the offset will increase for each new file,
> and a file can be removed while its offset stays the same.

>> - The offset values are dense, so the directory can use all 2- or
>>   4-billion in the 32-bit integer range before wrapping.
>
> Some simple math: if a user creates and removes 1 file each second, it
> will take about 130 years to overflow. If a user creates and removes
> 1000 files each second, it will take about 1 month to overflow.

The question is what happens when there are no more offset
values available. xa_alloc_cyclic should fail, and file
creation is supposed to fail at that point. If it doesn't,
that's a bug that is outside of the use of xarray or Maple.

> maple_tree uses a 64-bit value for the offset, which is practically
> impossible to overflow in our lifetimes.

>> - No-one complained about this limitation when offset_readdir() was
>>   first merged. The xarray was replaced for performance reasons,
>>   not because of the 32-bit range limit.
>>
>> It is always possible that I have misunderstood your concern!
>
> The problem is that if next_offset overflows to 0, then after patch 27,
> offset_dir_open() will record the 0, and later offset_readdir() will
> return directly, while there can be many files.

That's a separate bug that has nothing to do with the maximum number
of entries one directory can have. Again, you don't need Maple tree to
address that.

My understanding from Liam is that backporting Maple into v6.6 is just
not practical to do. We must explore alternate ways to address these
concerns.

--
Chuck Lever