Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

2016-08-09 Thread Damien Le Moal
The code will not split and align calls on the zones. But upper layers
(an FS or a device mapper) can still do all this by themselves if they want to
and can support non-constant zone sizes.

The only exception is drives like the Seagate one with only the last zone of a
different size. This case is handled exactly as if all zones were the same size,
simply because any operation on the last, smaller zone will naturally align: the
checks of the operation size against the drive capacity will do the right thing.

The ioctls work for all cases (drives with a constant zone size or not). This is
again to allow supporting potential weird drives at the application level. I
integrated all these ioctls into the libzbc block device backend driver and
everything is fine. I can't tell the difference with direct-to-drive SG_IO
accesses. But unlike these, the zone ioctls keep the zone information RB-tree
cache up to date.
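
For illustration, here is a minimal user-space sketch of a zone report through
such an ioctl. It assumes the BLKREPORTZONE interface and <linux/blkzoned.h>
definitions that were eventually merged upstream (the ioctls discussed here
slightly predate them); the device path and zone count are placeholders and
error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/sdX", O_RDONLY);	/* hypothetical zoned device */
	struct blk_zone_report *rep;

	/* Report header plus room for 16 zone descriptors. */
	rep = calloc(1, sizeof(*rep) + 16 * sizeof(struct blk_zone));
	if (fd < 0 || !rep)
		return 1;
	rep->sector = 0;	/* start reporting from the first zone */
	rep->nr_zones = 16;
	if (!ioctl(fd, BLKREPORTZONE, rep))
		printf("zone 0: start %llu, wp %llu\n",
		       (unsigned long long)rep->zones[0].start,
		       (unsigned long long)rep->zones[0].wp);
	free(rep);
	return 0;
}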

> 
> I will be updating my patchset accordingly.

I need to clean up my code and rebase on top of 4.8-rc1. Let me do this and I
will send everything for review. If you have any comment on the above, please
let me know and I will be happy to incorporate changes.

Best regards.



Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
damien.lem...@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 




Re: [PATCH v8 1/2 RESEND] Add bio/request flags to issue ZBC/ZAC commands

2016-08-25 Thread Damien Le Moal


Shaun,

On 8/25/16 05:24, Shaun Tancheff wrote:

(RESENDING to include f2fs, fs-devel and dm-devel)

Add op flags to access zone information as well as to open, close
and reset zones:
  - REQ_OP_ZONE_REPORT - Query zone information (Report zones)
  - REQ_OP_ZONE_OPEN - Explicitly open a zone for writing
  - REQ_OP_ZONE_CLOSE - Explicitly close a zone
  - REQ_OP_ZONE_FINISH - Explicitly finish a zone
  - REQ_OP_ZONE_RESET - Reset Write Pointer to start of zone

These op flags can be used to create bio's to control zoned devices
through the block layer.


I still have a hard time seeing the need for the REQ_OP_ZONE_REPORT
operation, assuming that the device queue will hold a zone information
cache (Hannes' RB-tree or your array type, whichever).


Let's try to think simply here: if the disk user (an FS, a device
mapper, or an application doing raw disk accesses) wants to access the
disk zone information, why would it need to issue a BIO when calling
blkdev_lookup_zone would give exactly that information straight out of
memory (so much faster)? I thought hard about this, but cannot think of
any value in the BIO-to-disk option. It seems to me to be equivalent to
systematically doing a page cache read even when the page cache tells us
that the page is up to date...
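
As a sketch of the in-memory lookup argued for above (blkdev_lookup_zone()
comes from Hannes' zone cache patches; its exact signature here is an
assumption, not a merged API):

/* Sketch only: look up the cached descriptor of the zone containing
 * a given sector. No command ever reaches the device. */
struct blk_zone *zone;

zone = blkdev_lookup_zone(bdev, sector);
if (zone)
	pr_info("zone start %llu, wp %llu\n",
		(unsigned long long)zone->start,
		(unsigned long long)zone->wp);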


Moreover, issuing a report zones command to the disk may return information
that is in fact incorrect, as it would not take into account any write
requests that were dispatched but not yet processed by the disk (some zone
write pointers may be reported with a value lower than what the zone cache
maintains).


Dealing with (and fixing) these inconsistencies would force an update of the
report zones result using the information in the zone cache, which in
itself sounds like a good justification for not doing a report zones in
the first place.


I am fine with the other operations, and in fact having a BIO interface
for them to send down to the SCSI layer is better than any other method.
It will cause them to be seen in sd_init_command, which is the path
taken for read and write commands too. So all zone cache information
checking and updating can be done in that single place and serialized
with a spinlock. Maintenance of the zone cache information becomes very
easy.
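
For reference, here is a minimal sketch of issuing one of these operations as
a BIO, using the op codes from the patch above and the 4.8-era bio API (bdev
and zone_sector are placeholders):

/* Sketch: reset the write pointer of the zone starting at zone_sector. */
struct bio *bio = bio_alloc(GFP_KERNEL, 0);

bio->bi_bdev = bdev;
bio->bi_iter.bi_sector = zone_sector;
bio_set_op_attrs(bio, REQ_OP_ZONE_RESET, 0);
ret = submit_bio_wait(bio);
bio_put(bio);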


Any divergence of the zone cache information from the actual state of
the disk will likely cause an error (unaligned write or other). Having a
specific report zone BIO option will not help the disk user recover from
those. Hannes' current implementation makes sure that the information of
the zone for the failed request is automatically updated. That is enough
to keep the zone cache information up to date, and a zone's information
can be marked as "in update" for the user to notice and wait for the
refreshed information.


The ioctl code for reporting zones does not need the specific request op 
code either. Again, blkdev_lookup_zone can provide zone information, and 
an emulation of the reporting options filtering is also trivial to 
implement on top of that, if really required (I do not think that is 
strongly needed though).


Without the report zone operation, your patch set size would shrink
significantly, and merging with Hannes' work becomes very easy too.


Please let me know what you think. If we drop this, we can get a clean 
and full ZBC support patch set ready in no time at all.


Best regards.

--
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
damien.lem...@hgst.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.hgst.com


Re: [PATCH] uapi: linux/blkzoned.h: fix BLKGETZONESZ and BLKGETNRZONES definitions

2018-12-16 Thread Damien Le Moal
On 2018/12/16 10:50, Dmitry V. Levin wrote:
> According to the documentation in include/uapi/asm-generic/ioctl.h,
> _IOW means userspace is writing and kernel is reading, and
> _IOR means userspace is reading and kernel is writing.
> 
> In case of these two ioctls, kernel is writing and userspace is reading,
> so they have to be _IOR instead of _IOW.
> 
> Fixes: 72cd87576d1d8 ("block: Introduce BLKGETZONESZ ioctl")
> Fixes: 65e4e3eee83d7 ("block: Introduce BLKGETNRZONES ioctl")
> Signed-off-by: Dmitry V. Levin 
> ---
> 
> Since both ioctls were introduced after 4.19,
> please make sure they are fixed in 4.20.
> Thanks.
> 
>  include/uapi/linux/blkzoned.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index 8f08ff9bdea0..6fa38d001d84 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -141,7 +141,7 @@ struct blk_zone_range {
>   */
>  #define BLKREPORTZONE_IOWR(0x12, 130, struct blk_zone_report)
>  #define BLKRESETZONE _IOW(0x12, 131, struct blk_zone_range)
> -#define BLKGETZONESZ _IOW(0x12, 132, __u32)
> -#define BLKGETNRZONES_IOW(0x12, 133, __u32)
> +#define BLKGETZONESZ _IOR(0x12, 132, __u32)
> +#define BLKGETNRZONES    _IOR(0x12, 133, __u32)
>  
>  #endif /* _UAPI_BLKZONED_H */
> 

Indeed, my bad.

Reviewed-by: Damien Le Moal 
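
For reference, the direction matters because the kernel writes a __u32 that
user space reads. A minimal sketch of the fixed ioctls in use (the device path
is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDONLY);	/* hypothetical device */
	__u32 zone_sectors = 0, nr_zones = 0;

	/* The kernel fills in the values, user space reads them: hence _IOR. */
	if (fd >= 0 && !ioctl(fd, BLKGETZONESZ, &zone_sectors) &&
	    !ioctl(fd, BLKGETNRZONES, &nr_zones))
		printf("%u zones of %u sectors\n", nr_zones, zone_sectors);
	return 0;
}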

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 09/19] zonefs: remove duplicate cleanup in zonefs_fill_super

2023-09-13 Thread Damien Le Moal
On 9/13/23 20:10, Christoph Hellwig wrote:
> When ->fill_super fails, ->kill_sb is called which already cleans up
> the inodes and zgroups.
> 
> Drop the extra cleanup code in zonefs_fill_super.
> 
> Signed-off-by: Christoph Hellwig 

Looks good to me.

Acked-by: Damien Le Moal 

> ---
>  fs/zonefs/super.c | 21 +
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 9d1a9808fbbba6..35b2554ce2ac2e 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -1309,13 +1309,12 @@ static int zonefs_fill_super(struct super_block *sb, 
> void *data, int silent)
>   /* Initialize the zone groups */
>   ret = zonefs_init_zgroups(sb);
>   if (ret)
> - goto cleanup;
> + return ret;
>  
>   /* Create the root directory inode */
> - ret = -ENOMEM;
>   inode = new_inode(sb);
>   if (!inode)
> - goto cleanup;
> + return -ENOMEM;
>  
>   inode->i_ino = bdev_nr_zones(sb->s_bdev);
>   inode->i_mode = S_IFDIR | 0555;
> @@ -1333,7 +1332,7 @@ static int zonefs_fill_super(struct super_block *sb, 
> void *data, int silent)
>  
>   sb->s_root = d_make_root(inode);
>   if (!sb->s_root)
> - goto cleanup;
> + return -ENOMEM;
>  
>   /*
>* Take a reference on the zone groups directory inodes
> @@ -1341,19 +1340,9 @@ static int zonefs_fill_super(struct super_block *sb, 
> void *data, int silent)
>*/
>   ret = zonefs_get_zgroup_inodes(sb);
>   if (ret)
> - goto cleanup;
> -
> - ret = zonefs_sysfs_register(sb);
> - if (ret)
> - goto cleanup;
> -
> - return 0;
> -
> -cleanup:
> - zonefs_release_zgroup_inodes(sb);
> - zonefs_free_zgroups(sb);
> + return ret;
>  
> - return ret;
> + return zonefs_sysfs_register(sb);
>  }
>  
>  static struct dentry *zonefs_mount(struct file_system_type *fs_type,

-- 
Damien Le Moal
Western Digital Research



[PATCH v2 2/2] riscv: Disable text-data gap in flat binaries

2021-04-07 Thread Damien Le Moal
The uclibc/gcc combination with the elf2flt riscv linker file fully resolves
the PC-relative __global_pointer$ value at compile time and does not generate
a relocation entry to set a runtime gp value. As a result, if the
flatbin loader introduces a gap between the text and data sections, the
gp value becomes incorrect and prevents correct execution of a flatbin
executable.

Avoid this problem by enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
automatically when CONFIG_RISCV is enabled and CONFIG_MMU disabled.

Signed-off-by: Damien Le Moal 
---
 arch/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 0d0cf67359cb..6a85fbbd056e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,7 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select BINFMT_FLAT_NO_TEXT_DATA_GAP if !MMU
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK
-- 
2.30.2



[PATCH v2 1/2] binfmt_flat: allow not offsetting data start

2021-04-07 Thread Damien Le Moal
Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap
between the text and data sections (e.g. riscv nommu), flat binary
programs cannot be executed.

To allow an architecture to request contiguous text and data sections,
introduce the config option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP.
Using this new option, the macro DATA_GAP_WORDS is conditionally
defined in binfmt_flat.c to MAX_SHARED_LIBS for architectures
tolerating the text-to-data gap (CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
disabled case) and to 0 when CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP is
enabled. DATA_GAP_WORDS is used in load_flat_file() to calculate the
data section length and start position.

An architecture enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP also
prevents the use of the separate text/data load case (when the flat file
header flags FLAT_FLAG_RAM and FLAT_FLAG_GZIP are not set with NOMMU
kernels) and forces the use of a single RAM region for loading
(equivalent to FLAT_FLAG_RAM being set).

Signed-off-by: Damien Le Moal 
---
 fs/Kconfig.binfmt |  3 +++
 fs/binfmt_flat.c  | 21 +++--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index c6f1c8c1934e..c6df931d5d45 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
 config BINFMT_FLAT_OLD_ALWAYS_RAM
bool
 
+config BINFMT_FLAT_NO_TEXT_DATA_GAP
+   bool
+
 config BINFMT_FLAT_OLD
bool "Enable support for very old legacy flat binaries"
depends on BINFMT_FLAT
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..2be29bb964b8 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
 #defineMAX_SHARED_LIBS (1)
 #endif
 
+#ifdef CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
+#define DATA_GAP_WORDS (0)
+#else
+#define DATA_GAP_WORDS (MAX_SHARED_LIBS)
+#endif
+
 struct lib_info {
struct {
unsigned long start_code;   /* Start of text 
segment */
@@ -559,7 +565,10 @@ static int load_flat_file(struct linux_binprm *bprm,
 * case,  and then the fully copied to RAM case which lumps
 * it all together.
 */
-   if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
(FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+   if (!IS_ENABLED(CONFIG_MMU) &&
+   !IS_ENABLED(CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP) &&
+   !(flags & (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+
/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +585,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
 
-   len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned 
long);
+   len = data_len + extra + DATA_GAP_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +600,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_GAP_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
 
pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
@@ -622,7 +631,7 @@ static int load_flat_file(struct linux_binprm *bprm,
memp_size = len;
} else {
 
-   len = text_len + data_len + extra + MAX_SHARED_LIBS * 
sizeof(u32);
+   len = text_len + data_len + extra + DATA_GAP_WORDS * 
sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 
realdatastart = textpos + ntohl(hdr->data_start);
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_GAP_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
 
reloc = (__be32 __user *)
@@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);
vm_munmap

[PATCH v2 0/2] Fix binfmt_flat loader for RISC-V

2021-04-07 Thread Damien Le Moal
RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
data sections as the toolchain fully resolves the PC-relative global pointer
(the __global_pointer$ value loaded in the gp register) at compile time.
Without a relocation entry provided, the flat binary loader cannot fix the
value if a gap is introduced, and executables fail to run.
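
As a toy illustration of the failure mode (made-up numbers, not actual loader
code): gp is fixed relative to the data section at link time, so a
loader-introduced gap shifts the data without shifting gp.

/* Toy numbers only: why a text/data gap breaks a link-time-resolved gp. */
unsigned long text_start = 0x80000000UL;	/* flat text load address */
unsigned long text_len   = 0x1000UL;
unsigned long loader_gap = 32 * sizeof(long);	/* gap inserted at load time */

/* elf2flt resolved gp assuming data immediately follows text: */
unsigned long gp         = text_start + text_len + 0x800UL;
/* ...but at run time the data actually starts here: */
unsigned long data_start = text_start + text_len + loader_gap;

/* Every gp-relative data access is now off by loader_gap bytes. */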

This series fixes this problem by allowing an architecture to request
the flat loader to suppress the gap between the text and data sections.
The first patch fixes binfmt_flat load_flat_file() using the new
configuration option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP. The second
patch enables this option for RISCV NOMMU builds.

These patches do not change the binfmt_flat loader behavior for other
architectures.

Changes from v1:
* Replace FLAT_TEXT_DATA_NO_GAP macro with
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
* Remove the addition of riscv/include/asm/flat.h and set
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU

Damien Le Moal (2):
  binfmt_flat: allow not offsetting data start
  riscv: Disable text-data gap in flat binaries

 arch/riscv/Kconfig |  1 +
 fs/Kconfig.binfmt  |  3 +++
 fs/binfmt_flat.c   | 21 +++--
 3 files changed, 19 insertions(+), 6 deletions(-)

-- 
2.30.2



Re: [PATCH] soc: canaan: Sort the Makefile alphabetically

2021-02-22 Thread Damien Le Moal
On 2021/02/23 11:19, Palmer Dabbelt wrote:
> From: Palmer Dabbelt 
> 
> The rest of these are alphabetically sorted, and leaving it this way
> causes a merge conflict.
> 
> Signed-off-by: Palmer Dabbelt 
> 
> ---
> 
> I missed this when reviewing these patches, but happened across it when
> test merging from Linus' tree.  It goes back a way so I'm hesitant to
> rebase this one out just for cleanliness, but if I have to go back that
> far before sending the merge window PR I'll squash it in.
> ---
>  drivers/soc/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/soc/Makefile b/drivers/soc/Makefile
> index fa7071246546..34b23645be14 100644
> --- a/drivers/soc/Makefile
> +++ b/drivers/soc/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_ARCH_ACTIONS)+= actions/
>  obj-y+= aspeed/
>  obj-$(CONFIG_ARCH_AT91)  += atmel/
>  obj-y+= bcm/
> +obj-$(CONFIG_SOC_CANAAN) += canaan/
>  obj-$(CONFIG_ARCH_DOVE)  += dove/
>  obj-$(CONFIG_MACH_DOVE)  += dove/
>  obj-y+= fsl/
> @@ -29,4 +30,3 @@ obj-$(CONFIG_ARCH_U8500)+= ux500/
>  obj-$(CONFIG_PLAT_VERSATILE) += versatile/
>  obj-y+= xilinx/
>  obj-$(CONFIG_ARCH_ZX)    += zte/
> -obj-$(CONFIG_SOC_CANAAN) += canaan/
> 

Yes. Should have sent that... Thanks.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-30 Thread Damien Le Moal
>>>>>>>>>>>> + __u32   flags;
>>>>>>>>>>>> + };
>>>>>>>>>>>> + __s64   res64;  /* appending offset for zone append */
>>>>>>>>>>>> + };
>>>>>>>>>>>>  };
>>>>>>>>>>>
>>>>>>>>>>> Is this a compatible change, both for now but also going forward?
>>>>>>>>>>> You could randomly have IORING_CQE_F_BUFFER set, or any other
>>>>>>>>>>> future flags.
>>>>>>>>>>
>>>>>>>>>> Sorry, I didn't quite understand the concern. CQE_F_BUFFER is not
>>>>>>>>>> used/set for write currently, so it looked compatible at this point.
>>>>>>>>>
>>>>>>>>> Not worried about that, since we won't ever use that for writes. But
>>>>>>>>> it is a potential headache down the line for other flags, if they
>>>>>>>>> apply to normal writes.
>>>>>>>>>
>>>>>>>>>> Yes, no room for future flags for this operation.
>>>>>>>>>> Do you see any other way to enable this support in io-uring?
>>>>>>>>>
>>>>>>>>> Honestly I think the only viable option is as we discussed previously,
>>>>>>>>> pass in a pointer to a 64-bit type where we can copy the additional
>>>>>>>>> completion information to.
>>>>>>>>
>>>>>>>> TBH, I hate the idea of such overhead/latency at times when SSDs can
>>>>>>>> serve writes in less than 10ms. Any chance you measured how long does it
>>>>>>>
>>>>>>> 10us? :-)
>>>>>>
>>>>>> Hah, 10us indeed :)
>>>>>>
>>>>>>>
>>>>>>>> take to drag through task_work?
>>>>>>>
>>>>>>> A 64-bit value copy is really not a lot of overhead... But yes, we'd
>>>>>>> need to push the completion through task_work at that point, as we can't
>>>>>>> do it from the completion side. That's not a lot of overhead, and most
>>>>>>> notably, it's overhead that only affects this particular type.
>>>>>>>
>>>>>>> That's not a bad starting point, and something that can always be
>>>>>>> optimized later if need be. But I seriously doubt it'd be anything to
>>>>>>> worry about.
>>>>>>
>>>>>> I probably need to look myself how it's really scheduled, but if you 
>>>>>> don't
>>>>>> mind, here is a quick question: if we do work_add(task) when the task is
>>>>>> running in the userspace, wouldn't the work execution wait until the next
>>>>>> syscall/allotted time ends up?
>>>>>
>>>>> It'll get the task to enter the kernel, just like signal delivery. The
>>>>> only tricky part is really if we have a dependency waiting in the kernel,
>>>>> like the recent eventfd fix.
>>>>
>>>> I see, thanks for sorting this out!
>>>
>>> Few more doubts about this (please mark me wrong if that is the case):
>>>
>>> - Task-work makes me feel like N completions waiting to be served by
>>> single task.
>>> Currently completions keep arriving and CQEs would be updated with
>>> result, but the user-space (submitter task) would not be poked.
>>>
>>> - Completion-code will set the task-work. But post that it cannot go
>>> immediately to its regular business of picking cqe and updating
>>> res/flags, as we cannot afford user-space to see the cqe before the
>>> pointer update. So it seems completion-code needs to spawn another
>>> work which will allocate/update cqe after waiting for pointer-update
>>> from task-work?
>>
>> The task work would post the completion CQE for the request after
>> writing the offset.
> 
> Got it, thank you for making it simple.
> Overall if I try to put the tradeoffs of moving to indirect-offset
> (compared to the current scheme):
> 
> Upside:
> - cqe res/flags would be intact, avoids future-headaches as you mentioned
> - short-write cases do not have to be failed in lower-layers (as
> cqe->res is there to report bytes-copied)

I personally think it is a super bad idea to allow short asynchronous append
writes. The interface should allow the async zone append write to proceed only
if it can be stuffed entirely into a single BIO, which will necessarily be a
single request on the device side. Otherwise, the application would have no
guarantees as to where a split may happen, and since this is zone append, the
next async append will not leave any hole to complete a previous short write.
This will wreck the structure of the application data.

For the sync case, this is fine. The application can just issue a new append
write with the remaining unwritten data from the previous append write. But in
the async case, if one write == one data record (e.g. a key-value tuple for an
SSTable in an LSM tree), then allowing a short write will destroy the record:
the partial write will be garbage data that will need garbage collection...

> 
> Downside:
> - We may not be able to use RWF_APPEND, and need to expose a new
> type/flag (RWF_INDIRECT_OFFSET etc.) to user-space. Not sure if this
> sounds outrageous, but is it OK to have a uring-only flag which can be
> combined with RWF_APPEND?

Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently meaningless for
raw block device accesses. We could certainly define a meaning for these in the
context of zoned block devices.

I already commented on the need for first defining an interface (flags etc) and
its semantic (e.g. do we allow short zone append or not ? What happens for
regular files ? etc). Did you read my comment ? We really need to first agree on
something to clarify what needs to be done.


> -  Expensive compared to sending results in cqe itself. But I agree
> that this may not be major, and only for one type of write.
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread Damien Le Moal
On 2020/07/31 15:45, h...@infradead.org wrote:
> On Fri, Jul 31, 2020 at 06:42:10AM +0000, Damien Le Moal wrote:
>>> - We may not be able to use RWF_APPEND, and need to expose a new
>>> type/flag (RWF_INDIRECT_OFFSET etc.) to user-space. Not sure if this
>>> sounds outrageous, but is it OK to have a uring-only flag which can be
>>> combined with RWF_APPEND?
>>
>> Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently meaningless for
>> raw block device accesses. We could certainly define a meaning for these in
>> the context of zoned block devices.
> 
> We can't just add a meaning for O_APPEND on block devices now,
> as it was previously silently ignored.  I also really don't think any
> of these semantics even fit the block device to start with.  If you
> want to work on raw zones use zonefs, that's what it exists for.

Which is fine with me. Just trying to say that I think this is exactly the
discussion we need to start with. What interface do we implement...

If zone append is allowed only through zonefs as the raw block device
equivalent, all the O_APPEND/RWF_APPEND semantics are defined and the "return
written offset" implementation in VFS would be common to all file systems,
including regular ones. Besides that, there is I think the question of short
writes... Not sure if short writes can currently happen with async RWF_APPEND
writes to regular files. I think not, but that may depend on the FS.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread Damien Le Moal
On 2020/07/31 16:59, Kanchan Joshi wrote:
> On Fri, Jul 31, 2020 at 12:29 PM Damien Le Moal  wrote:
>>
>> On 2020/07/31 15:45, h...@infradead.org wrote:
>>> On Fri, Jul 31, 2020 at 06:42:10AM +, Damien Le Moal wrote:
>>>>> - We may not be able to use RWF_APPEND, and need to expose a new
>>>>> type/flag (RWF_INDIRECT_OFFSET etc.) to user-space. Not sure if this
>>>>> sounds outrageous, but is it OK to have a uring-only flag which can be
>>>>> combined with RWF_APPEND?
>>>>
>>>> Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently meaningless
>>>> for raw block device accesses. We could certainly define a meaning for
>>>> these in the context of zoned block devices.
>>>
>>> We can't just add a meaning for O_APPEND on block devices now,
>>> as it was previously silently ignored.  I also really don't think any
>>> of these semantics even fit the block device to start with.  If you
>>> want to work on raw zones use zonefs, that's what it exists for.
>>
>> Which is fine with me. Just trying to say that I think this is exactly the
>> discussion we need to start with. What interface do we implement...
>>
>> If zone append is allowed only through zonefs as the raw block device
>> equivalent, all the O_APPEND/RWF_APPEND semantics are defined and the "return
>> written offset" implementation in VFS would be common to all file systems,
>> including regular ones. Besides that, there is I think the question of short
>> writes... Not sure if short writes can currently happen with async RWF_APPEND
>> writes to regular files. I think not, but that may depend on the FS.
> 
> generic_write_check_limits (called by generic_write_checks, used by
> most FS) may make it short, and AFAIK it does not depend on
> async/sync.

Johannes has a patch (not posted yet) fixing all this for zonefs,
differentiating sync and async cases, allow short writes or not, etc. This was
done by not using generic_write_check_limits() and instead writing a
zonefs_check_write() function that is zone append friendly.

We can post that as a base for the discussion on semantic if you want...

> This was one of the reason why we chose to isolate the operation by a
> different IOCB flag and not by IOCB_APPEND alone.

For zonefs, the plan is:
* For the sync write case, zone append is always used.
* For the async write case, if we see IOCB_APPEND, then zone append BIOs are
used. If not, regular write BIOs are used.

Simple enough I think. No need for a new flag.
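
A minimal sketch of that dispatch rule in the zonefs write path (this reflects
the plan described above, not the actual zonefs code):

/* Sketch: pick the BIO operation for a write kiocb. */
if (is_sync_kiocb(iocb) || (iocb->ki_flags & IOCB_APPEND))
	op = REQ_OP_ZONE_APPEND;	/* append at the zone write pointer */
else
	op = REQ_OP_WRITE;		/* honor the offset given by the user */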

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread Damien Le Moal
On 2020/07/31 18:14, h...@infradead.org wrote:
> On Fri, Jul 31, 2020 at 08:14:22AM +0000, Damien Le Moal wrote:
>>
>>> This was one of the reason why we chose to isolate the operation by a
>>> different IOCB flag and not by IOCB_APPEND alone.
>>
>> For zonefs, the plan is:
>> * For the sync write case, zone append is always used.
>> * For the async write case, if we see IOCB_APPEND, then zone append BIOs are
>> used. If not, regular write BIOs are used.
>>
>> Simple enough I think. No need for a new flag.
> 
> Simple, but wrong.  Sync vs async really doesn't matter, even sync
> writes will have problems if there are other writers.  We need a flag
> for "report the actual offset for appending writes", and based on that
> flag we need to not allow short writes (or split extents for real
> file systems).  We also need a fcntl or ioctl to report this max atomic
> write size so that applications can rely on it.
> 

Sync writes are done under the inode lock, so there cannot be other writers at
the same time. And for the sync case, since the actual written offset is
necessarily equal to the file size before the write, there is no need to report
it (there is no system call that can report that anyway). For this sync case,
the only change that the use of zone append introduces compared to regular
writes is the potential for more short writes.

Adding a flag for "report the actual offset for appending writes" is fine with
me, but do you also mean to use this flag for driving zone append write vs
regular writes in zonefs ?

The fcntl or ioctl for getting the max atomic write size would be fine too.
Given that zonefs is very close to the underlying zoned drive, I was assuming
that the application can simply consult the device sysfs zone_append_max_bytes
queue attribute. For regular file systems, this value would be used internally
only. I do not really see how it can be useful to applications. Furthermore, the
file system may have a hard time giving that information to the application
depending on its underlying storage configuration (e.g. erasure
coding/declustered RAID).



-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 6/6] io_uring: add support for zone-append

2020-07-31 Thread Damien Le Moal
On 2020/07/31 18:41, h...@infradead.org wrote:
> On Fri, Jul 31, 2020 at 09:34:50AM +0000, Damien Le Moal wrote:
>> Sync writes are done under the inode lock, so there cannot be other writers
>> at the same time. And for the sync case, since the actual written offset is
>> necessarily equal to the file size before the write, there is no need to
>> report it (there is no system call that can report that anyway). For this
>> sync case, the only change that the use of zone append introduces compared
>> to regular writes is the potential for more short writes.
>>
>> Adding a flag for "report the actual offset for appending writes" is fine
>> with me, but do you also mean to use this flag for driving zone append write
>> vs regular writes in zonefs ?
> 
> Let's keep semantics and implementation separate.  For the case
> where we report the actual offset we need a size limitation and no
> short writes.

OK. So the name of the flag confused me. The flag name should reflect "Do zone
append and report written offset", right ?

Just to clarify, here was my thinking for zonefs:
1) file opened with O_APPEND / aio has RWF_APPEND: then it is OK to assume that
the application did not set the aio offset, since APPEND means offset == file
size. In that case, do a zone append and report back the written offset.
2) file opened without O_APPEND / aio does not have RWF_APPEND: the application
specified an aio offset and we must respect it, writing in that exact order,
so use regular writes.

For regular file systems, under the case (1) condition, the FS uses whatever it
wants for the implementation, and reports back the written offset, which will
always be the file size at the time the aio was issued.

Your method with a new flag to switch between (1) and (2) is OK with me, but the
"no short writes" may be difficult to achieve in a regular FS, no ? I do not
think current FSes have such guarantees... Especially in the case of buffered
async writes I think.

> Anything with those semantics can be implemented using Zone Append
> trivially in zonefs, and we don't even need the exclusive lock in that
> case.  But even without that flag anything that has an exclusive lock can
> at least in theory be implemented using Zone Append, it just need
> support for submitting another request from the I/O completion handler
> of the first.  I just don't think it is worth it - with the exclusive
> lock we do have access to the zone serialied so a normal write works
> just fine.  Both for the sync and async case.

We did switch to have zonefs do append writes in the sync case, always. Hmmm...
Not sure anymore it was such a good idea.

> 
>> The fcntl or ioctl for getting the max atomic write size would be fine too.
>> Given that zonefs is very close to the underlying zoned drive, I was assuming
>> that the application can simply consult the device sysfs
>> zone_append_max_bytes queue attribute.
> 
> For zonefs we can, yes.  But in many ways that is a lot more cumbersome
> than having an API that works on the fd you want to write on.

Got it. Makes sense.

>> For regular file systems, this value would be used internally
>> only. I do not really see how it can be useful to applications. Furthermore,
>> the file system may have a hard time giving that information to the application
>> depending on its underlying storage configuration (e.g. erasure
>> coding/declustered RAID).
> 
> File systems might have all kinds of limits of their own (e.g. extent
> sizes).  And a good API that just works everywhere and is properly
> documented is much better than heaps of cargo culted crap all over
> applications.

OK. Makes sense. That said, taking Naohiro's work on btrfs as an example, zone
append is used for every data write, no matter if it is O_APPEND/RWF_APPEND or
not. The size limitation for zone append writes is not needed at all by
applications. The maximum extent size is aligned to the max append write size
internally, and if the application issues a larger write, it loops over multiple
extents, similarly to what any regular write may do (if there is overwrite etc...).

For the regular FS case, my thinking on the semantic really was: if asked to do
so, return the written offset for RWF_APPEND aios. And I think that
implementing that does not depend in any way on what the FS does internally.

But I think I am starting to see the picture you are drawing here:
1) Introduce a fcntl() to get "maximum size for atomic append writes"
2) Introduce an aio flag specifying "Do atomic append write and report written
offset"
3) For an aio specifying "Do atomic append write and report written offset", if
the aio is larger than the "maximum size for atomic append writes", fail it.

Re: [PATCH] riscv: Setup exception vector for K210 properly

2020-08-10 Thread Damien Le Moal
On 2020/08/11 15:38, Qiu Wenbo wrote:
> Exception vector is missing on nommu platform and it is a big issue.
> This patch is tested in Sipeed MAIX Bit Dev Board.
> 
> Fixes: 79b1feba5455 ("RISC-V: Setup exception vector early")
> Signed-off-by: Qiu Wenbo 
> ---
>  arch/riscv/kernel/smpboot.c |  1 +
>  arch/riscv/kernel/traps.c   | 11 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index 356825a57551..23cde0ceb39d 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -154,6 +154,7 @@ asmlinkage __visible void smp_callin(void)
>   mmgrab(mm);
>   current->active_mm = mm;
>  
> + trap_init();
>   notify_cpu_starting(curr_cpuid);
>   update_siblings_masks(curr_cpuid);
>   set_cpu_online(curr_cpuid, 1);
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index ad14f4466d92..a390239818ae 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -174,7 +174,16 @@ int is_valid_bugaddr(unsigned long pc)
>  }
>  #endif /* CONFIG_GENERIC_BUG */
>  
> -/* stvec & scratch is already set from head.S */
> +/* stvec & scratch is already set from head.S when mmu is enabled */
>  void trap_init(void)
>  {
> +#ifndef CONFIG_MMU
> + /*
> +  * Set sup0 scratch register to 0, indicating to exception vector
> +  * that we are presently executing in the kernel
> +  */
> + csr_write(CSR_SCRATCH, 0);
> + /* Set the exception vector address */
> + csr_write(CSR_TVEC, &handle_exception);
> +#endif
>  }
> 

Looks OK to me. But out of curiosity, how did you trigger a problem ? I never
got any weird exceptions with my busybox userspace.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: Setup exception vector for K210 properly

2020-08-10 Thread Damien Le Moal
On 2020/08/11 15:38, Qiu Wenbo wrote:
> Exception vector is missing on nommu platform and it is a big issue.
> This patch is tested in Sipeed MAIX Bit Dev Board.
> 
> Fixes: 79b1feba5455 ("RISC-V: Setup exception vector early")

I think this needs a "Cc: sta...@vger.kernel.org #5.8" too.

> Signed-off-by: Qiu Wenbo 
> ---
>  arch/riscv/kernel/smpboot.c |  1 +
>  arch/riscv/kernel/traps.c   | 11 ++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index 356825a57551..23cde0ceb39d 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -154,6 +154,7 @@ asmlinkage __visible void smp_callin(void)
>   mmgrab(mm);
>   current->active_mm = mm;
>  
> + trap_init();
>   notify_cpu_starting(curr_cpuid);
>   update_siblings_masks(curr_cpuid);
>   set_cpu_online(curr_cpuid, 1);
> diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> index ad14f4466d92..a390239818ae 100644
> --- a/arch/riscv/kernel/traps.c
> +++ b/arch/riscv/kernel/traps.c
> @@ -174,7 +174,16 @@ int is_valid_bugaddr(unsigned long pc)
>  }
>  #endif /* CONFIG_GENERIC_BUG */
>  
> -/* stvec & scratch is already set from head.S */
> +/* stvec & scratch is already set from head.S when mmu is enabled */
>  void trap_init(void)
>  {
> +#ifndef CONFIG_MMU
> + /*
> +  * Set sup0 scratch register to 0, indicating to exception vector
> +  * that we are presently executing in the kernel
> +  */
> + csr_write(CSR_SCRATCH, 0);
> + /* Set the exception vector address */
> + csr_write(CSR_TVEC, &handle_exception);
> +#endif
>  }
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: Setup exception vector for K210 properly

2020-08-11 Thread Damien Le Moal
On 2020/08/11 16:06, Qiu Wenbo wrote:
> The serial port did not print anything after early console. 
> 
> [0.00] Sorting __ex_table...
> [0.00] mem auto-init: stack:off, heap alloc:off, heap free:off
> [0.00] Memory: 6480K/8192K available (1024K kernel code, 111K rwdata, 
> 170K rodata, 101K init, 97K bss, 1712K reserved, 0K cma-reserved)
> [0.00] rcu: Hierarchical RCU implementation.
> [0.00] rcu: RCU calculated value of scheduler-enlistment delay is 25 
> jiffies.
> [0.00] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
> [0.00] riscv-intc: 64 local interrupts mapped
> [0.00] plic: interrupt-controller@c00: mapped 65 interrupts with 
> 2 handlers for 4 contexts.
> [0.00] random: get_random_bytes called from 0x800019a4 with 
> crng_init=0
> [0.00] riscv_timer_init_dt: Registering clocksource cpuid [0] hartid 
> [0]
> [0.00] clocksource: riscv_clocksource: mask: 0x 
> max_cycles: 0x3990be68b, max_idle_ns: 881590404272 ns
> [0.15] sched_clock: 64 bits at 7MHz, resolution 128ns, wraps every 
> 4398046511054ns
> [0.008254] Console: colour dummy device 80x25

Interesting. Never saw that happening... Thanks !

> 
> 
> 
> > -----Original Message-----
> > From: "Damien Le Moal" 
> > Sent: 2020-08-11 14:42:15 (Tuesday)
> > To: "Qiu Wenbo" , "Palmer Dabbelt" 
> , "Paul Walmsley" , 
> "linux-ri...@lists.infradead.org" 
> > Cc: "Albert Ou" , "Atish Patra" 
> , "Anup
> >  Patel" , "Guo Ren" , 
> "Zong Li" , "Greentime Hu" , 
> "Vincent Chen" , "linux-kernel@vger.kernel.org" 
> 
> > Subject: Re: [PATCH] riscv: Setup exception vector for K210 properly
> > 
> > On 2020/08/11 15:38, Qiu Wenbo wrote:
> > > Exception vector is missing on nommu platform and it is a big issue.
> > > This patch is tested in Sipeed MAIX Bit Dev Board.
> > > 
> > > Fixes: 79b1feba5455 ("RISC-V: Setup exception vector early")
> > > Signed-off-by: Qiu Wenbo 
> > > ---
> > >  arch/riscv/kernel/smpboot.c |  1 +
> > >  arch/riscv/kernel/traps.c   | 11 ++-
> > >  2 files changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/riscv/kernel/smpboot.c 
> b/arch/riscv/kernel/smpboot.c
> > > index 356825a57551..23cde0ceb39d 100644
> > > --- a/arch/riscv/kernel/smpboot.c
> > > +++ b/arch/riscv/kernel/smpboot.c
> > > @@ -154,6 +154,7 @@ asmlinkage __visible void smp_callin(void)
> > > mmgrab(mm);
> > > current->active_mm = mm;
> > >  
> > > +   trap_init();
> > > notify_cpu_starting(curr_cpuid);
> > > update_siblings_masks(curr_cpuid);
> > > set_cpu_online(curr_cpuid, 1);
> > > diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
> > > index ad14f4466d92..a390239818ae 100644
> > > --- a/arch/riscv/kernel/traps.c
> > > +++ b/arch/riscv/kernel/traps.c
> > > @@ -174,7 +174,16 @@ int is_valid_bugaddr(unsigned long pc)
> > >  }
> > >  #endif /* CONFIG_GENERIC_BUG */
> > >  
> > > -/* stvec & scratch is already set from head.S */
> > > +/* stvec & scratch is already set from head.S when mmu is 
> enabled */
> > >  void trap_init(void)
> > >  {
> > > +#ifndef CONFIG_MMU
> > > +   /*
> > > +* Set sup0 scratch register to 0, indicating to exception 
> vector
> > > +* that we are presently executing in the kernel
> > > +*/
> > > +   csr_write(CSR_SCRATCH, 0);
> > > +   /* Set the exception vector address */
> > > +   csr_write(CSR_TVEC, &handle_exception);
> > > +#endif
> > >  }
> > > 
> > 
> > Looks OK to me. But out of curiosity, how did you trigger a problem ? I 
> never
> > got any weird exceptions with my busybox userspace.
> > 
> > -- 
> > Damien Le Moal
> > Western Digital Research
> 
> 
> 
> 
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] riscv: Setup exception vector for nommu platform

2020-08-12 Thread Damien Le Moal
On 2020/08/13 12:40, Qiu Wenbo wrote:
> Exception vector is missing on nommu platform and that is an issue.
> This patch is tested in Sipeed Maix Bit Dev Board.
> 
> Fixes: 79b1feba5455 ("RISC-V: Setup exception vector early")
> Suggested-by: Anup Patel 
> Suggested-by: Atish Patra 
> Signed-off-by: Qiu Wenbo 

Please add a cc stable #5.8 tag. Kendryte support is in 5.8 stable.

> ---
>  arch/riscv/kernel/head.S | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index d0c5c316e9bb..0a4e81b8dc79 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -77,16 +77,10 @@ relocate:
>   csrw CSR_SATP, a0
>  .align 2
>  1:
> - /* Set trap vector to exception handler */
> - la a0, handle_exception
> + /* Set trap vector to spin forever to help debug */
> + la a0, .Lsecondary_park
>   csrw CSR_TVEC, a0
>  
> - /*
> -  * Set sup0 scratch register to 0, indicating to exception vector that
> -  * we are presently executing in kernel.
> -  */
> - csrw CSR_SCRATCH, zero
> -
>   /* Reload the global pointer */
>  .option push
>  .option norelax
> @@ -144,9 +138,23 @@ secondary_start_common:
>   la a0, swapper_pg_dir
>   call relocate
>  #endif
> + call setup_trap_vector
>   tail smp_callin
>  #endif /* CONFIG_SMP */
>  
> +.align 2
> +setup_trap_vector:
> + /* Set trap vector to exception handler */
> + la a0, handle_exception
> + csrw CSR_TVEC, a0
> +
> + /*
> +  * Set sup0 scratch register to 0, indicating to exception vector that
> +  * we are presently executing in kernel.
> +  */
> + csrw CSR_SCRATCH, zero
> + ret
> +
>  .Lsecondary_park:
>   /* We lack SMP support or have too many harts, so park this hart */
>   wfi
> @@ -240,6 +248,7 @@ clear_bss_done:
>       call relocate
>  #endif /* CONFIG_MMU */
>  
> + call setup_trap_vector
>   /* Restore C environment */
>   la tp, init_task
>   sw zero, TASK_TI_CPU(tp)
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] riscv: Setup exception vector for nommu platform

2020-08-13 Thread Damien Le Moal
On 2020/08/13 15:45, Atish Patra wrote:
> On Wed, Aug 12, 2020 at 10:44 PM Damien Le Moal  wrote:
>>
>> On 2020/08/13 12:40, Qiu Wenbo wrote:
>>> Exception vector is missing on nommu platform and that is an issue.
>>> This patch is tested in Sipeed Maix Bit Dev Board.
>>>
>>> Fixes: 79b1feba5455 ("RISC-V: Setup exception vector early")
>>> Suggested-by: Anup Patel 
>>> Suggested-by: Atish Patra 
>>> Signed-off-by: Qiu Wenbo 
>>
>> Please add a cc stable #5.8 tag. Kendryte support is in 5.8 stable.
>>
> 
> That won't be necessary as the patch that broke nommu (79b1feba5455) was
> part of the 1st PR sent towards 5.9-rc1.

Oops. Yes indeed. Thanks !



-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Damien Le Moal
On 2020/08/19 17:34, Javier Gonzalez wrote:
> On 19.08.2020 09:40, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
>>> I understand that you want vendor alignment in the NVMe driver and I
>>> agree. We are not pushing for a non-append model - you can see that we
>>> are investing effort in implementing the append path in the block layer
>>> and io_uring and we will continue doing so as patches get merged.
>>>
>>> This said, we do have some OEM models that do not implement append and I
>>> would like them to be supported in Linux. As you know, new TPs are being
>>> standardized now and the append emulation is the base for adding
>>> support for this. I do not believe it is unreasonable to find a way to
>>> add support for these SSDs.
>>
>> I do not think we should support anything but Zone Append, especially not
>> the new TP, which is going to add even more horrible code for absolutely
>> no good reason.
> 
> I must admit that this is a bit frustrating. The new TP adds
> functionality beyond operating as an Append alternative that I would
> very much like to see upstream (do want to discuss details here).
> 
> I understand the concerns about deviating from the Append model, but I
> believe we should find a way to add these new features. We are hiding
> all the logic in the NVMe driver and not touching the interface with the
> block layer, so the overall model is really not changed.
> 
>>
>>> If you completely close the door this approach, the alternative is
>>> carrying off-tree patches to the several OEMs that use these devices.
>>> This is not good for the zoned ecosystem nor for the future of Zone
>>> Append.
>>
>> I really don't have a problem with that.  If these OEMs want to use
>> an inferior access model only, they have to pay the price for it.
>> I also don't think that proxy arguments are very useful.  If your OEMs
>> are troubled by carrying patches because they decided to buy inferior
>> drives they are perfectly happy to argue their cause here on the list.
> 
> I am not arguing as a proxy, I am stating the trouble we see from our
> perspective in having to diverge from mainline when our approach is
> being upstream first.
> 
> Whether the I/O mode is inferior or superior, they can answer that
> themselves if they read this list.
>>
>>> Are you open to us doing some characterization and if the impact
>>> to the fast path is not significant, moving ahead to a Zone Append
>>> emulation like in SCSI? I will promise that we will remove this path if
>>> requests for these devices terminate.
>>
>> As said I do not think implementing zone append emulation or the TP that
>> shall not be named are a good idea for Linux.
> 
> I would ask you to reconsider this position. I have a hard time
> understanding how zone append emulation is a good idea in SCSI and not
> in NVMe, when there is no performance penalty.

While defining a zone append command for SCSI/ZBC is possible (using sense data
for returning the written offset), there is no way to define zone append for
SATA/ZAC without entirely breaking the ATA command model. This is why we went
for an emulation implementation instead of trying to standardize native
commands. That implementation does not have any performance impact over regular
writes *and* zone write locking does not in general degrade HDD write
performance (only a few corner cases suffer from it). Comparing things equally,
the same could be said of NVMe drives that do not have zone append native
support: performance will be essentially the same using regular writes and
emulated zone append. But mq-deadline and zone write locking will significantly
lower performance for emulated zone append compared to a native zone append
support by the drive.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

2020-08-19 Thread Damien Le Moal
On 2020/08/19 18:27, Kanchan Joshi wrote:
> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig  wrote:
>>
>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>
>> No, it is not.
> 
> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> wonder how it became different for NVMe.

It is not required for an NVMe ZNS drive that has zone append native support.
zonefs and upcoming btrfs do not use regular writes, removing the requirement
for zone write locking.

In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set if
and only if the drive does not have native zone append support. And even in that
case, since for an emulated zone append the zone write lock is taken and
released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
only if the user will also be issuing regular writes at high QD. And that is
trivially controllable by the user by simply setting the drive elevator to
mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
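
In code, that conclusion amounts to something like the following in the ZNS
setup path (the zone append support flag name is hypothetical):

/* Sketch: require a zone-write-locking scheduler only when zone append
 * is emulated; native zone append needs no zone write locking. */
if (!ns->zone_append_native)	/* hypothetical flag */
	blk_queue_required_elevator_features(ns->queue,
					     ELEVATOR_F_ZBD_SEQ_WRITE);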


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

2020-08-19 Thread Damien Le Moal
On 2020/08/19 19:32, Kanchan Joshi wrote:
> On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal  wrote:
>>
>> On 2020/08/19 18:27, Kanchan Joshi wrote:
>>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig  wrote:
>>>>
>>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
>>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
>>>>
>>>> No, it is not.
>>>
>>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
>>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
>>> wonder how it became different for NVMe.
>>
>> It is not required for an NVMe ZNS drive that has zone append native support.
>> zonefs and upcoming btrfs do not use regular writes, removing the requirement
>> for zone write locking.
> 
> I understand that if a particular user (zonefs, btrfs etc) is not
> sending regular-write and sending append instead, write-lock is not
> required.
> But if that particular user or some other user (say F2FS) sends
> regular write(s), write-lock is needed.

And that can be trivially enabled by setting the drive elevator to mq-deadline.

> Above block-layer, both the opcodes REQ_OP_WRITE and
> REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> write-lock is taken or not is a per-opcode thing and not per-user (FS,
> MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> cater to both the opcodes, while other schedulers cannot serve
> REQ_OP_WRITE well for zoned-device.

mq-deadline ignores zone append commands. No zone lock is taken for these. In
scsi, the emulation takes the zone lock before transforming the zone append into
a regular write. That locking is consistent with the mq-scheduler level locking
since the same lock bitmap is used. So if the user only issues zone append
writes, mq-deadline is not needed and there is no reason to force its use by
setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...

>> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set
>> if and only if the drive does not have native zone append support.
> 
> Sure I can keep it that way, once I get it right. If it is really not
> required for native-append drive, it should not be here at the place
> where I added.
> 
>> And even in that
>> case, since for an emulated zone append the zone write lock is taken and
>> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
>> only if the user will also be issuing regular writes at high QD. And that is
>> trivially controllable by the user by simply setting the drive elevator to
>> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
> 
> Are we saying applications should switch schedulers based on the write
> QD (use any scheduler for QD1 and mq-deadline for QD-N)?
> Even if it does that, it does not know what other applications would
> be doing. That seems hard-to-get-right and possible only in a
> tightly-controlled environment.

Even for SMR, the user is free to set the elevator to none, which disables zone
write locking. Issuing writes correctly then becomes the responsibility of the
application. This can be useful for settings that for instance use NCQ I/O
priorities, which give better results when "none" is used.

As far as I know, zoned drives are always used in tightly controlled
environments. Problems like "does not know what other applications would be
doing" are non-existent. Setting up the drive correctly for the use case at hand
is a sysadmin/server setup problem, based on *the* application (singular)
requirements.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 4.19 13/35] null_blk: Fix zone size initialization

2021-01-10 Thread Damien Le Moal
On 2021/01/06 21:55, Pavel Machek wrote:
> Hi!
> 
>> commit 0ebcdd702f49aeb0ad2e2d894f8c124a0acc6e23 upstream.
>>
>> A null_blk device with zoned mode enabled is currently initialized
>> with a number of zones equal to the device capacity divided by the zone
>> size, without considering whether the device capacity is a multiple of the
>> zone size. If the zone size is not a divisor of the capacity, the zones
>> end up not covering the entire capacity, potentially resulting in out
>> of bounds accesses to the zone array.
>>
>> Fix this by adding one last smaller zone with a size equal to the
>> remainder of the disk capacity divided by the zone size if the capacity
>> is not a multiple of the zone size. For such smaller last zone, the zone
>> capacity is also checked so that it does not exceed the smaller zone
>> size.
> 
>> --- a/drivers/block/null_blk_zoned.c
>> +++ b/drivers/block/null_blk_zoned.c
>> @@ -1,9 +1,9 @@
>>  // SPDX-License-Identifier: GPL-2.0
>>  #include 
>> +#include 
>>  #include "null_blk.h"
>>  
>> -/* zone_size in MBs to sectors. */
>> -#define ZONE_SIZE_SHIFT 11
>> +#define MB_TO_SECTS(mb) (((sector_t)mb * SZ_1M) >> SECTOR_SHIFT)
> 
> This macro is quite dangerous. (mb) would help, but inline function
> would be better.

Indeed.

> 
> 
>> +dev->nr_zones = dev_capacity_sects >> ilog2(dev->zone_size_sects);
>> +if (dev_capacity_sects & (dev->zone_size_sects - 1))
>> +dev->nr_zones++;
> 
> Is this same as nr_zones = DIV_ROUND_UP(dev_capacity_sects,
> dev->zone_size_sects)? Would that be faster, more readable and robust
> against weird dev->zone_size_sects sizes?

Yes, we can change to this to be more readable.
Will send a cleanup patch. Thanks !
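
For reference, the suggested form, which is equivalent to the shift plus
remainder test above and also works for non-power-of-two zone sizes:

/* Round the capacity up to a whole number of zones. */
dev->nr_zones = DIV_ROUND_UP(dev_capacity_sects, dev->zone_size_sects);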

> 
> Best regards,
>   Pavel
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] RISC-V: simplify BUILTIN_DTB processing

2021-01-11 Thread Damien Le Moal
On 2021/01/12 6:21, Vitaly Wool wrote:
> Provide __dtb_start as a parameter to setup_vm() in case
> CONFIG_BUILTIN_DTB is true, so we don't have to duplicate
> BUILTIN_DTB specific processing in MMU-enabled and MMU-disabled
> versions of setup_vm().
> 
> Signed-off-by: Vitaly Wool 
> ---
>  arch/riscv/kernel/head.S | 4 
>  arch/riscv/mm/init.c | 4 
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 16e9941900c4..f5a9bad86e58 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -260,7 +260,11 @@ clear_bss_done:
>  
>   /* Initialize page tables and relocate to virtual addresses */
>   la sp, init_thread_union + THREAD_SIZE
> +#ifdef CONFIG_BUILTIN_DTB
> + la a0, __dtb_start
> +#else
>   mv a0, s1
> +#endif /* CONFIG_BUILTIN_DTB */
>   call setup_vm
>  #ifdef CONFIG_MMU
>   la a0, early_pg_dir
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 5b17f8d22f91..45faad7c4291 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -615,11 +615,7 @@ static void __init setup_vm_final(void)
>  #else
>  asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  {
> -#ifdef CONFIG_BUILTIN_DTB
> - dtb_early_va = (void *) __dtb_start;
> -#else
>   dtb_early_va = (void *)dtb_pa;
> -#endif
>   dtb_early_pa = dtb_pa;
>  }
>  
> 

Tested this with a nommu kernel on a MAIX bit board (K210 SoC). No problems
detected.

Tested-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-12 Thread Damien Le Moal
On 2021/01/12 17:52, Changheun Lee wrote:
> From: "Changheun Lee" 
> 
> bio size can grow up to 4GB when multi-page bvec is enabled,
> but sometimes it would lead to inefficient behaviors.
> In the case of large chunk direct I/O - a 64MB chunk read in user space -
> all pages for the 64MB would be merged into one bio structure if the memory
> addresses are physically contiguous. This introduces some delay in submitting
> the bio until the merge completes. The bio max size should be limited to a
> proper size.

But merging physically contiguous pages into the same bvec + later automatic bio
split on submit should give you better throughput for large IOs compared to
having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
likely need splitting anyway (because of DMA boundaries etc).

Do you have a specific case where you see higher performance with this patch
applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
considering that many devices can execute larger IOs than that.


> 
> Signed-off-by: Changheun Lee 
> ---
>  block/bio.c | 2 +-
>  include/linux/bio.h | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..dbe14d675f28 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
> *page,
>   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
>   if (page_is_mergeable(bv, page, len, off, same_page)) {
> - if (bio->bi_iter.bi_size > UINT_MAX - len) {
> + if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>   *same_page = false;
>   return false;
>   }
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..0f49b354b1f6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -20,6 +20,7 @@
>  #endif
>  
>  #define BIO_MAX_PAGES		256
> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>  
>  #define bio_prio(bio)			(bio)->bi_ioprio
>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>   if (bio->bi_vcnt >= bio->bi_max_vecs)
>   return true;
>  
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>   return true;
>  
>   return false;
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Damien Le Moal
On 2020/12/07 17:16, javier.g...@samsung.com wrote:
> On 07.12.2020 08:06, Damien Le Moal wrote:
>> On 2020/12/07 16:46, javier.g...@samsung.com wrote:
>>> On 04.12.2020 23:40, Keith Busch wrote:
>>>> On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
>>>>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>>>>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>>>>>> v2020.05.04 ("Ratified")
>>>>>>
>>>>>> The Specification can be found in following link.
>>>>>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>>>>>
>>>>>> This is an RFC. Looking forward for any feedbacks or other alternate
>>>>>> designs for plumbing simple copy to IO stack.
>>>>>>
>>>>>> Simple copy command is a copy offloading operation and is used to copy
>>>>>> multiple contiguous ranges (source_ranges) of LBAs to a single
>>>>>> destination LBA within the device, reducing traffic between host and
>>>>>> device.
>>>>>>
>>>>>> This implementation accepts destination, no of sources and arrays of
>>>>>> source ranges from application and attach it as payload to the bio and
>>>>>> submits to the device.
>>>>>>
>>>>>> Following limits are added to queue limits and are exposed in sysfs
>>>>>> to userspace
>>>>>>  - *max_copy_sectors* limits the sum of all source_range length
>>>>>>  - *max_copy_nr_ranges* limits the number of source ranges
>>>>>>  - *max_copy_range_sectors* limit the maximum number of sectors
>>>>>>  that can constitute a single source range.
>>>>>
>>>>> Same comment as before. I think this is a good start, but for this to be
>>>>> really useful to users and kernel components alike, this really needs copy
>>>>> emulation for drives that do not have a native copy feature, similar to
>>>>> what is done for write zeroes handling for instance: if the drive does not
>>>>> have a copy command (simple copy for NVMe or XCOPY for scsi), then the
>>>>> block layer should issue read/write commands to seamlessly execute the
>>>>> copy. Otherwise, this will only serve a small niche of users and will not
>>>>> be optimal for FS and DM drivers that could be simplified with a generic
>>>>> block layer copy functionality.
>>>>>
>>>>> This is my 10 cents though, others may differ about this.
>>>>
>>>> Yes, I agree that copy emulation support should be included with the
>>>> hardware enabled solution.
>>>
>>> Keith, Damien,
>>>
>>> Can we do the block layer emulation with this patchset and then work in
>>> follow-up patchses on (i) the FS interface with F2FS as a first user and
>>> (ii) other HW accelerations such as XCOPY?
>>
>> The initial patchset supporting NVMe simple copy and emulated copy, all
>> under an API that will probably be similar to that of dm-kcopyd, will cover
>> all block devices. Other hardware native support for copy functions such as
>> scsi extended copy can be added later under the hood without any API changes
>> (or minimal changes).
> 
> Sounds good. That we can do. We will add a new patch for this.
> 
>>
>> I am not sure what you mean by "FS interface for F2FS": the block layer API 
>> for
>> this copy functionality will be what F2FS (and other FSes) will call. That is
>> the interface, no ?
> 
> Essentially yes.. I mean adding the F2FS logic and potentially some
> helpers to the block layer to aid GC.

GC is very much specific to each FS, so I do not think adding helpers to the
block layer will have value. We should stick to a pure block copy API for that
layer.
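Something along these lines (a sketch only, mirroring the prototype from
the simple copy RFC; the final API may differ):

	int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
			      struct range_entry *src_rlist,
			      struct block_device *dest_bdev, sector_t dest,
			      gfp_t gfp_mask, int flags);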

> 
>>
>>> For XCOPY, I believe we need to have a separate discussion as much works
>>> is already done that we should align to.
>>
>> I think Martin (added to this thread) and others have looked into it but I do
>> not think that anything made it into the kernel yet.
> 
> Exactly. Looking at some of the code posted through time and recalling
> the discussions at LSF/MM, it seems like there are a number of things we
> are not addressing here that could be incorporated down the road, such
> as dedicated syscalls / extensions, multi namespace / device support,
> etc.

dm-kcopyd interface supports copy between multiple devices. That of course would
not enable NVMe simple copy use, but that makes the interface generic enough so
that we should not have any problem with other hardware copy functions.
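For reference, the dm-kcopyd entry point already takes one source region
and multiple destination regions, each specifying its own block device
(prototype as in include/linux/dm-kcopyd.h; check the header for the
exact signature):

	void dm_kcopyd_copy(struct dm_kcopyd_client *kc,
			    struct dm_io_region *from,
			    unsigned num_dests, struct dm_io_region *dests,
			    unsigned flags,
			    dm_kcopyd_notify_fn fn, void *context);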

>>
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-12 Thread Damien Le Moal
On 2021/01/12 21:14, Changheun Lee wrote:
>> On 2021/01/12 17:52, Changheun Lee wrote:
>>> From: "Changheun Lee" 
>>>
>>> bio size can grow up to 4GB when multi-page bvec is enabled,
>>> but sometimes this leads to inefficient behavior.
>>> In the case of large chunk direct I/O - a 64MB chunk read in user space -
>>> all pages for the 64MB are merged into one bio structure if the memory
>>> addresses are physically contiguous. This delays the first submission
>>> until the merge completes. The bio max size should be limited to a
>>> proper size.
>>
>> But merging physically contiguous pages into the same bvec + later automatic 
>> bio
>> split on submit should give you better throughput for large IOs compared to
>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and 
>> will
>> likely need splitting anyway (because of DMA boundaries etc).
>>
>> Do you have a specific case where you see higher performance with this patch
>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too 
>> small
>> considering that many hardware can execute larger IOs than that.
>>
> 
> When I tested a 32MB chunk read with O_DIRECT on Android, all pages of the
> 32MB were merged into one bio structure.
> And the elapsed time until the merge completed was about 2ms.
> It means the first bio-submit happens after 2ms.
> If the bio size is limited to 1MB with this patch, the first bio-submit
> happens after about 100us, by the bio_full operation.

bio_submit() will split the large BIO case into multiple requests while the
small BIO case will likely result in only one or two requests. That likely
explains the time difference here. However, for the large case, the 2ms will
issue ALL requests needed for processing the entire 32MB user IO while the
1MB bio case will need 32 different bio_submit() calls. So what is the actual
total latency difference for the entire 32MB user IO ? That is I think what
needs to be compared here.

Also, what is your device max_sectors_kb and max queue depth ?

> The delay is not large and can't be observed with a low speed device,
> but reducing the merge delay is needed for high speed devices.
> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
> with this patch on an Android platform.
> As you said, 1MB might be small for some devices,
> but a method is needed to resize or select the bio max size.

At the very least, I think that such a limit should not be arbitrary as your
patch proposes, but should rely on the device characteristics (e.g.
max_hw_sectors_kb/max_sectors_kb and queue depth).

> 
>>
>>>
>>> Signed-off-by: Changheun Lee 
>>> ---
>>>  block/bio.c | 2 +-
>>>  include/linux/bio.h | 3 ++-
>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
>>> *page,
>>> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>  
>>> if (page_is_mergeable(bv, page, len, off, same_page)) {
>>> -   if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>> +   if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>> *same_page = false;
>>> return false;
>>> }
>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>> index 1edda614f7ce..0f49b354b1f6 100644
>>> --- a/include/linux/bio.h
>>> +++ b/include/linux/bio.h
>>> @@ -20,6 +20,7 @@
>>>  #endif
>>>  
>>>  #define BIO_MAX_PAGES  256
>>> +#define BIO_MAX_SIZE   (BIO_MAX_PAGES * PAGE_SIZE)
>>>  
>>>  #define bio_prio(bio)  (bio)->bi_ioprio
>>>  #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned 
>>> len)
>>> if (bio->bi_vcnt >= bio->bi_max_vecs)
>>> return true;
>>>  
>>> -   if (bio->bi_iter.bi_size > UINT_MAX - len)
>>> +   if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>> return true;
>>>  
>>> return false;
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-12 Thread Damien Le Moal
	if (page_is_mergeable(bv, page, len, off, same_page)) {
		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
			*same_page = false;
			return false;
		}

With bio_max_size() being something like:

	static inline size_t bio_max_size(struct bio *bio)
	{
		sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
								 bio_op(bio));

		return max_sectors << SECTOR_SHIFT;
	}

Note that this is not super efficient as a BIO maximum size depends on the BIO
offset too (its start sector). So writing something similar to
blk_rq_get_max_sectors() would probably be better.
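For instance, something like this (a rough sketch; it assumes
blk_max_size_offset() is usable in this context):

	static inline size_t bio_max_size(struct bio *bio)
	{
		struct request_queue *q = bio->bi_disk->queue;

		/* Account for the bio start sector, like blk_rq_get_max_sectors() */
		return (size_t)blk_max_size_offset(q, bio->bi_iter.bi_sector)
			<< SECTOR_SHIFT;
	}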

> The current patch simply sets a default bio max size.
> Before multi-page bvec was applied, the bio max size was 1MB in the 4.x
> kernels, set by BIO_MAX_PAGES.
> So I think a 1MB bio max size is reasonable as a default.

max_sectors_kb is always defined for any block device so I do not think there is
a need for any arbitrary default value.

Since such an optimization very likely depends on the speed of the system CPU
and of the storage device used, it may be a good idea to have this configurable
through sysfs. That is, bio_max_size() simply returns UINT_MAX, leading to no
change from the current behavior, if the optimization is disabled (default) and
max_sectors_kb if it is enabled.
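A minimal sketch of that switch, assuming a new queue flag (the
blk_queue_limit_bio_size() helper name below is made up for illustration):

	static inline size_t bio_max_size(struct bio *bio)
	{
		struct request_queue *q = bio->bi_disk->queue;

		/* Default: keep the current unlimited (UINT_MAX) behavior. */
		if (!blk_queue_limit_bio_size(q))
			return UINT_MAX;

		/* Optimization enabled: cap the BIO at max_sectors. */
		return (size_t)queue_max_sectors(q) << SECTOR_SHIFT;
	}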

> 
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Changheun Lee 
>>>>> ---
>>>>>  block/bio.c | 2 +-
>>>>>  include/linux/bio.h | 3 ++-
>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct 
>>>>> page *page,
>>>>>   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>  
>>>>>   if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>> - if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>> + if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>           *same_page = false;
>>>>>       return false;
>>>>>   }
>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>> --- a/include/linux/bio.h
>>>>> +++ b/include/linux/bio.h
>>>>> @@ -20,6 +20,7 @@
>>>>>  #endif
>>>>>  
>>>>>  #define BIO_MAX_PAGES		256
>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>  
>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned 
>>>>> len)
>>>>>   if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>   return true;
>>>>>  
>>>>> - if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>> + if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>   return true;
>>>>>  
>>>>>   return false;
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Damien Le Moal
>>>> Western Digital Research
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>>
> 
> ---
> Changheun Lee
> Samsung Electronics
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-12 Thread Damien Le Moal
Building one large BIO is a lot more
efficient, CPU wise, than building and issuing a lot of small BIOs. That gives a
lot of benefits on high-end desktops and servers with fast CPUs, but is
counterproductive in your case with a slower CPU.

I wonder: what is the user IO size when you start seeing a performance drop
without the patch ? It is clear that limiting the BIO size does improve things
for the 32MB IO size you tested, but what about more realistic workloads with
128K or so IO sizes (typical IO size for an FS using the page cache) ?

> 
>>>
>>>>> The delay is not large and can't be observed with a low speed device,
>>>>> but reducing the merge delay is needed for high speed devices.
>>>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>>>> with this patch on an Android platform.
>>>>> As you said, 1MB might be small for some devices,
>>>>> but a method is needed to resize or select the bio max size.
>>>>
>>>> At the very least, I think that such limit should not be arbitrary as your 
>>>> patch
>>>> proposes but rely on the device characteristics (e.g.
>>>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>>>
>>>
>>> I agree with your opinion; I thought the same. For that, deeper research
>>> is needed: the proper timing to set it, bio structure modifications, etc.
>>
>> Why would you need any BIO structure modifications ? Your patch is on the
>> right track if limiting the BIO size is the right solution (I am still not
>> completely convinced). E.g., the code:
>>
>> if (page_is_mergeable(bv, page, len, off, same_page)) {
>> if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>> *same_page = false;
>> return false;
>> }
>>
>> could just become:
>>
>> if (page_is_mergeable(bv, page, len, off, same_page)) {
>> if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>> *same_page = false;
>> return false;
>> }
>>
>> With bio_max_size() being something like:
>>
>> static inline size_t bio_max_size(struct bio *bio)
>> {
>> sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
>> bio_op(bio));
>>
>> return max_sectors << SECTOR_SHIFT;
>> }
>>
>> Note that this is not super efficient as a BIO maximum size depends on the 
>> BIO
>> offset too (its start sector). So writing something similar to
>> blk_rq_get_max_sectors() would probably be better.
> 
> Good suggestion. :)
> 
>>
>>> The current patch simply sets a default bio max size.
>>> Before multi-page bvec was applied, the bio max size was 1MB in the 4.x
>>> kernels, set by BIO_MAX_PAGES.
>>> So I think a 1MB bio max size is reasonable as a default.
>>
>> max_sectors_kb is always defined for any block device so I do not think 
>> there is
>> a need for any arbitrary default value.
>>
>> Since such an optimization very likely depends on the speed of the system CPU
>> and of the storage device used, it may be a good idea to have this
>> configurable through sysfs. That is, bio_max_size() simply returns UINT_MAX,
>> leading to no change from the current behavior, if the optimization is
>> disabled (default) and max_sectors_kb if it is enabled.
>>
> 
> OK, I agree with you. It will be best for all now.
> I'll try to make this.
> 
>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Changheun Lee 
>>>>>>> ---
>>>>>>>  block/bio.c | 2 +-
>>>>>>>  include/linux/bio.h | 3 ++-
>>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>>>> --- a/block/bio.c
>>>>>>> +++ b/block/bio.c
>>>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct 
>>>>>>> page *page,
>>>>>>> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>>>  
>>>>>>> if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>>>> -   if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>>>> +   if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>>> *same_page = false;
>>>>>>> return false;
>>>>>>> }
>>>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>>>> --- a/include/linux/bio.h
>>>>>>> +++ b/include/linux/bio.h
>>>>>>> @@ -20,6 +20,7 @@
>>>>>>>  #endif
>>>>>>>  
>>>>>>>  #define BIO_MAX_PAGES  256
>>>>>>> +#define BIO_MAX_SIZE   (BIO_MAX_PAGES * PAGE_SIZE)
>>>>>>>  
>>>>>>>  #define bio_prio(bio)  (bio)->bi_ioprio
>>>>>>>  #define bio_set_prio(bio, prio)((bio)->bi_ioprio = 
>>>>>>> prio)
>>>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, 
>>>>>>> unsigned len)
>>>>>>> if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>>> return true;
>>>>>>>  
>>>>>>> -   if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>>>> +   if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>>> return true;
>>>>>>>  
>>>>>>> return false;
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Damien Le Moal
>>>>>> Western Digital Research
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Damien Le Moal
>>>> Western Digital Research
>>>>
>>>
>>> ---
>>> Changheun Lee
>>> Samsung Electronics
>>>
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>>
> 
> ---
> Changheun Lee
> Samsung Electronics
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-13 Thread Damien Le Moal
On 2021/01/13 18:19, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee  wrote:
>>
>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>> From: "Changheun Lee" 
>>>>>>
>>>>>> bio size can grow up to 4GB when multi-page bvec is enabled,
>>>>>> but sometimes this leads to inefficient behavior.
>>>>>> In the case of large chunk direct I/O - a 64MB chunk read in user space -
>>>>>> all pages for the 64MB are merged into one bio structure if the memory
>>>>>> addresses are physically contiguous. This delays the first submission
>>>>>> until the merge completes. The bio max size should be limited to a
>>>>>> proper size.
>>>>>
>>>>> But merging physically contiguous pages into the same bvec + later 
>>>>> automatic bio
>>>>> split on submit should give you better throughput for large IOs compared 
>>>>> to
>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized 
>>>>> and will
>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>
>>>>> Do you have a specific case where you see higher performance with this 
>>>>> patch
>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and 
>>>>> too small
>>>>> considering that many hardware can execute larger IOs than that.
>>>>>
>>>>
>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>> is merged into a bio structure.
>>>> And elapsed time to merge complete was about 2ms.
>>>> It means first bio-submit is after 2ms.
>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>> 100us by bio_full operation.
>>>
>>> bio_submit() will split the large BIO case into multiple requests while the
>>> small BIO case will likely result one or two requests only. That likely 
>>> explain
>>> the time difference here. However, for the large case, the 2ms will issue 
>>> ALL
>>> requests needed for processing the entire 32MB user IO while the 1MB bio 
>>> case
>>> will need 32 different bio_submit() calls. So what is the actual total 
>>> latency
>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>> compared here.
>>>
>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>
>>
>> 32MB total latency is about 19ms including merge time without this patch.
>> But with this patch, total latency is about 17ms including merge time too.
> 
> 19ms looks too big just for preparing one 32MB sized bio, which isn't
> supposed to take so long. Can you investigate where the 19ms is taken
> just for preparing one 32MB sized bio?

Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
total. So the BIO handling, submission+completion takes about 2.3ms, and
Changheun points above to 2ms for the submission part.

> 
> It might be iov_iter_get_pages() handling page faults. If yes, one suggestion
> is to enable THP (Transparent HugePage support) in your application.

But if that was due to page faults, the same large-ish time would be taken for
the preparing the size-limited BIOs too, no ? No matter how the BIOs are diced,
all 32MB of pages of the user IO are referenced...

> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-13 Thread Damien Le Moal
On 2021/01/13 19:25, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>> On 2021/01/13 18:19, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee  
>>> wrote:
>>>>
>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>> From: "Changheun Lee" 
>>>>>>>>
>>>>>>>> bio size can grow up to 4GB when multi-page bvec is enabled,
>>>>>>>> but sometimes this leads to inefficient behavior.
>>>>>>>> In the case of large chunk direct I/O - a 64MB chunk read in user
>>>>>>>> space - all pages for the 64MB are merged into one bio structure if
>>>>>>>> the memory addresses are physically contiguous. This delays the first
>>>>>>>> submission until the merge completes. The bio max size should be
>>>>>>>> limited to a proper size.
>>>>>>>
>>>>>>> But merging physically contiguous pages into the same bvec + later 
>>>>>>> automatic bio
>>>>>>> split on submit should give you better throughput for large IOs 
>>>>>>> compared to
>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized 
>>>>>>> and will
>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>
>>>>>>> Do you have a specific case where you see higher performance with this 
>>>>>>> patch
>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and 
>>>>>>> too small
>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>
>>>>>>
>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>>>> is merged into a bio structure.
>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>> It means first bio-submit is after 2ms.
>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is 
>>>>>> about
>>>>>> 100us by bio_full operation.
>>>>>
>>>>> bio_submit() will split the large BIO case into multiple requests while 
>>>>> the
>>>>> small BIO case will likely result one or two requests only. That likely 
>>>>> explain
>>>>> the time difference here. However, for the large case, the 2ms will issue 
>>>>> ALL
>>>>> requests needed for processing the entire 32MB user IO while the 1MB bio 
>>>>> case
>>>>> will need 32 different bio_submit() calls. So what is the actual total 
>>>>> latency
>>>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>>>> compared here.
>>>>>
>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>
>>>>
>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>> But with this patch, total latency is about 17ms including merge time too.
>>>
>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>> supposed to
>>> take so long.  Can you investigate where the 19ms is taken just for
>>> preparing one
>>> 32MB sized bio?
>>
>> Changheun mentioned that the device side IO latency is 16.7ms out of the 19ms
>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>> Changheun points above to 2ms for the submission part.
> 
> OK, looks I misunderstood the data.
> 
>>
>>>
>>> It might be iov_iter_get_pages() for handling page fault. If yes, one 
>>> suggestion
>>> is to enable THP(Transparent HugePage Support) in your application.
>>
>> But if that was due to page faults, the same large-ish time would be taken 
>> for
>> the preparing the size-limited BIOs too, no ? No matter how the BIOs are 
>> diced,
>> all 32MB of pages of the user IO are referenced...
> 
> If bio size is reduced to 1MB, just 256 pages need to be faulted before
> submitting this bio, instead of 256*32 pages. That is why the following was
> mentioned:
> 
>   It means the first bio-submit happens after 2ms.
>   If the bio size is limited to 1MB with this patch, the first bio-submit
>   happens after about 100us, by the bio_full operation.

Yes, but eventually, all pages for the 32MB IO will be faulted in, just not in
one go. The overall number of page faults is likely the same as with the large
BIO preparation. So I think we are back to my previous point, that is, reducing
the device idle time by starting a BIO more quickly, even a small one, leads to
overlap between the CPU time needed for the next BIO preparation and the
previous BIO execution, reducing the overall latency of the entire 32MB user
IO. I don't think that page faulting in itself is the reason.

> 
> 
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] bio: limit bio max size.

2021-01-13 Thread Damien Le Moal
On 2021/01/13 20:48, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
>> On 2021/01/13 19:25, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 09:28:02AM +0000, Damien Le Moal wrote:
>>>> On 2021/01/13 18:19, Ming Lei wrote:
>>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee  
>>>>> wrote:
>>>>>>
>>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>>>> From: "Changheun Lee" 
>>>>>>>>>>
>>>>>>>>>> bio size can grow up to 4GB when multi-page bvec is enabled,
>>>>>>>>>> but sometimes this leads to inefficient behavior.
>>>>>>>>>> In the case of large chunk direct I/O - a 64MB chunk read in user
>>>>>>>>>> space - all pages for the 64MB are merged into one bio structure if
>>>>>>>>>> the memory addresses are physically contiguous. This delays the
>>>>>>>>>> first submission until the merge completes. The bio max size should
>>>>>>>>>> be limited to a proper size.
>>>>>>>>>
>>>>>>>>> But merging physically contiguous pages into the same bvec + later 
>>>>>>>>> automatic bio
>>>>>>>>> split on submit should give you better throughput for large IOs 
>>>>>>>>> compared to
>>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily 
>>>>>>>>> sized and will
>>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>>>
>>>>>>>>> Do you have a specific case where you see higher performance with 
>>>>>>>>> this patch
>>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary 
>>>>>>>>> and too small
>>>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>>>
>>>>>>>>
>>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 
>>>>>>>> 32MB
>>>>>>>> is merged into a bio structure.
>>>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>>>> It means first bio-submit is after 2ms.
>>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is 
>>>>>>>> about
>>>>>>>> 100us by bio_full operation.
>>>>>>>
>>>>>>> bio_submit() will split the large BIO case into multiple requests while 
>>>>>>> the
>>>>>>> small BIO case will likely result one or two requests only. That likely 
>>>>>>> explain
>>>>>>> the time difference here. However, for the large case, the 2ms will 
>>>>>>> issue ALL
>>>>>>> requests needed for processing the entire 32MB user IO while the 1MB 
>>>>>>> bio case
>>>>>>> will need 32 different bio_submit() calls. So what is the actual total 
>>>>>>> latency
>>>>>>> difference for the entire 32MB user IO ? That is I think what needs to 
>>>>>>> be
>>>>>>> compared here.
>>>>>>>
>>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>>>
>>>>>>
>>>>>> 32MB total latency is about 19ms including merge time without this patch.
>>>>>> But with this patch, total latency is about 17ms including merge time 
>>>>>> too.
>>>>>
>>>>> 19ms looks too big just for preparing one 32MB sized bio, which isn't
>>>>> supposed to
>>>>> take so long.  Can you investigate where the 19ms is taken just for
>>>>> preparing one
>>>>> 32MB sized bio?
>>>>
>>>> Changheun mentioned that the device side IO latency is 16.7ms out of the 
>>>> 19ms
>>>> total. So the BIO handling, submission+completion takes about 2.3ms, and
>>>> Changheun points above to 2ms for the submission part.
>>>
>>> OK, looks I misunderstood the data.

Re: [PATCH] bio: limit bio max size.

2021-01-13 Thread Damien Le Moal
On 2021/01/14 12:53, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 12:02:44PM +0000, Damien Le Moal wrote:
>> On 2021/01/13 20:48, Ming Lei wrote:
>>> On Wed, Jan 13, 2021 at 11:16:11AM +0000, Damien Le Moal wrote:
>>>> On 2021/01/13 19:25, Ming Lei wrote:
>>>>> On Wed, Jan 13, 2021 at 09:28:02AM +, Damien Le Moal wrote:
>>>>>> On 2021/01/13 18:19, Ming Lei wrote:
>>>>>>> On Wed, Jan 13, 2021 at 12:09 PM Changheun Lee  
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>>>>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>>>>>>>> From: "Changheun Lee" 
>>>>>>>>>>>>
>>>>>>>>>>>> bio size can grow up to 4GB when multi-page bvec is enabled,
>>>>>>>>>>>> but sometimes this leads to inefficient behavior.
>>>>>>>>>>>> In the case of large chunk direct I/O - a 64MB chunk read in user
>>>>>>>>>>>> space - all pages for the 64MB are merged into one bio structure
>>>>>>>>>>>> if the memory addresses are physically contiguous. This delays
>>>>>>>>>>>> the first submission until the merge completes. The bio max size
>>>>>>>>>>>> should be limited to a proper size.
>>>>>>>>>>>
>>>>>>>>>>> But merging physically contiguous pages into the same bvec + later 
>>>>>>>>>>> automatic bio
>>>>>>>>>>> split on submit should give you better throughput for large IOs 
>>>>>>>>>>> compared to
>>>>>>>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily 
>>>>>>>>>>> sized and will
>>>>>>>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>>>>>>>
>>>>>>>>>>> Do you have a specific case where you see higher performance with 
>>>>>>>>>>> this patch
>>>>>>>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary 
>>>>>>>>>>> and too small
>>>>>>>>>>> considering that many hardware can execute larger IOs than that.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 
>>>>>>>>>> 32MB
>>>>>>>>>> is merged into a bio structure.
>>>>>>>>>> And elapsed time to merge complete was about 2ms.
>>>>>>>>>> It means first bio-submit is after 2ms.
>>>>>>>>>> If bio size is limited with 1MB with this patch, first bio-submit is 
>>>>>>>>>> about
>>>>>>>>>> 100us by bio_full operation.
>>>>>>>>>
>>>>>>>>> bio_submit() will split the large BIO case into multiple requests 
>>>>>>>>> while the
>>>>>>>>> small BIO case will likely result one or two requests only. That 
>>>>>>>>> likely explain
>>>>>>>>> the time difference here. However, for the large case, the 2ms will 
>>>>>>>>> issue ALL
>>>>>>>>> requests needed for processing the entire 32MB user IO while the 1MB 
>>>>>>>>> bio case
>>>>>>>>> will need 32 different bio_submit() calls. So what is the actual 
>>>>>>>>> total latency
>>>>>>>>> difference for the entire 32MB user IO ? That is I think what needs 
>>>>>>>>> to be
>>>>>>>>> compared here.
>>>>>>>>>
>>>>>>>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> 32MB total latency is about 19ms including merge time without this 
>>>>>>>> patch.
>>>>>>>> But with this patch, total latency is about 17ms including merge time too.

Re: [PATCH] drivers: pinctrl: Remove duplicate include of io.h

2021-03-22 Thread Damien Le Moal
On 2021/03/23 10:38, Wan Jiabing wrote:
> linux/io.h has been included at line 6, so remove the 
> duplicate include at line 18.
> 
> Signed-off-by: Wan Jiabing 
> ---
>  drivers/pinctrl/pinctrl-k210.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/pinctrl/pinctrl-k210.c b/drivers/pinctrl/pinctrl-k210.c
> index 8a733cf77ba0..f831526d06ff 100644
> --- a/drivers/pinctrl/pinctrl-k210.c
> +++ b/drivers/pinctrl/pinctrl-k210.c
> @@ -15,7 +15,6 @@
>  #include 
>  #include 
>  #include 
> -#include <linux/io.h>
>  
>  #include 
>  
> 

Good catch !

Reviewed-by: Damien Le Moal 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH -next 2/5] block: add ioctl to read the disk sequence number

2021-03-15 Thread Damien Le Moal
On 2021/03/16 5:14, Matthew Wilcox wrote:
> On Mon, Mar 15, 2021 at 09:02:39PM +0100, Matteo Croce wrote:
>> +++ b/include/uapi/linux/fs.h
>> @@ -184,6 +184,7 @@ struct fsxattr {
>>  #define BLKSECDISCARD _IO(0x12,125)
>>  #define BLKROTATIONAL _IO(0x12,126)
>>  #define BLKZEROOUT _IO(0x12,127)
>> +#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
>>  /*
>>   * A jump here: 130-131 are reserved for zoned block devices
>>   * (see uapi/linux/blkzoned.h)
> 
> Not your bug, but this is now 130-136.
> 
> +cc all the people who signed off on the commits that added those ioctl
> numbers without updating this comment.  Perhaps one of them will figure
> out how to stop this happening in future.
> 

Indeed. Will be more careful :)
And send a patch to fix this.
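The fix will simply update the comment, along these lines:

	/*
	 * A jump here: 130-136 are reserved for zoned block devices
	 * (see uapi/linux/blkzoned.h)
	 */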

Thanks !

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone()

2021-03-16 Thread Damien Le Moal
On 2021/03/16 21:30, Chao Yu wrote:
> In zonefs_open_zone(), if the opened zone count is larger than the
> .s_max_open_zones threshold, we fail to restore .i_wr_refcnt;
> fix this.
> 
> Fixes: b5c00e975779 ("zonefs: open/close zone on file open/close")
> Signed-off-by: Chao Yu 
> ---
>  fs/zonefs/super.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 0fe76f376dee..be6b99f7de74 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -966,8 +966,7 @@ static int zonefs_open_zone(struct inode *inode)
>  
>  	mutex_lock(&zi->i_truncate_mutex);
>  
> -	zi->i_wr_refcnt++;
> -	if (zi->i_wr_refcnt == 1) {
> +	if (zi->i_wr_refcnt == 0) {

Nit: if (!zi->i_wr_refcnt) ? I can change that when applying.

>  
>  		if (atomic_inc_return(&sbi->s_open_zones) > sbi->s_max_open_zones) {
>  			atomic_dec(&sbi->s_open_zones);
> @@ -978,7 +977,6 @@ static int zonefs_open_zone(struct inode *inode)
>  		if (i_size_read(inode) < zi->i_max_size) {
>  			ret = zonefs_zone_mgmt(inode, REQ_OP_ZONE_OPEN);
>  			if (ret) {
> -				zi->i_wr_refcnt--;
>  				atomic_dec(&sbi->s_open_zones);
>  				goto unlock;
>  			}
> @@ -986,6 +984,8 @@ static int zonefs_open_zone(struct inode *inode)
>  		}
>  	}
>  
> +	zi->i_wr_refcnt++;
> +
>  unlock:
>  	mutex_unlock(&zi->i_truncate_mutex);
>  
> 

Good catch ! Will apply this and check the zonefs test suite, as it let this
bug go undetected.

Thanks.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] btrfs: Fix a typo

2021-03-25 Thread Damien Le Moal
On 2021/03/26 10:02, Bhaskar Chowdhury wrote:
> 
> s/reponsible/responsible/
> 
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  fs/btrfs/scrub.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3d9088eab2fc..14de898967bf 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2426,7 +2426,7 @@ static void drop_csum_range(struct scrub_ctx *sctx, struct btrfs_ordered_sum *su
>   * the csum into @csum.
>   *
>   * The search source is sctx->csum_list, which is a pre-populated list
> - * storing bytenr ordered csum ranges.  We're reponsible to cleanup any range
> + * storing bytenr ordered csum ranges.  We're responsible to cleanup any range

While you are at it fixing typos, you may as well fix the grammar at the same time :)

We're responsible to cleanup... -> We're responsible for cleaning up...

>   * that is before @logical.
>   *
>   * Return 0 if there is no csum for the range.
> --
> 2.26.2
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v3 1/2] binfmt_flat: allow not offsetting data start

2021-04-15 Thread Damien Le Moal
On 2021/04/15 23:04, Greg Ungerer wrote:
> Hi Damien,
> 
> On 15/4/21 4:15 pm, Damien Le Moal wrote:
>> Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
>> the data start"") restored offsetting the start of the data section by
>> a number of words defined by MAX_SHARED_LIBS. As a result, since
>> MAX_SHARED_LIBS is never 0, a gap between the text and data sections
>> always exists. For architectures which cannot support such a gap
>> between the text and data sections (e.g. riscv nommu), flat binary
>> programs cannot be executed.
>>
>> To allow an architecture to request contiguous text and data sections,
>> introduce the config option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP.
>> Using this new option, the macro DATA_GAP_WORDS is conditionally
>> defined in binfmt_flat.c to MAX_SHARED_LIBS for architectures
>> tolerating the text-to-data gap (CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
>> disabled case) and to 0 when CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP is
>> enabled. DATA_GAP_WORDS is used in load_flat_file() to calculate the
>> data section length and start position.
>>
>> An architecture enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP also
>> prevents the use of the separate text/data load case (when the flat file
>> header flags FLAT_FLAG_RAM and FLAT_FLAG_GZIP are not set with NOMMU
>> kernels) and forces the use of a single RAM region for loading
>> (equivalent to FLAT_FLAG_RAM being set).
> 
> So is it the case that a flat format file on RISC-V will never have
> relocations?

No, it does have relocations. But there is no entry for the global pointer
(__global_pointer$) location. This is because the loading of that value in the
gp register in the C-library crt1.S is done using a PC-relative instruction. The
value for it is resolved at compile time and does not get a relocation table
entry. Other function calls and symbol references do have relocation table
entries, so the binary can be loaded anywhere. The missing relocation for the
global pointer mandates that text and data be loaded at the same positions
relative to each other that the linker file defines. Otherwise, loading
__global_pointer$ into the gp register (the first thing that the C library's
crt1.S does) results in a garbage value being loaded.

I tried some tricks with the linker file and changing uclibc crt1.S to have the
gp loading done using a symbol address instead of a PC-relative offset. I could
then see a relocation table entry for that symbol. That still did not work as I
was probably doing something wrong. Anyway, such a solution requires changing a
lot of things in the C library loading assembler that is common between NOMMU
and MMU code. Changing it would break MMU-enabled programs.


>> Signed-off-by: Damien Le Moal 
>> Acked-by: Palmer Dabbelt 
>> ---
>>   fs/Kconfig.binfmt |  3 +++
>>   fs/binfmt_flat.c  | 21 +++--
>>   2 files changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
>> index c6f1c8c1934e..c6df931d5d45 100644
>> --- a/fs/Kconfig.binfmt
>> +++ b/fs/Kconfig.binfmt
>> @@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
>>   config BINFMT_FLAT_OLD_ALWAYS_RAM
>>  bool
>>   
>> +config BINFMT_FLAT_NO_TEXT_DATA_GAP
>> +bool
>> +
>>   config BINFMT_FLAT_OLD
>>  bool "Enable support for very old legacy flat binaries"
>>  depends on BINFMT_FLAT
>> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
>> index b9c658e0548e..2be29bb964b8 100644
>> --- a/fs/binfmt_flat.c
>> +++ b/fs/binfmt_flat.c
>> @@ -74,6 +74,12 @@
>>   #define	MAX_SHARED_LIBS	(1)
>>   #endif
>>   
>> +#ifdef CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
>> +#define DATA_GAP_WORDS	(0)
>> +#else
>> +#define DATA_GAP_WORDS	(MAX_SHARED_LIBS)
>> +#endif
>> +
>>  struct lib_info {
>>  struct {
>>  unsigned long start_code;   /* Start of text 
>> segment */
>> @@ -559,7 +565,10 @@ static int load_flat_file(struct linux_binprm *bprm,
>>   * case,  and then the fully copied to RAM case which lumps
>>   * it all together.
>>   */
>> -if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
>> (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
>> +if (!IS_ENABLED(CONFIG_MMU) &&
>> +!IS_ENABLED(CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP) &&
> 
> If RISC-V flat format files must always be loaded to RAM then why don't
> they set the FLAT_FLAG_RAM when compiled/generated?

That is done. The patch I have for elf2flt sets it (using the elf2flt "-r"
option).

Re: [PATCH v2 0/2] Fix binfmt_flat loader for RISC-V

2021-04-15 Thread Damien Le Moal
On 2021/04/16 9:22, Al Viro wrote:
> On Thu, Apr 15, 2021 at 07:56:05AM +0200, Christoph Hellwig wrote:
>> binfmt_flat tends to go through Greg's uclinux tree, adding him and
>> the list.
> 
>   FWIW, my involvement with binfmt_flat had been pretty much nil -
> the least trivial had been "binfmt_flat: flat_{get,put}_addr_from_rp()
> should be able to fail" about 4 years ago and that fell out of hunting
> for places where __get_user() had been used without checking error values.
> 
>   It's in fs/*, but I've no way to test it and I have pretty much
> zero familiarity with the guts of that one, so I can't give any useful
> feedback on that series.  So consider the Christoph's comment seconded -
> you want it reviewed by gerg et.al., and it probably ought to go via
> gerg/uclinux.git tree.
> 
>   I'm reasonably familiar with binfmt_{elf,misc,script}; anything
> else gets touched as part of larger series and only with sanity checks
> from other folks, if the changes are not entirely trivial.

Al,

Thanks for the clarification. Would it make sense to have an entry in the
MAINTAINERS file pointing to Greg and the uclinux tree for binfmt_flat.c ?
Greg ?


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v3 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Damien Le Moal
On 2021/04/16 16:24, Greg Ungerer wrote:
> 
> On 16/4/21 9:22 am, Damien Le Moal wrote:
>> On 2021/04/15 23:04, Greg Ungerer wrote:
>>> Hi Damien,
>>>
>>> On 15/4/21 4:15 pm, Damien Le Moal wrote:
>>>> Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
>>>> the data start"") restored offsetting the start of the data section by
>>>> a number of words defined by MAX_SHARED_LIBS. As a result, since
>>>> MAX_SHARED_LIBS is never 0, a gap between the text and data sections
>>>> always exists. For architectures which cannot support such a gap
>>>> between the text and data sections (e.g. riscv nommu), flat binary
>>>> programs cannot be executed.
>>>>
>>>> To allow an architecture to request contiguous text and data sections,
>>>> introduce the config option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP.
>>>> Using this new option, the macro DATA_GAP_WORDS is conditionally
>>>> defined in binfmt_flat.c to MAX_SHARED_LIBS for architectures
>>>> tolerating the text-to-data gap (CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
>>>> disabled case) and to 0 when CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP is
>>>> enabled. DATA_GAP_WORDS is used in load_flat_file() to calculate the
>>>> data section length and start position.
>>>>
>>>> An architecture enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP also
>>>> prevents the use of the separate text/data load case (when the flat file
>>>> header flags FLAT_FLAG_RAM and FLAT_FLAG_GZIP are not set with NOMMU
>>>> kernels) and forces the use of a single RAM region for loading
>>>> (equivalent to FLAT_FLAG_RAM being set).
>>>
>>> So is it the case that a flat format file on RISC-V will never have
>>> relocations?
>>
>> No, it does have relocations. But there is no entry for the global pointer
>> (__global_pointer$) location. This is because the loading of that value in 
>> the
>> gp register in the C-library crt1.S is done using a PC-relative instruction. 
>> The
>> value for it is resolved at compile time and does not get a relocation table
>> entry. Other function calls and symbol references do have relocation table
>> entries, so the binary can be loaded anywhere. The missing relocation for the
>> global pointer mandates that text and data be loaded at the same positions
>> relative to each other that the linker file defines. Otherwise, loading
>> __global_pointer$ into the gp register (the first thing that the C library's
>> crt1.S does) results in a garbage value being loaded.
>>
>> I tried some tricks with the linker file and changing uclibc crt1.S to have 
>> the
>> gp loading done using a symbol address instead of a PC-relative offset. I 
>> could
>> then see a relocation table entry for that symbol. That still did not work 
>> as I
>> was probably doing something wrong. Anyway, such a solution requires changing
>> a lot of things in the C library loading assembler that is common between
>> NOMMU and MMU code. Changing it would break MMU-enabled programs.
>>
>>
>>>> Signed-off-by: Damien Le Moal 
>>>> Acked-by: Palmer Dabbelt 
>>>> ---
>>>>fs/Kconfig.binfmt |  3 +++
>>>>fs/binfmt_flat.c  | 21 +++--
>>>>2 files changed, 18 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
>>>> index c6f1c8c1934e..c6df931d5d45 100644
>>>> --- a/fs/Kconfig.binfmt
>>>> +++ b/fs/Kconfig.binfmt
>>>> @@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
>>>>config BINFMT_FLAT_OLD_ALWAYS_RAM
>>>>bool
>>>>
>>>> +config BINFMT_FLAT_NO_TEXT_DATA_GAP
>>>> +  bool
>>>> +
>>>>config BINFMT_FLAT_OLD
>>>>bool "Enable support for very old legacy flat binaries"
>>>>depends on BINFMT_FLAT
>>>> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
>>>> index b9c658e0548e..2be29bb964b8 100644
>>>> --- a/fs/binfmt_flat.c
>>>> +++ b/fs/binfmt_flat.c
>>>> @@ -74,6 +74,12 @@
>>>>    #define	MAX_SHARED_LIBS	(1)
>>>>    #endif
>>>>
>>>> +#ifdef CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
>>>> +#define DATA_GAP_WORDS	(0)
>>>> +#else
>>>> +#define DATA_GAP_WORDS	(MAX_SHARED_LIBS)

[PATCH v4 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Damien Le Moal
Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap
between the text and data sections (e.g. riscv nommu), flat binary
programs cannot be executed.

To allow an architecture to request no data start offset, so that the
text and data sections of binaries flagged with FLAT_FLAG_RAM can be
contiguous, introduce the new config option
CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. Using this new option, the
macro DATA_START_OFFSET_WORDS is conditionally defined in binfmt_flat.c
to MAX_SHARED_LIBS for architectures tolerating or needing the data
start offset (CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET disabled case)
and to 0 when CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET is enabled.
DATA_START_OFFSET_WORDS is used in load_flat_file() to calculate the
data section length and start position.

Signed-off-by: Damien Le Moal 
---
 fs/Kconfig.binfmt |  3 +++
 fs/binfmt_flat.c  | 19 ++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index c6f1c8c1934e..06fb7a93a1bd 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
 config BINFMT_FLAT_OLD_ALWAYS_RAM
bool
 
+config BINFMT_FLAT_NO_DATA_START_OFFSET
+   bool
+
 config BINFMT_FLAT_OLD
bool "Enable support for very old legacy flat binaries"
depends on BINFMT_FLAT
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..1dc68dfba3e0 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
 #define	MAX_SHARED_LIBS	(1)
 #endif
 
+#ifdef CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
+#define DATA_START_OFFSET_WORDS	(0)
+#else
+#define DATA_START_OFFSET_WORDS	(MAX_SHARED_LIBS)
+#endif
+
 struct lib_info {
struct {
 		unsigned long start_code;	/* Start of text segment */
@@ -560,6 +566,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 * it all together.
 */
if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
(FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+
/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +583,8 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
 
-   len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned 
long);
+   len = data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
 
pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
@@ -622,7 +630,8 @@ static int load_flat_file(struct linux_binprm *bprm,
memp_size = len;
} else {
 
-   len = text_len + data_len + extra + MAX_SHARED_LIBS * 
sizeof(u32);
+   len = text_len + data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 
realdatastart = textpos + ntohl(hdr->data_start);
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_START_OFFSET_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
 
reloc = (__be32 __user *)
@@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);
vm_munmap(textpos, text_len + data_len + extra +
-   MAX_SHARED_LIBS * sizeof(u32));
+ DATA_START_OFFSET_WORDS * sizeof(u32));
goto err;
}
}
-- 
2.30.2



[PATCH v4 2/2] riscv: Disable data start offset in flat binaries

2021-04-16 Thread Damien Le Moal
uclibc/gcc combined with the elf2flt riscv linker file fully resolve the
PC-relative __global_pointer$ value at compile time and do not generate
a relocation entry to set a correct value of the gp register at runtime.
As a result, if the flatbin loader offsets the start of the data
section, the relative position change between the text and data sections
compared to the compile time positions results in an incorrect gp value
being used. This causes flatbin executables to crash.

Avoid this problem by enabling CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
automatically when CONFIG_RISCV is enabled and CONFIG_MMU is disabled.

Signed-off-by: Damien Le Moal 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 4515a10c5d22..add528eb9235 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,7 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK
-- 
2.30.2



[PATCH v4 0/2] Fix binfmt_flat loader for RISC-V

2021-04-16 Thread Damien Le Moal
RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
data sections as the toolchain fully resolves at compile time the
PC-relative global pointer (the __global_pointer$ value loaded in the gp
register). Without a relocation entry provided, the flat bin loader
cannot fix the value if a gap is introduced and user executables fail
to run.

This series fixes this problem by allowing an architecture to request
the flat loader to suppress the offset of the data start section.
Combined with the use of elf2flt "-r" option to mark the flat
executables with the FLAT_FLAG_RAM flag, the text and data sections are
loaded contiguously in memory, without a change in their relative
position from compile time.

The first patch fixes binfmt_flat flat_load_file() using the new
configuration option CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. The
second patch enables this new option for RISCV NOMMU builds.

These patches do not change the binfmt_flat loader behavior for other
architectures.

Changes from v3:
* Renamed the configuration option from
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP to
  CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to clarify that only the
  offseting of the data section start is suppressed.
* Do not force loading to RAM (contiguously) if the flat binary does not
  have the FLAT_FLAG_RAM flag set.
* Updated commit messages to reflect above changes.

Changes from v2:
* Updated distribution list
* Added Palmer ack-by tag

Changes from v1:
* Replace FLAT_TEXT_DATA_NO_GAP macro with
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
* Remove the addition of riscv/include/asm/flat.h and set
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU

Damien Le Moal (2):
  binfmt_flat: allow not offsetting data start
  riscv: Disable data start offset in flat binaries

 arch/riscv/Kconfig |  1 +
 fs/Kconfig.binfmt  |  3 +++
 fs/binfmt_flat.c   | 19 ++-
 3 files changed, 18 insertions(+), 5 deletions(-)

-- 
2.30.2



Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-04-12 Thread Damien Le Moal
On 2021/04/12 23:35, Selva Jove wrote:
> On Mon, Apr 12, 2021 at 5:55 AM Damien Le Moal  wrote:
>>
>> On 2021/04/07 20:33, Selva Jove wrote:
>>> Initially I started moving the dm-kcopyd interface to the block layer
>>> as a generic interface.
>>> Once I dug deeper into the dm-kcopyd code, I figured out that dm-kcopyd
>>> is tightly coupled with dm_io().
>>>
>>> Moving dm-kcopyd to the block layer would also require the dm_io code
>>> to be moved to the block layer.
>>> That would cause havoc in the dm layer, as dm_io is its backbone, and
>>> would need a complete rewrite of the dm layer. Do you see any other way
>>> of doing this without having to move the dm_io code or duplicate it ?
>>
>> Right. Missed that. So reusing dm-kcopyd and making it a common interface
>> will take some more effort. OK, then. For the first round of commits, let's
>> forget about this. But I still think that your emulation could be a lot
>> better than a loop doing blocking writes after blocking reads.
>>
> 
> The current implementation issues reads asynchronously and, once all the
> reads are completed, issues the write as a whole to reduce the IO traffic
> in the queue.
> I agree that things can be better. I will explore another approach of
> sending the writes immediately once reads are completed, with plugging to
> increase the chances of merging.
> 
>> [...]
>>>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
>>>>> + struct range_entry *src_rlist, struct block_device 
>>>>> *dest_bdev,
>>>>> + sector_t dest, gfp_t gfp_mask, int flags)
>>>>> +{
>>>>> + struct request_queue *q = bdev_get_queue(src_bdev);
>>>>> + struct request_queue *dest_q = bdev_get_queue(dest_bdev);
>>>>> + struct blk_copy_payload *payload;
>>>>> + sector_t bs_mask, copy_size;
>>>>> + int ret;
>>>>> +
>>>>> + ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
>>>>> + &payload, ©_size);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>>>> + if (dest & bs_mask) {
>>>>> + return -EINVAL;
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + if (q == dest_q && q->limits.copy_offload) {
>>>>> + ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
>>>>> + if (ret)
>>>>> + goto out;
>>>>> + } else if (flags & BLKDEV_COPY_NOEMULATION) {
>>>>
>>>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would 
>>>> that
>>>> user say "Fail on me if the device does not support copy" ??? This is a 
>>>> weird
>>>> interface in my opinion.
>>>>
>>>
>>> BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() 
>>> callers
>>> to use their native copying method instead of the emulated copy that I
>>> added. This way we
>>> ensure that dm uses the hw-assisted copy and if that is not present,
>>> it falls back to existing
>>> copy method.
>>>
>>> The other users who don't have their native emulation can use this
>>> emulated-copy implementation.
>>
>> I do not understand. Emulation or not should be entirely driven by the device
>> reporting support for simple copy (or not). It does not matter which 
>> component
>> is issuing the simple copy call: an FS to a real device, an FS to a DM 
>> device
>> or a DM target driver. If the underlying device reported support for simple
>> copy, use that. Otherwise, emulate with read/write. What am I missing here ?
>>
> 
> The blkdev_issue_copy() API will generally complete the copy operation,
> either by using offloaded copy or by using emulated copy. The caller of
> the API is not required to figure out the type of support. However, it
> can opt out of emulated copy by specifying the flag BLKDEV_NOEMULATION.
> This is helpful for the case when the caller already has a sophisticated
> emulation (e.g. dm-kcopyd users).

This does not make any sense to me. If the user already has another means of
doing copies, then that user will not call blkdev_issue_copy(). So I really do
not understand what the "opting out of emulated copy" would be useful for. That
user can check the simple copy support flag in the device request queue and act
accordingly: use its own block copy code when simple copy is not supported, or
use blkdev_issue_copy() when the device has simple copy. Adding that
BLKDEV_COPY_NOEMULATION flag does not serve any purpose at all.
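
Something like this on the caller side is all that is needed (a minimal
sketch only, using this patchset's blk_queue_copy() helper and
blkdev_issue_copy() prototype; caller_own_copy() is just a placeholder for
the caller's existing copy method, e.g. dm-kcopyd for DM):

	struct request_queue *q = bdev_get_queue(bdev);
	int ret;

	if (blk_queue_copy(q)) {
		/* Device has native (simple) copy support: offload */
		ret = blkdev_issue_copy(bdev, nr_srcs, rlist, bdev,
					dest, GFP_KERNEL, 0);
	} else {
		/* No native copy support: use the caller's own copy code */
		ret = caller_own_copy(bdev, nr_srcs, rlist, dest);
	}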



-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2 0/2] Fix binfmt_flat loader for RISC-V

2021-04-14 Thread Damien Le Moal
On 2021/04/08 0:49, Damien Le Moal wrote:
> RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
> data sections as the toolchain fully resolves at compile time the PC
> relative global pointer (__global_pointer$ value loaded in gp register).
> Without a relocation entry provided, the flat bin loader cannot fix the
> value if a gap is introduced and executables fail to run.
> 
> This series fixes this problem by allowing an architecture to request
> the flat loader to suppress the gap between the text and data sections.
> The first patch fixes binfmt_flat flat_load_file() using the new
> configuration option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP. The second
> patch enables this option for RISCV NOMMU builds.
> 
> These patches do not change the binfmt_flat loader behavior for other
> architectures.
> 
> Changes from v1:
> * Replace FLAT_TEXT_DATA_NO_GAP macro with
>   CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
> * Remove the addition of riscv/include/asm/flat.h and set
>   CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU
> 
> Damien Le Moal (2):
>   binfmt_flat: allow not offsetting data start
>   riscv: Disable text-data gap in flat binaries
> 
>  arch/riscv/Kconfig |  1 +
>  fs/Kconfig.binfmt  |  3 +++
>  fs/binfmt_flat.c   | 21 +++--
>  3 files changed, 19 insertions(+), 6 deletions(-)
> 

Ping ?

Any comment on these patches ?

Without them, RISC-V NOMMU user space does not run... I would really like to get
these in this cycle if possible.


-- 
Damien Le Moal
Western Digital Research


[PATCH v3 0/2] Fix binfmt_flat loader for RISC-V

2021-04-14 Thread Damien Le Moal
RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
data sections as the toolchain fully resolves at compile time the PC
relative global pointer (__global_pointer$ value loaded in gp register).
Without a relocation entry provided, the flat bin loader cannot fix the
value if a gap is introduced and executables fail to run.

This series fixes this problem by allowing an architecture to request
the flat loader to suppress the gap between the text and data sections.
The first patch fixes binfmt_flat flat_load_file() using the new
configuration option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP. The second
patch enables this option for RISCV NOMMU builds.

These patches do not change the binfmt_flat loader behavior for other
architectures.

Changes from v2:
* Updated distribution list
* Added Palmer ack-by tag

Changes from v1:
* Replace FLAT_TEXT_DATA_NO_GAP macro with
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
* Remove the addition of riscv/include/asm/flat.h and set
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU

Damien Le Moal (2):
  binfmt_flat: allow not offsetting data start
  riscv: Disable text-data gap in flat binaries

 arch/riscv/Kconfig |  1 +
 fs/Kconfig.binfmt  |  3 +++
 fs/binfmt_flat.c   | 21 +++--
 3 files changed, 19 insertions(+), 6 deletions(-)

-- 
2.30.2



[PATCH v3 1/2] binfmt_flat: allow not offsetting data start

2021-04-14 Thread Damien Le Moal
Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap
between the text and data sections (e.g. riscv nommu), flat binary
programs cannot be executed.

To allow an architecture to request contiguous text and data sections,
introduce the config option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP.
Using this new option, the macro DATA_GAP_WORDS is conditionally
defined in binfmt_flat.c to MAX_SHARED_LIBS for architectures
tolerating the text-to-data gap (CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
disabled case) and to 0 when CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP is
enabled. DATA_GAP_WORDS is used in load_flat_file() to calculate the
data section length and start position.

An architecture enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP also
prevents the use of the separate text/data load case (when the flat file
header flags FLAT_FLAG_RAM and FLAT_FLAG_GZIP are not set with NOMMU
kernels) and forces the use of a single RAM region for loading
(equivalent to FLAT_FLAG_RAM being set).

Signed-off-by: Damien Le Moal 
Acked-by: Palmer Dabbelt 
---
 fs/Kconfig.binfmt |  3 +++
 fs/binfmt_flat.c  | 21 +++--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index c6f1c8c1934e..c6df931d5d45 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
 config BINFMT_FLAT_OLD_ALWAYS_RAM
bool
 
+config BINFMT_FLAT_NO_TEXT_DATA_GAP
+   bool
+
 config BINFMT_FLAT_OLD
bool "Enable support for very old legacy flat binaries"
depends on BINFMT_FLAT
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..2be29bb964b8 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
 #defineMAX_SHARED_LIBS (1)
 #endif
 
+#ifdef CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
+#define DATA_GAP_WORDS (0)
+#else
+#define DATA_GAP_WORDS (MAX_SHARED_LIBS)
+#endif
+
 struct lib_info {
struct {
	unsigned long start_code;	/* Start of text segment */
@@ -559,7 +565,10 @@ static int load_flat_file(struct linux_binprm *bprm,
 * case,  and then the fully copied to RAM case which lumps
 * it all together.
 */
-	if (!IS_ENABLED(CONFIG_MMU) && !(flags & (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+   if (!IS_ENABLED(CONFIG_MMU) &&
+   !IS_ENABLED(CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP) &&
+   !(flags & (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+
/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +585,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
 
-	len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);
+   len = data_len + extra + DATA_GAP_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +600,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_GAP_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
 
pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
@@ -622,7 +631,7 @@ static int load_flat_file(struct linux_binprm *bprm,
memp_size = len;
} else {
 
-	len = text_len + data_len + extra + MAX_SHARED_LIBS * sizeof(u32);
+	len = text_len + data_len + extra + DATA_GAP_WORDS * sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 
realdatastart = textpos + ntohl(hdr->data_start);
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_GAP_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
 
reloc = (__be32 __user *)
@@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);

[PATCH v3 2/2] riscv: Disable text-data gap in flat binaries

2021-04-14 Thread Damien Le Moal
uclibc/gcc combined with the elf2flt riscv linker file fully resolve the
PC relative __global_pointer$ value at compile time and do not generate
a relocation entry to set a runtime gp value. As a result, if the
flatbin loader introduces a gap between the text and data sections, the
gp value becomes incorrect and prevents correct execution of a flatbin
executable.

Avoid this problem by enabling CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP
automatically when CONFIG_RISCV is enabled and CONFIG_MMU disabled.

Signed-off-by: Damien Le Moal 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 0d0cf67359cb..6a85fbbd056e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,7 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select BINFMT_FLAT_NO_TEXT_DATA_GAP if !MMU
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK
-- 
2.30.2



Re: [PATCH v2 0/2] Fix binfmt_flat loader for RISC-V

2021-04-14 Thread Damien Le Moal
On 2021/04/15 14:56, Christoph Hellwig wrote:
> binfmt_flat tends to go through Greg's uclinux tree, adding him and
> the list.

Thanks Christoph. I resent the series adding Greg and uclinux-dev.
Maybe the MAINTAINERS file needs an update ?

> 
> On Wed, Apr 14, 2021 at 10:46:36PM -0700, Palmer Dabbelt wrote:
>> On Wed, 14 Apr 2021 17:32:10 PDT (-0700), Damien Le Moal wrote:
>>>> On 2021/04/08 0:49, Damien Le Moal wrote:
>>>> RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
>>>> data section as the toolchain fully resolves at compile time the PC
>>>> relative global pointer (__global_pointer$ value loaded in gp register).
>>>> Without a relocation entry provided, the flat bin loader cannot fix the
>>>> value if a gap is introduced and executables fail to run.
>>>>
>>>> This series fixes this problem by allowing an architecture to request
>>>> the flat loader to suppress the gap between the text and data sections.
>>>> The first patch fixes binfmt_flat flat_load_file() using the new
>>>> configuration option CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP. The second
>>>> patch enables this option for RISCV NOMMU builds.
>>>>
>>>> These patches do not change the binfmt_flat loader behavior for other
>>>> architectures.
>>>>
>>>> Changes from v1:
>>>> * Replace FLAT_TEXT_DATA_NO_GAP macro with
>>>>   CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
>>>> * Remove the addition of riscv/include/asm/flat.h and set
>>>>   CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU
>>>>
>>>> Damien Le Moal (2):
>>>>   binfmt_flat: allow not offsetting data start
>>>>   riscv: Disable text-data gap in flat binaries
>>>>
>>>>  arch/riscv/Kconfig |  1 +
>>>>  fs/Kconfig.binfmt  |  3 +++
>>>>  fs/binfmt_flat.c   | 21 +++--
>>>>  3 files changed, 19 insertions(+), 6 deletions(-)
>>>>
>>>
>>> Ping ?
>>>
>>> Any comment on these patches ?
>>>
>>> Without them, RISC-V NOMMU user space does not run... I would really like 
>>> to get
>>> these in this cycle if possible.
>>
>> This LGTM, but it's pretty far out of my area of expertise.  I'm happy to 
>> take them via my tree, but I'd prefer to get an Ack from someone.
>>
>> Al, get_maintainer suggests you?
>>
>> Acked-by: Palmer Dabbelt 
> ---end quoted text---
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RESEND,v5,1/2] bio: limit bio max size

2021-04-07 Thread Damien Le Moal
> created.
>>>>>>>>>> And it leads to delaying the first I/O request issue.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Changheun Lee 
>>>>>>>>>> ---
>>>>>>>>>>  block/bio.c| 13 -
>>>>>>>>>>  include/linux/bio.h|  2 +-
>>>>>>>>>>  include/linux/blkdev.h |  3 +++
>>>>>>>>>>  3 files changed, 16 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>> index 1f2cc1fbe283..c528e1f944c7 100644
>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>> @@ -287,6 +287,17 @@ void bio_init(struct bio *bio, struct bio_vec 
>>>>>>>>>> *table,
>>>>>>>>>>  }
>>>>>>>>>>  EXPORT_SYMBOL(bio_init);
>>>>>>>>>>  
>>>>>>>>>> +unsigned int bio_max_size(struct bio *bio)
>>>>>>>>>> +{
>>>>>>>>>> +struct request_queue *q = bio->bi_disk->queue;
>>>>>>>>>> +
>>>>>>>>>> +if (blk_queue_limit_bio_size(q))
>>>>>>>>>> +return blk_queue_get_max_sectors(q, bio_op(bio))
>>>>>>>>>> +<< SECTOR_SHIFT;
>>>>>>>>>> +
>>>>>>>>>> +return UINT_MAX;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  /**
>>>>>>>>>>   * bio_reset - reinitialize a bio
>>>>>>>>>>   * @bio:bio to reset
>>>>>>>>>> @@ -877,7 +888,7 @@ bool __bio_try_merge_page(struct bio *bio, 
>>>>>>>>>> struct page *page,
>>>>>>>>>>  struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>>>>>>  
>>>>>>>>>>  if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>>>>>>> -if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>>>>>>> +		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>>>>>>>>>>  *same_page = false;
>>>>>>>>>>  return false;
>>>>>>>>>>  }
>>>>>>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>>>>>>> index 1edda614f7ce..13b6f6562a5b 100644
>>>>>>>>>> --- a/include/linux/bio.h
>>>>>>>>>> +++ b/include/linux/bio.h
>>>>>>>>>> @@ -113,7 +113,7 @@ static inline bool bio_full(struct bio *bio, 
>>>>>>>>>> unsigned len)
>>>>>>>>>>  if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>>>>>>  return true;
>>>>>>>>>>  
>>>>>>>>>> -if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>>>>>>> +if (bio->bi_iter.bi_size > bio_max_size(bio) - len)
>>>>>>>>>>  return true;
>>>>>>>>>>  
>>>>>>>>>>  return false;
>>>>>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>>>>>>>>>> index f94ee3089e01..3aeab9e7e97b 100644
>>>>>>>>>> --- a/include/linux/blkdev.h
>>>>>>>>>> +++ b/include/linux/blkdev.h
>>>>>>>>>> @@ -621,6 +621,7 @@ struct request_queue {
>>>>>>>>>>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27 /* record rq->alloc_time_ns */
>>>>>>>>>>  #define QUEUE_FLAG_HCTX_ACTIVE  28  /* at least one blk-mq 
>>>>>>>>>> hctx is active */
>>>>>>>>>>  #define QUEUE_FLAG_NOWAIT   29  /* device supports NOWAIT */
>>>>>>>>>> +#define QUEUE_FLAG_LIMIT_BIO_SIZE 30/* limit bio size */
>>>>>>>>>>  
>>>>>>>>>>  #define QUEUE_FLAG_MQ_DEFAULT   ((1 << QUEUE_FLAG_IO_STAT) |
>>>>>>>>>> \
>>>>>>>>>>   (1 << QUEUE_FLAG_SAME_COMP) |  
>>>>>>>>>> \
>>>>>>>>>> @@ -667,6 +668,8 @@ bool blk_queue_flag_test_and_set(unsigned int 
>>>>>>>>>> flag, struct request_queue *q);
>>>>>>>>>>  #define blk_queue_fua(q)test_bit(QUEUE_FLAG_FUA, 
>>>>>>>>>> &(q)->queue_flags)
>>>>>>>>>>  #define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, 
>>>>>>>>>> &(q)->queue_flags)
>>>>>>>>>>  #define blk_queue_nowait(q) test_bit(QUEUE_FLAG_NOWAIT, 
>>>>>>>>>> &(q)->queue_flags)
>>>>>>>>>> +#define blk_queue_limit_bio_size(q) \
>>>>>>>>>> +test_bit(QUEUE_FLAG_LIMIT_BIO_SIZE, &(q)->queue_flags)
>>>>>>>>>>  
>>>>>>>>>>  extern void blk_set_pm_only(struct request_queue *q);
>>>>>>>>>>  extern void blk_clear_pm_only(struct request_queue *q);
>>>>>>>>>> -- 
>>>>>>>>>> 2.28.0
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please feedback to me if more modification is needed to apply. :)
>>>>>>>>
>>>>>>>> You are adding code that tests for a value to be set, yet you never set
>>>>>>>> it in this code so why is it needed at all?
>>>>>>>
>>>>>>> This patch is a solution for some inefficient cases of multipage
>>>>>>> bvecs, such as the current DIO scenario. So it's not set as a default.
>>>>>>> It will be set when a bio size limitation is needed at runtime.
>>>>>>
>>>>>> Set where?
>>>>>
>>>>> In my environment, it is set in the init.rc file as below.
>>>>> "echo 1 > /sys/block/sda/queue/limit_bio_size"
>>>>
>>>> I do not see any sysfs file in this patch, and why would you ever want
>>>> to be forced to manually do this?  The hardware should know the limits
>>>> itself, and should automatically tune things like this, do not force a
>>>> user to do it as that's just not going to go well at all.
>>>
>>> The patch for sysfs was sent as "[RESEND,v5,2/2] bio: add limit_bio_size sysfs".
>>> Actually, I first suggested a constant value - 1MB - to limit the bio size.
>>> But I got feedback on patchwork that the patch would be better if it were
>>> optional, getting a meaningful value from the device queue.
>>> There are some differences between system environments, I think.
>>>
>>> But there is obviously some inefficient logic resulting from the use of
>>> multipage bvecs, so it will show up in several system environments.
>>> Currently, providing this patch as an option would be better, so it can be
>>> selected according to each system environment and policy, I think.
>>>
>>> Please, revisit applying this patch.
>>>
>>>>
>>>> So if this patch series is forcing a new option to be configured by
>>>> sysfs only, that's not acceptable, sorry.
>>>
>>> If it is not acceptable ever with current, may I progress review again
>>> with default enabled?
>>
>> I am sorry, I can not parse this, can you rephrase this?
>>
>> thanks,
>>
>> greg k-h
>>
> 
> I'll prepare a new patch as you recommend. It will add setting of
> limit_bio_size automatically when the queue max sectors is determined.

Please do that in the driver for the HW that benefits from it. Do not do this
for all block devices.
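
That is, something like this in the init path of the driver that actually
benefits (sketch only; QUEUE_FLAG_LIMIT_BIO_SIZE is the flag introduced by
this patchset, not an existing one, and DRIVER_MAX_HW_SECTORS is a
placeholder for the driver's real limit):

	/* Cap bio size to this device's request size, for this driver only */
	blk_queue_max_hw_sectors(q, DRIVER_MAX_HW_SECTORS);
	blk_queue_flag_set(QUEUE_FLAG_LIMIT_BIO_SIZE, q);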

> 
> 
> Thanks,
> 
> Changheun Lee
> 


-- 
Damien Le Moal
Western Digital Research


[PATCH 2/2] riscv: introduce asm/flat.h

2021-04-07 Thread Damien Le Moal
uclibc/gcc combined with the elf2flt riscv linker file fully resolve the
PC relative __global_pointer$ value at compile time and do not generate
a relocation entry to set a runtime gp value. As a result, if the
flatbin loader introduces a gap between the text and data sections, the
gp value becomes incorrect and prevents correct execution of a flatbin
executable. Avoid this problem by introducing the file asm/flat.h
and defining the macro FLAT_TEXT_DATA_NO_GAP to indicate that the text
and data sections must be loaded at contiguous addresses.

Signed-off-by: Damien Le Moal 
---
 arch/riscv/include/asm/Kbuild |  1 -
 arch/riscv/include/asm/flat.h | 29 +
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/include/asm/flat.h

diff --git a/arch/riscv/include/asm/Kbuild b/arch/riscv/include/asm/Kbuild
index 445ccc97305a..a8b54a3f4c2b 100644
--- a/arch/riscv/include/asm/Kbuild
+++ b/arch/riscv/include/asm/Kbuild
@@ -1,7 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 generic-y += early_ioremap.h
 generic-y += extable.h
-generic-y += flat.h
 generic-y += kvm_para.h
 generic-y += user.h
 generic-y += vmlinux.lds.h
diff --git a/arch/riscv/include/asm/flat.h b/arch/riscv/include/asm/flat.h
new file mode 100644
index ..43bccf090fd1
--- /dev/null
+++ b/arch/riscv/include/asm/flat.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_RISCV_FLAT_H
+#define _ASM_RISCV_FLAT_H
+
+#include <asm/unaligned.h>
+
+static inline int flat_get_addr_from_rp(u32 __user *rp, u32 relval, u32 flags,
+   u32 *addr)
+{
+   *addr = get_unaligned((__force u32 *)rp);
+   return 0;
+}
+
+static inline int flat_put_addr_at_rp(u32 __user *rp, u32 addr, u32 rel)
+{
+   put_unaligned(addr, (__force u32 *)rp);
+   return 0;
+}
+
+/*
+ * uclibc/gcc fully resolve the PC relative __global_pointer value
+ * at compile time and do not generate a relocation entry to set a
+ * runtime gp value. As a result, the flatbin loader must not introduce
+ * a gap between the text and data sections and keep them contiguous to
+ * avoid invalid address accesses.
+ */
+#define FLAT_TEXT_DATA_NO_GAP  (1)
+
+#endif /* _ASM_RISCV_FLAT_H */
-- 
2.30.2



[PATCH 0/2] Fix binfmt_flat loader for RISC-V

2021-04-07 Thread Damien Le Moal
RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
data sections as the toolchain fully resolves at compile time the PC
relative global pointer (__global_pointer$ value loaded in gp register).
Without a relocation entry provided, the flat bin loader cannot fix the
value if a gap is introduced and executables fail to run.

This series fixes this problem by allowing an architecture to request
the flat loader to suppress the gap between the text and data sections.
The first patch fixes binfmt_flat flat_load_file(). The second patch
adds the asm/flat.h file to riscv arch to request the gap suppression
using the newly introduced macro FLAT_TEXT_DATA_NO_GAP.

These patches do not change the binfmt_flat loader behavior for other
architectures.

Damien Le Moal (2):
  binfmt_flat: allow not offsetting data start
  riscv: introduce asm/flat.h

 arch/riscv/include/asm/Kbuild |  1 -
 arch/riscv/include/asm/flat.h | 29 +
 fs/binfmt_flat.c  | 25 +++--
 3 files changed, 48 insertions(+), 7 deletions(-)
 create mode 100644 arch/riscv/include/asm/flat.h

-- 
2.30.2



[PATCH 1/2] binfmt_flat: allow not offsetting data start

2021-04-07 Thread Damien Le Moal
Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap between
the text and data sections (e.g. riscv nommu), flat binary programs
cannot be executed.

To allow an architecture to request contiguous text and data sections,
introduce the macro FLAT_TEXT_DATA_NO_GAP which can be defined by the
architecture in its asm/flat.h file. With this change, the macro
DATA_GAP_WORDS is conditionally defined in binfmt_flat.c to
MAX_SHARED_LIBS for architectures tolerating the gap
(FLAT_TEXT_DATA_NO_GAP undefined case) and to 0 when
FLAT_TEXT_DATA_NO_GAP is defined. DATA_GAP_WORDS is used in
load_flat_file() to calculate the data section length and start
position.

The definition of FLAT_TEXT_DATA_NO_GAP by an architecture also
prevents the use of the separate text/data load case (when
FLAT_FLAG_RAM and FLAT_FLAG_GZIP are not set with NOMMU kernels).

Signed-off-by: Damien Le Moal 
---
 fs/binfmt_flat.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..2bfa05ac5cb4 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
 #defineMAX_SHARED_LIBS (1)
 #endif
 
+#ifdef FLAT_TEXT_DATA_NO_GAP
+#define DATA_GAP_WORDS (0)
+#else
+#define DATA_GAP_WORDS (MAX_SHARED_LIBS)
+#endif
+
 struct lib_info {
struct {
	unsigned long start_code;	/* Start of text segment */
@@ -437,7 +443,6 @@ static int load_flat_file(struct linux_binprm *bprm,
__be32 __user *reloc;
u32 __user *rp;
int i, rev, relocs;
-   loff_t fpos;
unsigned long start_code, end_code;
ssize_t result;
int ret;
@@ -560,6 +565,9 @@ static int load_flat_file(struct linux_binprm *bprm,
 * it all together.
 */
	if (!IS_ENABLED(CONFIG_MMU) && !(flags & (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+#ifndef FLAT_TEXT_DATA_NO_GAP
+   loff_t fpos;
+
/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +584,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
 
-	len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);
+   len = data_len + extra + DATA_GAP_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_GAP_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
 
pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
@@ -620,9 +628,14 @@ static int load_flat_file(struct linux_binprm *bprm,
(datapos + (ntohl(hdr->reloc_start) - text_len));
memp = realdatastart;
memp_size = len;
+#else
+   pr_err("Separate text/data loading not supported\n");
+   ret = -ENOEXEC;
+   goto err;
+#endif /* FLAT_TEXT_DATA_NO_GAP */
} else {
 
-	len = text_len + data_len + extra + MAX_SHARED_LIBS * sizeof(u32);
+	len = text_len + data_len + extra + DATA_GAP_WORDS * sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +651,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 
realdatastart = textpos + ntohl(hdr->data_start);
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_GAP_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
 
reloc = (__be32 __user *)
@@ -714,7 +727,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);
vm_munmap(textpos, text_len + data_len + extra +
-   MAX_SHARED_LIBS * sizeof(u32));
+ DATA_GAP_WORDS * sizeof(u32));
goto err;
}
}
-- 
2.30.2



Re: [null_blk] de3510e52b: blktests.block.014.fail

2021-04-07 Thread Damien Le Moal
On 2021/04/07 18:02, kernel test robot wrote:
> 
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: de3510e52b0a398261271455562458003b8eea62 ("null_blk: fix command 
> timeout completion handling")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 
> in testcase: blktests
> version: blktests-x86_64-a210761-1_20210124
> with following parameters:
> 
>   disk: 1SSD
>   test: nvme-group-00
>   ucode: 0x11
> 
> 
> 
> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 
> 80G memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> block/014 (run null-blk with blk-mq and timeout injection configured)
> block/014 (run null-blk with blk-mq and timeout injection configured) [failed]
> runtime  ...  71.624s
> --- tests/block/014.out 2021-01-24 06:04:08.0 +
> +++ /mnt/nvme-group-00/nodev/block/014.out.bad  2021-04-06 
> 09:21:25.133971868 +
> @@ -1,2 +1,377 @@
>  Running block/014
> +dd: error reading '/dev/nullb0': Connection timed out
> +dd: error reading '/dev/nullb0': Connection timed out
> +dd: error reading '/dev/nullb0': Connection timed out
> +dd: error reading '/dev/nullb0': Connection timed out
> +dd: error reading '/dev/nullb0': Connection timed out
> +dd: error reading '/dev/nullb0': Connection timed out
> ...
> (Run 'diff -u tests/block/014.out 
> /mnt/nvme-group-00/nodev/block/014.out.bad' to see the entire diff)

This is not a kernel bug. It is a problem with blktests. Before my patch, the
timeout error was not propagated back to the user. It is now, and causes dd to
fail. blktests, seeing dd fail, reports the test as failed. On the kernel side,
all is good: the requests are completed as expected.
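
For reference, the gist of the change is that the timeout handler now
completes the command with an error status instead of silently dropping it,
along these lines (a simplified sketch of the null_blk blk-mq timeout path,
not the patch verbatim):

	static enum blk_eh_timer_return null_timeout_rq(struct request *rq,
							bool reserved)
	{
		struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq);

		/* Complete the request with an error instead of dropping it */
		cmd->error = BLK_STS_TIMEOUT;
		blk_mq_complete_request(rq);
		return BLK_EH_DONE;
	}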

Note that the timeout error is reported back as is, using BLK_STS_TIMEOUT which
becomes ETIMEDOUT, hence the "Connection timed out" error message. Maybe we
should use the more traditional EIO ? Jens ?

In any case, I will send a patch to fix blktests block/014.


> 
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml   # job file is attached in
> this email
> bin/lkp split-job --compatible job.yaml
> bin/lkp run compatible-job.yaml
> 
> 
> 
> ---
> 0DAY/LKP+ Test Infrastructure   Open Source Technology Center
> https://lists.01.org/hyperkitty/list/l...@lists.01.org   Intel Corporation
> 
> Thanks,
> Oliver Sang
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 5/6] PCI: fu740: Add SiFive FU740 PCIe host controller driver

2021-04-01 Thread Damien Le Moal
> +	fu740_phyregwrite(1, PCIEX8MGMT_PHY_LANE1_BASE, PCIEX8MGMT_PHY_INIT_VAL, afp);
> +	fu740_phyregwrite(1, PCIEX8MGMT_PHY_LANE2_BASE, PCIEX8MGMT_PHY_INIT_VAL, afp);
> +	fu740_phyregwrite(1, PCIEX8MGMT_PHY_LANE3_BASE, PCIEX8MGMT_PHY_INIT_VAL, afp);
> +}
> +
> +static void fu740_pcie_ltssm_enable(struct device *dev)
> +{
> + struct fu740_pcie *afp = dev_get_drvdata(dev);
> +
> + /* Enable LTSSM */
> + writel_relaxed(0x1, afp->mgmt_base + PCIEX8MGMT_APP_LTSSM_ENABLE);
> +}
> +
> +static int fu740_pcie_start_link(struct dw_pcie *pci)
> +{
> + struct device *dev = pci->dev;

No need for this variable.

> +
> + /* Start LTSSM. */
> + fu740_pcie_ltssm_enable(dev);
> + return 0;
> +}
> +
> +static int fu740_pcie_host_init(struct pcie_port *pp)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> + struct fu740_pcie *afp = to_fu740_pcie(pci);
> + struct device *dev = pci->dev;
> + int ret;
> +
> + /* Power on reset */
> + fu740_pcie_drive_reset(afp);
> +
> + /* Enable pcieauxclk */
> + ret = clk_prepare_enable(afp->pcie_aux);
> + if (ret)
> + dev_err(dev, "unable to enable pcie_aux clock\n");

No bailing out ? Without a clock, is this going to work ?
If that is not a problem, then I would suggest a dev_warn() here.
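
That is, either bail out, e.g. (sketch only):

	ret = clk_prepare_enable(afp->pcie_aux);
	if (ret)
		return dev_err_probe(dev, ret,
				     "unable to enable pcie_aux clock\n");

or, if the driver can really live without the clock, keep going but demote
the message to a dev_warn().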

> +
> + /*
> +  * Assert hold_phy_rst (hold the controller LTSSM in reset after
> +  * power_up_rst_n for register programming with cr_para)
> +  */
> + writel_relaxed(0x1, afp->mgmt_base + PCIEX8MGMT_APP_HOLD_PHY_RST);
> +
> + /* Deassert power_up_rst_n */
> + ret = reset_control_deassert(afp->rst);
> + if (ret)
> + dev_err(dev, "unable to deassert pcie_power_up_rst_n\n");

Same as above.

> +
> + fu740_pcie_init_phy(afp);
> +
> + /* Disable pcieauxclk */
> + clk_disable_unprepare(afp->pcie_aux);
> + /* Clear hold_phy_rst */
> + writel_relaxed(0x0, afp->mgmt_base + PCIEX8MGMT_APP_HOLD_PHY_RST);
> + /* Enable pcieauxclk */
> + ret = clk_prepare_enable(afp->pcie_aux);
> + /* Set RC mode */
> + writel_relaxed(0x4, afp->mgmt_base + PCIEX8MGMT_DEVICE_TYPE);
> +
> + return 0;
> +}
> +
> +static const struct dw_pcie_host_ops fu740_pcie_host_ops = {
> + .host_init = fu740_pcie_host_init,
> +};
> +
> +static const struct dw_pcie_ops dw_pcie_ops = {
> + .start_link = fu740_pcie_start_link,
> +};
> +
> +static int fu740_pcie_probe(struct platform_device *pdev)
> +{
> + struct device *dev = &pdev->dev;
> + struct dw_pcie *pci;
> + struct fu740_pcie *afp;
> + int ret;
> +
> + afp = devm_kzalloc(dev, sizeof(*afp), GFP_KERNEL);
> + if (!afp)
> + return -ENOMEM;
> + pci = &afp->pci;
> + pci->dev = dev;
> + pci->ops = &dw_pcie_ops;
> + pci->pp.ops = &fu740_pcie_host_ops;
> +
> + /* SiFive specific region: mgmt */
> + afp->mgmt_base = devm_platform_ioremap_resource_byname(pdev, "mgmt");
> + if (IS_ERR(afp->mgmt_base))
> + return PTR_ERR(afp->mgmt_base);
> +
> + /* Fetch GPIOs */
> + afp->reset = devm_gpiod_get_optional(dev, "reset-gpios", GPIOD_OUT_LOW);
> + if (IS_ERR(afp->reset)) {
> + dev_err(dev, "unable to get reset-gpios\n");
> + return ret;
> + }
> + afp->pwren = devm_gpiod_get_optional(dev, "pwren-gpios", GPIOD_OUT_LOW);
> + if (IS_ERR(afp->pwren)) {
> + dev_err(dev, "unable to get pwren-gpios\n");
> + return ret;

Why not return dev_err_probe(...); ? Same for the returns above.

> + }
> +
> + /* Fetch clocks */
> + afp->pcie_aux = devm_clk_get(dev, "pcie_aux");
> + if (IS_ERR(afp->pcie_aux))
> + return dev_err_probe(dev, PTR_ERR(afp->pcie_aux),
> +  "pcie_aux clock source missing or 
> invalid\n");
> +
> + /* Fetch reset */
> + afp->rst = devm_reset_control_get_exclusive(dev, NULL);
> + if (IS_ERR(afp->rst))
> +	return dev_err_probe(dev, PTR_ERR(afp->rst), "unable to get reset\n");
> +
> + platform_set_drvdata(pdev, afp);
> +
> + ret = dw_pcie_host_init(&pci->pp);
> + if (ret < 0)
> + return ret;

You can simplify this with a simple:

return dw_pcie_host_init(&pci->pp);

> +
> + return 0;
> +}
> +
> +static void fu740_pcie_shutdown(struct platform_device *pdev)
> +{
> + struct fu740_pcie *afp = platform_get_drvdata(pdev);
> +
> + /* Bring down link, so bootloader gets clean state in case of reboot */
> + fu740_pcie_assert_reset(afp);
> +}
> +
> +static const struct of_device_id fu740_pcie_of_match[] = {
> + { .compatible = "sifive,fu740-pcie", },
> + {},
> +};
> +
> +static struct platform_driver fu740_pcie_driver = {
> + .driver = {
> +.name = "fu740-pcie",
> +.of_match_table = fu740_pcie_of_match,
> +.suppress_bind_attrs = true,
> + },
> + .probe = fu740_pcie_probe,
> + .shutdown = fu740_pcie_shutdown,
> +};
> +
> +builtin_platform_driver(fu740_pcie_driver);
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v4 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Damien Le Moal
On 2021/04/17 13:52, Greg Ungerer wrote:
> 
> On 17/4/21 11:10 am, Damien Le Moal wrote:
>> Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
>> the data start"") restored offsetting the start of the data section by
>> a number of words defined by MAX_SHARED_LIBS. As a result, since
>> MAX_SHARED_LIBS is never 0, a gap between the text and data sections
>> always exists. For architectures which cannot support such a gap
>> between the text and data sections (e.g. riscv nommu), flat binary
>> programs cannot be executed.
>>
>> To allow an architecture to request no data start offset to allow for
>> contiguous text and data sections for binaries flagged with
>> FLAT_FLAG_RAM, introduce the new config option
>> CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. Using this new option, the
>> macro DATA_START_OFFSET_WORDS is conditionally defined in binfmt_flat.c
>> to MAX_SHARED_LIBS for architectures tolerating or needing the data
>> start offset (CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET disabled case)
>> and to 0 when CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET is enabled.
>> DATA_START_OFFSET_WORDS is used in load_flat_file() to calculate the
>> data section length and start position.
>>
>> Signed-off-by: Damien Le Moal 
>> ---
>>   fs/Kconfig.binfmt |  3 +++
>>   fs/binfmt_flat.c  | 19 ++-
>>   2 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
>> index c6f1c8c1934e..06fb7a93a1bd 100644
>> --- a/fs/Kconfig.binfmt
>> +++ b/fs/Kconfig.binfmt
>> @@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
>>   config BINFMT_FLAT_OLD_ALWAYS_RAM
>>  bool
>>   
>> +config BINFMT_FLAT_NO_DATA_START_OFFSET
>> +bool
>> +
>>   config BINFMT_FLAT_OLD
>>  bool "Enable support for very old legacy flat binaries"
>>  depends on BINFMT_FLAT
>> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
>> index b9c658e0548e..1dc68dfba3e0 100644
>> --- a/fs/binfmt_flat.c
>> +++ b/fs/binfmt_flat.c
>> @@ -74,6 +74,12 @@
>>   #defineMAX_SHARED_LIBS (1)
>>   #endif
>>   
>> +#ifdef CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
>> +#define DATA_START_OFFSET_WORDS (0)
>> +#else
>> +#define DATA_START_OFFSET_WORDS (MAX_SHARED_LIBS)
>> +#endif
>> +
>>   struct lib_info {
>>  struct {
>>  unsigned long start_code;   /* Start of text 
>> segment */
>> @@ -560,6 +566,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>   * it all together.
>>   */
>>  if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
>> (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
>> +
> 
> Random white space change...
> Don't worry about re-spinning though, I will just edit this chunk out.

Oops. Sorry about that. I should have checked better :)

> 
> 
>>  /*
>>   * this should give us a ROM ptr,  but if it doesn't we don't
>>   * really care
>> @@ -576,7 +583,8 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  goto err;
>>  }
>>   
>> -	len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);
>> +len = data_len + extra +
>> +DATA_START_OFFSET_WORDS * sizeof(unsigned long);
>>  len = PAGE_ALIGN(len);
>>  realdatastart = vm_mmap(NULL, 0, len,
>>  PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
>> @@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  goto err;
>>  }
>>  datapos = ALIGN(realdatastart +
>> -MAX_SHARED_LIBS * sizeof(unsigned long),
>> +DATA_START_OFFSET_WORDS * sizeof(unsigned long),
>>  FLAT_DATA_ALIGN);
>>   
>>  pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
>> @@ -622,7 +630,8 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  memp_size = len;
>>  } else {
>>   
>> -	len = text_len + data_len + extra + MAX_SHARED_LIBS * sizeof(u32);
>> +len = text_len + data_len + extra +
>> +DATA_START_OFFSET_WORDS * sizeof(u32);
>>  len = PAGE_ALIGN(len);
>>  textpos = vm_mmap(NULL, 0, len,
>>

Re: [RESEND,v5,1/2] bio: limit bio max size

2021-04-11 Thread Damien Le Moal
On 2021/04/09 23:47, Bart Van Assche wrote:
> On 4/7/21 3:27 AM, Damien Le Moal wrote:
>> On 2021/04/07 18:46, Changheun Lee wrote:
>>> I'll prepare new patch as you recommand. It will be added setting of
>>> limit_bio_size automatically when queue max sectors is determined.
>>
>> Please do that in the driver for the HW that benefits from it. Do not do this
>> for all block devices.
> 
> Hmm ... is it ever useful to build a bio with a size that exceeds 
> max_hw_sectors when submitting a bio directly to a block device, or in 
> other words, if no stacked block driver sits between the submitter and 
> the block device? Am I perhaps missing something?

Device performance wise, the benefits are certainly not obvious to me either.
But for very fast block devices, I think the CPU overhead of building many
smaller BIOs may be significant compared to splitting a large BIO into multiple
requests. Though it may be good to revisit this with some benchmark numbers.

> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-04-11 Thread Damien Le Moal
On 2021/04/07 20:33, Selva Jove wrote:
> Initially I started moving the dm-kcopyd interface to the block layer
> as a generic interface.
> Once I dug deeper into the dm-kcopyd code, I figured out that dm-kcopyd
> is tightly coupled with dm_io().
> 
> To move dm-kcopyd to the block layer, it would also require the dm_io
> code to be moved to the block layer.
> It would cause havoc in the dm layer, as dm_io is the backbone of the
> dm layer, and it would need a complete rewrite of the dm layer. Do you
> see any other way of doing this without having to move the dm_io code
> or to have redundant code ?

Right. Missed that. So reusing dm-kcopyd and making it a common interface will
take some more effort. OK, then. For the first round of commits, let's forget
about this. But I still think that your emulation could be a lot better than a
loop doing blocking writes after blocking reads.
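
That is, something along these lines (a rough sketch only: issue all the
reads under a plug and let each read completion submit its matching write;
copy_emulate_read_end_io() is just a placeholder name for such a completion
handler):

	struct blk_plug plug;

	blk_start_plug(&plug);
	for (i = 0; i < nr_ranges; i++) {
		/* the end_io handler submits the matching write bio */
		read_bio[i]->bi_end_io = copy_emulate_read_end_io;
		submit_bio(read_bio[i]);
	}
	blk_finish_plug(&plug);

	/* then wait for all the writes and return the first error, if any */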

[...]
>>> +int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
>>> + struct range_entry *src_rlist, struct block_device *dest_bdev,
>>> + sector_t dest, gfp_t gfp_mask, int flags)
>>> +{
>>> + struct request_queue *q = bdev_get_queue(src_bdev);
>>> + struct request_queue *dest_q = bdev_get_queue(dest_bdev);
>>> + struct blk_copy_payload *payload;
>>> + sector_t bs_mask, copy_size;
>>> + int ret;
>>> +
>>> + ret = blk_prepare_payload(src_bdev, nr_srcs, src_rlist, gfp_mask,
>>> + &payload, &copy_size);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>> + if (dest & bs_mask) {
>>> + return -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + if (q == dest_q && q->limits.copy_offload) {
>>> + ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
>>> + if (ret)
>>> + goto out;
>>> + } else if (flags & BLKDEV_COPY_NOEMULATION) {
>>
>> Why ? whoever calls blkdev_issue_copy() wants a copy to be done. Why would 
>> that
>> user say "Fail on me if the device does not support copy" ??? This is a weird
>> interface in my opinion.
>>
> 
> BLKDEV_COPY_NOEMULATION flag was introduced to allow blkdev_issue_copy() 
> callers
> to use their native copying method instead of the emulated copy that I
> added. This way we
> ensure that dm uses the hw-assisted copy and if that is not present,
> it falls back to existing
> copy method.
> 
> The other users who don't have their native emulation can use this
> emulated-copy implementation.

I do not understand. Emulation or not should be entirely driven by the device
reporting support for simple copy (or not). It does not matter which component
is issuing the simple copy call: an FS to a real device, an FS to a DM device
or a DM target driver. If the underlying device reported support for simple
copy, use that. Otherwise, emulate with read/write. What am I missing here ?
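
That is, the dispatch inside blkdev_issue_copy() should reduce to something
like this (a sketch using this patchset's names; blk_copy_emulate() only
stands for the read/write emulation path, whatever it ends up being called):

	if (q == dest_q && q->limits.copy_offload)
		ret = blk_copy_offload(src_bdev, payload, dest, gfp_mask);
	else
		ret = blk_copy_emulate(src_bdev, payload, dest, gfp_mask);

No flag needed: the caller always gets its copy done, one way or the other.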

[...]
>>> @@ -565,6 +569,12 @@ int blk_stack_limits(struct queue_limits *t, struct 
>>> queue_limits *b,
>>>   if (b->chunk_sectors)
>>>   t->chunk_sectors = gcd(t->chunk_sectors, b->chunk_sectors);
>>>
>>> + /* simple copy not supported in stacked devices */
>>> + t->copy_offload = 0;
>>> + t->max_copy_sectors = 0;
>>> + t->max_copy_range_sectors = 0;
>>> + t->max_copy_nr_ranges = 0;
>>
>> You do not need this. Limits not explicitely initialized are 0 already.
>> But I do not see why you can't support copy on stacked devices. That should 
>> be
>> feasible taking the min() for each of the above limit.
>>
> 
> Disabling stacked device support was feedback from v2.
> 
> https://patchwork.kernel.org/project/linux-block/patch/20201204094659.12732-2-selvakuma...@samsung.com/

Right. But the initialization to 0 is still not needed. The fields are already
initialized to 0.


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v5 2/4] block: add simple copy support

2021-02-19 Thread Damien Le Moal
);
> + if (!rlist)
> + return -ENOMEM;
> +
> + if (copy_from_user(rlist, (void __user *)crange.range_list,
> + sizeof(*rlist) * crange.nr_range)) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + ret = blkdev_issue_copy(bdev, crange.nr_range, rlist, bdev, crange.dest,
> + GFP_KERNEL, flags);
> +out:
> + kfree(rlist);
> + return ret;
> +}
> +
>  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>   unsigned long arg)
>  {
> @@ -458,6 +489,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, 
> fmode_t mode,
>   case BLKSECDISCARD:
>   return blk_ioctl_discard(bdev, mode, arg,
>   BLKDEV_DISCARD_SECURE);
> + case BLKCOPY:
> + return blk_ioctl_copy(bdev, mode, arg, 0);
>   case BLKZEROOUT:
>   return blk_ioctl_zeroout(bdev, mode, arg);
>   case BLKREPORTZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..164313bdfb35 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
>  static inline bool bio_no_advance_iter(const struct bio *bio)
>  {
>   return bio_op(bio) == REQ_OP_DISCARD ||
> +bio_op(bio) == REQ_OP_COPY ||
>  bio_op(bio) == REQ_OP_SECURE_ERASE ||
>  bio_op(bio) == REQ_OP_WRITE_SAME ||
>  bio_op(bio) == REQ_OP_WRITE_ZEROES;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..5a35c02ac0a8 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -380,6 +380,8 @@ enum req_opf {
>   REQ_OP_ZONE_RESET   = 15,
>   /* reset all the zone present on the device */
>   REQ_OP_ZONE_RESET_ALL   = 17,
> + /* copy ranges within device */
> + REQ_OP_COPY = 19,
>  
>   /* SCSI passthrough using struct scsi_request */
>   REQ_OP_SCSI_IN  = 32,
> @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
>   return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
>  }
>  
> +static inline bool op_is_copy(unsigned int op)
> +{
> + return (op & REQ_OP_MASK) == REQ_OP_COPY;
> +}
> +
>  /*
>   * Check if a bio or request operation is a zone management operation, with
>   * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> @@ -565,4 +572,11 @@ struct blk_rq_stat {
>   u64 batch;
>  };
>  
> +struct blk_copy_payload {
> + sector_tdest;
> + int copy_nr_ranges;
> + struct block_device *src_bdev;
> + struct  range_entry range[];
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 699ace6b25ff..2bb4513d4bb8 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -337,10 +337,14 @@ struct queue_limits {
>   unsigned intmax_zone_append_sectors;
>   unsigned intdiscard_granularity;
>   unsigned intdiscard_alignment;
> + unsigned intcopy_offload;
> + unsigned intmax_copy_sectors;
>  
>   unsigned short  max_segments;
>   unsigned short  max_integrity_segments;
>   unsigned short  max_discard_segments;
> + unsigned short  max_copy_range_sectors;
> + unsigned short  max_copy_nr_ranges;
>  
>   unsigned char   misaligned;
>   unsigned char   discard_misaligned;
> @@ -621,6 +625,7 @@ struct request_queue {
>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27  /* record rq->alloc_time_ns */
>  #define QUEUE_FLAG_HCTX_ACTIVE   28  /* at least one blk-mq hctx is 
> active */
>  #define QUEUE_FLAG_NOWAIT   29   /* device supports NOWAIT */
> +#define QUEUE_FLAG_SIMPLE_COPY   30  /* supports simple copy */
>  
>  #define QUEUE_FLAG_MQ_DEFAULT((1 << QUEUE_FLAG_IO_STAT) |
> \
>(1 << QUEUE_FLAG_SAME_COMP) |  \
> @@ -643,6 +648,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, 
> struct request_queue *q);
>  #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
>  #define blk_queue_add_random(q)  test_bit(QUEUE_FLAG_ADD_RANDOM, 
> &(q)->queue_flags)
>  #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> +#define blk_queue_copy(q)test_bit(QUEUE_FLAG_SIMPLE_COPY, 
> &(q)->queue_flags)
>  #define blk_queue_zone_resetall(q)   \
>   test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
>  #define blk_queue_secure_erase(q) \
> @@ -1069,6 +1075,9 @@ static inline unsigned int 
> blk_queue_get_max_sectors(struct request_queue *q,
>   return min(q->limits.max_discard_sectors,
>  UINT_MAX >> SECTOR_SHIFT);
>  
> + if (unlikely(op == REQ_OP_COPY))
> + return q->limits.max_copy_sectors;
> +

I would agree with this if a copy BIO were always a single range, but that is
not the case. So I am not sure this makes sense at all.

>   if (unlikely(op == REQ_OP_WRITE_SAME))
>   return q->limits.max_write_same_sectors;
>  
> @@ -1343,6 +1352,12 @@ extern int __blkdev_issue_discard(struct block_device 
> *bdev, sector_t sector,
>   sector_t nr_sects, gfp_t gfp_mask, int flags,
>   struct bio **biop);
>  
> +#define BLKDEV_COPY_NOEMULATION	(1 << 0)	/* do not emulate if copy offload not supported */
> +
> +extern int blkdev_issue_copy(struct block_device *src_bdev, int nr_srcs,
> + struct range_entry *src_rlist, struct block_device *dest_bdev,
> + sector_t dest, gfp_t gfp_mask, int flags);

No need for extern.

> +
>  #define BLKDEV_ZERO_NOUNMAP  (1 << 0)  /* do not free blocks */
>  #define BLKDEV_ZERO_NOFALLBACK   (1 << 1)  /* don't write explicit 
> zeroes */
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..5cadb176317a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,18 @@ struct fstrim_range {
>   __u64 minlen;
>  };
>  
> +struct range_entry {
> + __u64 src;
> + __u64 len;
> +};
> +
> +struct copy_range {
> + __u64 dest;
> + __u64 nr_range;
> + __u64 range_list;
> + __u64 rsvd;
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions 
> */
>  #define FILE_DEDUPE_RANGE_SAME   0
>  #define FILE_DEDUPE_RANGE_DIFFERS1
> @@ -184,6 +196,7 @@ struct fsxattr {
>  #define BLKSECDISCARD _IO(0x12,125)
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
> +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
>  /*
>   * A jump here: 130-131 are reserved for zoned block devices
>   * (see uapi/linux/blkzoned.h)
> 

Please test your code more thoroughly. It is full of problems that you should
have detected with better testing, including RO devices, partitions and error
path coverage.

-- 
Damien Le Moal
Western Digital Research


Re: [dm-devel] [RFC PATCH v2 1/2] block: add simple copy support

2020-12-08 Thread Damien Le Moal
On 2020/12/09 13:20, Martin K. Petersen wrote:
> 
> SelvaKumar,
> 
>> Add new BLKCOPY ioctl that offloads copying of multiple sources
>> to a destination to the device.
> 
> Your patches are limited in scope to what is currently possible with
> NVMe. I.e. multiple source ranges to a single destination within the
> same device. That's fine, I think the garbage collection use case is
> valid and worth pursuing.
> 
> I just wanted to go over what the pain points were for the various
> attempts in SCSI over the years.
> 
> The main headache was due the stacking situation with DM and MD.
> Restricting offload to raw SCSI disks would have been simple but not
> really a good fit for most real world developments that often use DM or
> MD to provision the storage.
> 
> Things are simple for DM/MD with reads and writes because you have one
> bio as parent that may get split into many clones that complete
> individually prior to the parent being marked as completed.
> 
> In the copy offload scenario things quickly become complex once both
> source and destination ranges have to be split into multiple commands
> for potentially multiple devices. And these clones then need to be
> correctly paired at the bottom of the stack. There's also no guarantee
> that a 1MB source range maps to a single 1MB destination range. So you
> could end up with an M:N relationship to resolve.
> 
> After a few failed attempts we focused on single source range/single
> destination range. Just to simplify the slicing and dicing. That worked
> reasonably well. However, then came along the token-based commands in
> SCSI and those threw a wrench in the gears. Now the block layer plumbing
> had to support two completely different semantic approaches.
> 
> Inspired by a combination of Mikulas' efforts with pointer matching and
> the token-based approach in SCSI I switched the block layer
> implementation from a single operation (REQ_COPY) to something similar
> to the SCSI token approach with a REQ_COPY_IN and a REQ_COPY_OUT.
> 
> The premise being that you would send a command to the source device and
> "get" the data. In the EXTENDED COPY scenario, the data wasn't really
> anything but a confirmation from the SCSI disk driver that the I/O had
> reached the bottom of the stack without being split by DM/MD. And once
> completion of the REQ_COPY_IN reached blk-lib, a REQ_COPY_OUT would be
> issued and, if that arrived unchanged in the disk driver, get turned
> into an EXTENDED COPY sent to the destination.
> 
> In the token-based scenario the same thing happened except POPULATE
> TOKEN was sent all the way out to the device to receive a cookie
> representing the source block ranges. Upon completion, that cookie was
> used by blk-lib to issue a REQ_COPY_OUT command which was then sent to
> the destination device. Again only if the REQ_COPY_OUT I/O hadn't been
> split traversing the stack.
> 
> The idea was to subsequently leverage the separation of REQ_COPY_IN and
> REQ_COPY_OUT to permit a DM/MD iterative approach to both stages of the
> operation. That seemed to me like the only reasonable way to approach
> the M:N splitting problem (if at all)...

Another simple approach, at least initially for the first drop, would be to
disable any sort of native hardware-based copy for stacked devices. These
devices would simply not advertise copy support in their request queue flags,
forcing the block layer generic copy API to do reads and writes, very much like
dm-kcopyd. Use cases where a drive with native copy support is used directly
would still be able to benefit from the native hardware function, possibly
gated by a sysfs switch (which by default could be off).

Integrating NVMe simple copy into such initial support would, I think, be quite
simple, and scsi xcopy can follow. From there, stacked device support can be
worked on with little, if any, impact on the existing users of the block copy
API (mostly FSes such as f2fs and btrfs).


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-04 Thread Damien Le Moal
On 2020/12/04 20:02, SelvaKumar S wrote:
> This patchset tries to add support for TP4065a ("Simple Copy Command"),
> v2020.05.04 ("Ratified")
> 
> The Specification can be found in following link.
> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> 
> This is an RFC. Looking forward for any feedbacks or other alternate
> designs for plumbing simple copy to IO stack.
> 
> Simple copy command is a copy offloading operation and is used to copy
> multiple contiguous ranges (source_ranges) of LBAs to a single destination
> LBA within the device, reducing traffic between host and device.
> 
> This implementation accepts the destination, the number of sources and an
> array of source ranges from the application, attaches it as a payload to
> the bio and submits it to the device.
> 
> Following limits are added to queue limits and are exposed in sysfs
> to userspace
>   - *max_copy_sectors* limits the sum of all source_range length
>   - *max_copy_nr_ranges* limits the number of source ranges
>   - *max_copy_range_sectors* limit the maximum number of sectors
>   that can constitute a single source range.

Same comment as before. I think this is a good start, but for this to be really
useful to users and kernel components alike, this really needs copy emulation
for drives that do not have a native copy feature, similarly to how write
zeroes are handled, for instance: if the drive does not have a copy command
(simple copy for NVMe or XCOPY for scsi), then the block layer should issue
read/write commands to seamlessly execute the copy. Otherwise, this will only
serve a small niche of users and will not be optimal for FS and DM drivers that
could be simplified with a generic block layer copy functionality.

This is my 10 cents though, others may differ about this.

> 
> Changes from v1:
> 
> 1. Fix memory leak in __blkdev_issue_copy
> 2. Unmark blk_check_copy inline
> 3. Fix line break in blk_check_copy_eod
> 4. Remove p checks and made code more readable
> 5. Don't use bio_set_op_attrs and remove op and set
>bi_opf directly
> 6. Use struct_size to calculate total_size
> 7. Fix partition remap of copy destination
> 8. Remove mcl,mssrl,msrc from nvme_ns
> 9. Initialize copy queue limits to 0 in nvme_config_copy
> 10. Remove return in QUEUE_FLAG_COPY check
> 11. Remove unused OCFS
> 
> SelvaKumar S (2):
>   block: add simple copy support
>   nvme: add simple copy support
> 
>  block/blk-core.c  |  94 ++---
>  block/blk-lib.c   | 123 ++
>  block/blk-merge.c |   2 +
>  block/blk-settings.c  |  11 
>  block/blk-sysfs.c |  23 +++
>  block/blk-zoned.c |   1 +
>  block/bounce.c|   1 +
>  block/ioctl.c |  43 +
>  drivers/nvme/host/core.c  |  87 +++
>  include/linux/bio.h   |   1 +
>  include/linux/blk_types.h |  15 +
>  include/linux/blkdev.h|  15 +
>  include/linux/nvme.h  |  43 -
>  include/uapi/linux/fs.h   |  13 
>  14 files changed, 461 insertions(+), 11 deletions(-)
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Damien Le Moal
On 2020/12/07 16:46, javier.g...@samsung.com wrote:
> On 04.12.2020 23:40, Keith Busch wrote:
>> On Fri, Dec 04, 2020 at 11:25:12AM +0000, Damien Le Moal wrote:
>>> On 2020/12/04 20:02, SelvaKumar S wrote:
>>>> This patchset tries to add support for TP4065a ("Simple Copy Command"),
>>>> v2020.05.04 ("Ratified")
>>>>
>>>> The Specification can be found in following link.
>>>> https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
>>>>
>>>> This is an RFC. Looking forward for any feedbacks or other alternate
>>>> designs for plumbing simple copy to IO stack.
>>>>
>>>> Simple copy command is a copy offloading operation and is used to copy
>>>> multiple contiguous ranges (source_ranges) of LBAs to a single destination
>>>> LBA within the device, reducing traffic between host and device.
>>>>
>>>> This implementation accepts the destination, the number of sources and an
>>>> array of source ranges from the application, attaches it as a payload to
>>>> the bio and submits it to the device.
>>>>
>>>> Following limits are added to queue limits and are exposed in sysfs
>>>> to userspace
>>>>- *max_copy_sectors* limits the sum of all source_range length
>>>>- *max_copy_nr_ranges* limits the number of source ranges
>>>>- *max_copy_range_sectors* limit the maximum number of sectors
>>>>that can constitute a single source range.
>>>
>>> Same comment as before. I think this is a good start, but for this to be 
>>> really
>>> useful to users and kernel components alike, this really needs copy 
>>> emulation
>>> for drives that do not have a native copy feature, similarly to what write 
>>> zeros
>>> handling for instance: if the drive does not have a copy command (simple 
>>> copy
>>> for NVMe or XCOPY for scsi), then the block layer should issue read/write
>>> commands to seamlessly execute the copy. Otherwise, this will only serve a 
>>> small
>>> niche for users and will not be optimal for FS and DM drivers that could be
>>> simplified with a generic block layer copy functionality.
>>>
>>> This is my 10 cents though, others may differ about this.
>>
>> Yes, I agree that copy emulation support should be included with the
>> hardware enabled solution.
> 
> Keith, Damien,
> 
> Can we do the block layer emulation with this patchset and then work in
> follow-up patchses on (i) the FS interface with F2FS as a first user and
> (ii) other HW accelerations such as XCOPY?

The initial patchset, supporting NVMe simple copy and copy emulation under an
API that will probably be similar to that of dm-kcopyd, will cover all block
devices. Other hardware-native copy functions, such as scsi extended copy, can
be added later under the hood without any API changes (or with minimal
changes).
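
For reference, the dm-kcopyd interface mentioned above looks roughly like this
(from include/linux/dm-kcopyd.h; the exact signature may differ between kernel
versions):

	void dm_kcopyd_copy(struct dm_kcopyd_client *kc,
			    struct dm_io_region *from,
			    unsigned int num_dests, struct dm_io_region *dests,
			    unsigned int flags,
			    dm_kcopyd_notify_fn fn, void *context);

with struct dm_io_region describing a block device, a start sector and a sector
count.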

I am not sure what you mean by "FS interface for F2FS": the block layer API for
this copy functionality will be what F2FS (and other FSes) will call. That is
the interface, no ?

> For XCOPY, I believe we need to have a separate discussion as much works
> is already done that we should align to.

I think Martin (added to this thread) and others have looked into it but I do
not think that anything made it into the kernel yet.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-24 Thread Damien Le Moal
On 2020/11/25 3:57, Geert Uytterhoeven wrote:
> There is no need to enable Virtual Terminal support in the Canaan
> Kendryte K210 defconfigs, as no terminal devices are supported and
> enabled.  Hence disable CONFIG_VT, and remove the no longer needed
> override for CONFIG_VGA_CONSOLE.
> 
> This reduces kernel size by ca. 65 KiB.

Indeed, nice saving. Just tested, and all is good.

Can I squash this patch into the 2 defconfig update patches of the series,
adding your signed-off-by ? Or would you prefer that I keep it as a separate 
patch ?

> 
> Signed-off-by: Geert Uytterhoeven 
> ---
> Against k210-sysctl-v15
> ---
>  arch/riscv/configs/nommu_k210_defconfig| 2 +-
>  arch/riscv/configs/nommu_k210_sdcard_defconfig | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/riscv/configs/nommu_k210_defconfig 
> b/arch/riscv/configs/nommu_k210_defconfig
> index df89d53bd125679b..9262223037e43479 100644
> --- a/arch/riscv/configs/nommu_k210_defconfig
> +++ b/arch/riscv/configs/nommu_k210_defconfig
> @@ -48,6 +48,7 @@ CONFIG_DEVTMPFS_MOUNT=y
>  # CONFIG_INPUT_KEYBOARD is not set
>  # CONFIG_INPUT_MOUSE is not set
>  # CONFIG_SERIO is not set
> +# CONFIG_VT is not set
>  # CONFIG_LEGACY_PTYS is not set
>  # CONFIG_LDISC_AUTOLOAD is not set
>  # CONFIG_HW_RANDOM is not set
> @@ -67,7 +68,6 @@ CONFIG_GPIO_SIFIVE=y
>  CONFIG_POWER_RESET=y
>  CONFIG_POWER_RESET_SYSCON=y
>  # CONFIG_HWMON is not set
> -# CONFIG_VGA_CONSOLE is not set
>  # CONFIG_HID is not set
>  # CONFIG_USB_SUPPORT is not set
>  CONFIG_NEW_LEDS=y
> diff --git a/arch/riscv/configs/nommu_k210_sdcard_defconfig 
> b/arch/riscv/configs/nommu_k210_sdcard_defconfig
> index 3d2cb4747e7f85b7..4cd1715dd0cf3747 100644
> --- a/arch/riscv/configs/nommu_k210_sdcard_defconfig
> +++ b/arch/riscv/configs/nommu_k210_sdcard_defconfig
> @@ -41,6 +41,7 @@ CONFIG_DEVTMPFS_MOUNT=y
>  # CONFIG_INPUT_KEYBOARD is not set
>  # CONFIG_INPUT_MOUSE is not set
>  # CONFIG_SERIO is not set
> +# CONFIG_VT is not set
>  # CONFIG_LEGACY_PTYS is not set
>  # CONFIG_LDISC_AUTOLOAD is not set
>  # CONFIG_HW_RANDOM is not set
> @@ -60,7 +61,6 @@ CONFIG_GPIO_SIFIVE=y
>  CONFIG_POWER_RESET=y
>  CONFIG_POWER_RESET_SYSCON=y
>  # CONFIG_HWMON is not set
> -# CONFIG_VGA_CONSOLE is not set
>  # CONFIG_HID is not set
>  # CONFIG_USB_SUPPORT is not set
>  CONFIG_MMC=y
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] efi: EFI_EARLYCON should depend on EFI

2020-11-25 Thread Damien Le Moal
On Tue, 2020-11-24 at 20:16 +0100, Geert Uytterhoeven wrote:
> CONFIG_EFI_EARLYCON defaults to yes, and thus is enabled on systems that
> do not support EFI, or do not have EFI support enabled, but do satisfy
> the symbol's other dependencies.
> 
> While drivers/firmware/efi/ won't be entered during the build phase if
> CONFIG_EFI=n, and drivers/firmware/efi/earlycon.c itself thus won't be
> built, enabling EFI_EARLYCON does force-enable CONFIG_FONT_SUPPORT and
> CONFIG_ARCH_USE_MEMREMAP_PROT, and CONFIG_FONT_8x16, which is
> undesirable.
> 
> Fix this by making CONFIG_EFI_EARLYCON depend on CONFIG_EFI.
> 
> This reduces kernel size on headless systems by more than 4 KiB.
> 
> Fixes: 69c1f396f25b805a ("efi/x86: Convert x86 EFI earlyprintk into generic 
> earlycon implementation")
> Signed-off-by: Geert Uytterhoeven 
> ---
>  drivers/firmware/efi/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
> index b452cfa2100b401c..1dd1f7784f0888ff 100644
> --- a/drivers/firmware/efi/Kconfig
> +++ b/drivers/firmware/efi/Kconfig
> @@ -270,7 +270,7 @@ config EFI_DEV_PATH_PARSER
>  
> 
>  config EFI_EARLYCON
>   def_bool y
> - depends on SERIAL_EARLYCON && !ARM && !IA64
> + depends on EFI && SERIAL_EARLYCON && !ARM && !IA64
>   select FONT_SUPPORT
>   select ARCH_USE_MEMREMAP_PROT
>  
> 

Looks good to me.
Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-25 Thread Damien Le Moal
On Wed, 2020-11-25 at 09:20 +0100, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Wed, Nov 25, 2020 at 7:14 AM Damien Le Moal  wrote:
> > On 2020/11/25 3:57, Geert Uytterhoeven wrote:
> > > There is no need to enable Virtual Terminal support in the Canaan
> > > Kendryte K210 defconfigs, as no terminal devices are supported and
> > > enabled.  Hence disable CONFIG_VT, and remove the no longer needed
> > > override for CONFIG_VGA_CONSOLE.
> > > 
> > > This reduces kernel size by ca. 65 KiB.
> > 
> > Indeed, nice saving. Just tested, and all is good.
> > 
> > Can I squash this patch into the 2 defconfig update patches of the series,
> > adding your signed-off-by ? Or would you prefer that I keep it as a 
> > separate patch ?
> 
> Feel free to squash it into your queued updates.
> No need to add my SoB, as the full updates don't pass through me.

Done. Thanks !

> 
> Thanks!
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 

-- 
Damien Le Moal
Western Digital


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-25 Thread Damien Le Moal
On 2020/11/25 17:51, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Wed, Nov 25, 2020 at 7:14 AM Damien Le Moal  wrote:
>> On 2020/11/25 3:57, Geert Uytterhoeven wrote:
>>> There is no need to enable Virtual Terminal support in the Canaan
>>> Kendryte K210 defconfigs, as no terminal devices are supported and
>>> enabled.  Hence disable CONFIG_VT, and remove the no longer needed
>>> override for CONFIG_VGA_CONSOLE.
>>>
>>> This reduces kernel size by ca. 65 KiB.
>>
>> Indeed, nice saving. Just tested, and all is good.
> 
> I used my old script[1] to check the impact of disabling config options.
> 
> I don't see any other low-hanging fruits:
> 
> Disabling CONFIG_BLOCK saves 492890 bytes
> Disabling CONFIG_EXT4_FS saves 322528 bytes
> Disabling CONFIG_PRINTK saves 214612 bytes
> Disabling CONFIG_SMP saves 214486 bytes
> Disabling CONFIG_FRAME_POINTER saves 166368 bytes
> Disabling CONFIG_TTY saves 156618 bytes
> Disabling CONFIG_PROC_FS saves 110274 bytes
> Disabling CONFIG_MMC saves 87656 bytes
> Disabling CONFIG_VT saves 70350 bytes
> Disabling CONFIG_SYSFS saves 62298 bytes
> Disabling CONFIG_BUG saves 50882 bytes
> Disabling CONFIG_SPI saves 34420 bytes
> Disabling CONFIG_SOC_CANAAN saves 34286 bytes
> Disabling CONFIG_I2C saves 34086 bytes
> Disabling CONFIG_PROC_SYSCTL saves 23890 bytes
> Disabling CONFIG_POSIX_TIMERS saves 18388 bytes
> Disabling CONFIG_I2C_DESIGNWARE_PLATFORM saves 17530 bytes
> Disabling CONFIG_MMC_BLOCK saves 17200 bytes
> Disabling CONFIG_UNIX98_PTYS saves 16360 bytes
> Disabling CONFIG_MULTIUSER saves 16148 bytes
> Disabling CONFIG_NEW_LEDS saves 15964 bytes
> Disabling CONFIG_SPI_DESIGNWARE saves 15434 bytes
> Disabling CONFIG_GPIO_CDEV saves 15352 bytes
> Disabling CONFIG_MMC_SPI saves 14754 bytes
> Disabling CONFIG_SOC_CANAAN_K210_DTB_BUILTIN saves 13864 bytes
> 
> (Yes, I have ext4 enabled ;-)
> 
> I haven't done enough riscv kernel development yet to assess if I need
> CONFIG_FRAME_POINTER or not.

Disabling it significantly reduced code size for me. Since the series is more
stable now, it is not really needed, so I disabled it in the defconfig.

I was just fiddling with CONFIG_UNIX98_PTYS. Disabling it is OK with the simple
busybox userspace (no telnet/xterm like app running). But it saves only about
1KB with my toolchain (gcc 9.3). So I left that one enabled. I am surprised that
you see 16K size impact... How big is your image ?

For me, it is 1.768 MB right now for the sdcard defconfig, with CONFIG_VT
disabled and ext2 enabled.

Disabling the block layer, ext2 and mmc driver, the size goes down to 1.511 MB
without any intramfs cpio file embedded.

> 
> [1] 
> https://github.com/geertu/linux-scripts/blob/master/linux-analyze-marginal-sizes

Thanks ! I will try to run this on my end.

> 
> Gr{oetje,eeting}s,
> 
> Geert
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-25 Thread Damien Le Moal
On 2020/11/25 18:26, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Wed, Nov 25, 2020 at 10:02 AM Damien Le Moal  wrote:
>> On 2020/11/25 17:51, Geert Uytterhoeven wrote:
>>> On Wed, Nov 25, 2020 at 7:14 AM Damien Le Moal  
>>> wrote:
>>>> On 2020/11/25 3:57, Geert Uytterhoeven wrote:
>>>>> There is no need to enable Virtual Terminal support in the Canaan
>>>>> Kendryte K210 defconfigs, as no terminal devices are supported and
>>>>> enabled.  Hence disable CONFIG_VT, and remove the no longer needed
>>>>> override for CONFIG_VGA_CONSOLE.
>>>>>
>>>>> This reduces kernel size by ca. 65 KiB.
>>>>
>>>> Indeed, nice saving. Just tested, and all is good.
>>>
>>> I used my old script[1] to check the impact of disabling config options.
> 
>>> I haven't done enough riscv kernel development yet to assess if I need
>>> CONFIG_FRAME_POINTER or not.
>>
>> Disabling it significantly reduced code size for me. Since the series is more
>> stable now, it is not really needed, so I disabled it in the defconfig.
>>
>> I was just fiddling with CONFIG_UNIX98_PTYS. Disabling it is OK with the 
>> simple
>> busybox userspace (no telnet/xterm like app running). But it saves only about
>> 1KB with my toolchain (gcc 9.3). So I left that one enabled. I am surprised 
>> that
>> you see 16K size impact... How big is your image ?
>>
>> For me, it is 1.768 MB right now for the sdcard defconfig, with CONFIG_VT
>> disabled and ext2 enabled.
> 
> It might depend on how you measure.  "size" says 15 KiB impact for UNIX98
> ptys, while bloat-o-meter reported less than 7 (my script uses "size").

I look at the size of the arch/riscv/boot/loader.bin file since that is what
gets loaded in RAM and booted. It is significantly smaller than vmlinux file
size. E.g. for the sd card defconfig, I have:

vmlinux: 2369920 B
loader.bin: 1751250 B

> I'm at 1.88 MiB, with ext4 and without frame pointers.
> I also got rid of the EFI partition support, and a few I/O schedulers:
> 
> +CONFIG_PARTITION_ADVANCED=y
> +# CONFIG_EFI_PARTITION is not set
> +# CONFIG_MQ_IOSCHED_DEADLINE is not set
> +# CONFIG_MQ_IOSCHED_KYBER is not set

I have all of these disabled. The schedulers are forced disabled in the sdcard
defconfig.

I also noticed that it hugely depends on the compiler. Using the
buildroot-generated rv64 gcc 10, the kernel image goes up to almost 2 MB. So
for the kernel, I keep using the bootlin precompiled gcc 9.3:

https://toolchains.bootlin.com/

Just noticed that they now have a 10.2 version available. Will try it out.

> 
> Gr{oetje,eeting}s,
> 
> Geert
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-25 Thread Damien Le Moal
On 2020/11/25 20:00, Damien Le Moal wrote:
> On 2020/11/25 18:26, Geert Uytterhoeven wrote:
>> Hi Damien,
>>
>> On Wed, Nov 25, 2020 at 10:02 AM Damien Le Moal  
>> wrote:
>>> On 2020/11/25 17:51, Geert Uytterhoeven wrote:
>>>> On Wed, Nov 25, 2020 at 7:14 AM Damien Le Moal  
>>>> wrote:
>>>>> On 2020/11/25 3:57, Geert Uytterhoeven wrote:
>>>>>> There is no need to enable Virtual Terminal support in the Canaan
>>>>>> Kendryte K210 defconfigs, as no terminal devices are supported and
>>>>>> enabled.  Hence disable CONFIG_VT, and remove the no longer needed
>>>>>> override for CONFIG_VGA_CONSOLE.
>>>>>>
>>>>>> This reduces kernel size by ca. 65 KiB.
>>>>>
>>>>> Indeed, nice saving. Just tested, and all is good.
>>>>
>>>> I used my old script[1] to check the impact of disabling config options.
>>
>>>> I haven't done enough riscv kernel development yet to assess if I need
>>>> CONFIG_FRAME_POINTER or not.
>>>
>>> Disabling it significantly reduced code size for me. Since the series is 
>>> more
>>> stable now, it is not really needed, so I disabled it in the defconfig.
>>>
>>> I was just fiddling with CONFIG_UNIX98_PTYS. Disabling it is OK with the 
>>> simple
>>> busybox userspace (no telnet/xterm like app running). But it saves only 
>>> about
>>> 1KB with my toolchain (gcc 9.3). So I left that one enabled. I am surprised 
>>> that
>>> you see 16K size impact... How big is your image ?
>>>
>>> For me, it is 1.768 MB right now for the sdcard defconfig, with CONFIG_VT
>>> disabled and ext2 enabled.
>>
>> It might depend on how you measure.  "size" says 15 KiB impact for UNIX98
>> ptys, while bloat-o-meter reported less than 7 (my script uses "size").
> 
> I look at the size of the arch/riscv/boot/loader.bin file since that is what
> gets loaded in RAM and booted. It is significantly smaller than vmlinux file
> size. E.g. for the sd card defconfig, I have:
> 
> vmlinux: 2369920 B
> loader.bin: 1751250 B
> 
>> I'm at 1.88 MiB, with ext4 and without frame pointers.
>> I also got rid of the EFI partition support, and a few I/O schedulers:
>>
>> +CONFIG_PARTITION_ADVANCED=y
>> +# CONFIG_EFI_PARTITION is not set
>> +# CONFIG_MQ_IOSCHED_DEADLINE is not set
>> +# CONFIG_MQ_IOSCHED_KYBER is not set
> 
> I have all of these disabled. The schedulers are forced disabled in the sdcard
> defconfig.
> 
> I also noticed that it hugely depend on the compiler. Using the buildroot
> generated rv64 gcc 10, the kernel image goes up to almost 2 MB. So for the
> kernel, I keep using the bootlin precompiled gcc 9.3:
> 
> https://toolchains.bootlin.com/
> 
> Just noticed that they now have a 10.2 version available. Will try it out.

Correction: my PATH was actually pointing to the Fedora riscv64 gcc from the
distro rpm package which is version 10.2, not 9.3.

> 
>>
>> Gr{oetje,eeting}s,
>>
>> Geert
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] riscv: defconfig: k210: Disable CONFIG_VT

2020-11-25 Thread Damien Le Moal
On Wed, 2020-11-25 at 13:47 +0100, Geert Uytterhoeven wrote:
> Hi Damien,
> 
> On Wed, Nov 25, 2020 at 12:00 PM Damien Le Moal  wrote:
> > On 2020/11/25 18:26, Geert Uytterhoeven wrote:
> > > On Wed, Nov 25, 2020 at 10:02 AM Damien Le Moal  
> > > wrote:
> > > > On 2020/11/25 17:51, Geert Uytterhoeven wrote:
> > > > I was just fiddling with CONFIG_UNIX98_PTYS. Disabling it is OK with 
> > > > the simple
> > > > busybox userspace (no telnet/xterm like app running). But it saves only 
> > > > about
> > > > 1KB with my toolchain (gcc 9.3). So I left that one enabled. I am 
> > > > surprised that
> > > > you see 16K size impact... How big is your image ?
> > > > 
> > > > For me, it is 1.768 MB right now for the sdcard defconfig, with 
> > > > CONFIG_VT
> > > > disabled and ext2 enabled.
> > > 
> > > It might depend on how you measure.  "size" says 15 KiB impact for UNIX98
> > > ptys, while bloat-o-meter reported less than 7 (my script uses "size").
> > 
> > I look at the size of the arch/riscv/boot/loader.bin file since that is what
> > gets loaded in RAM and booted. It is significantly smaller than vmlinux file
> > size. E.g. for the sd card defconfig, I have:
> > 
> > vmlinux: 2369920 B
> > loader.bin: 1751250 B
> 
> Doesn't loader.bin lack bss?
> Bss does consume RAM, so you do want to take that into account, too.

Indeed. Good point. Thanks !

> 
> Gr{oetje,eeting}s,
> 
> Geert
> 

-- 
Damien Le Moal
Western Digital


Re: [PATCH] block: fix possible bd_size_lock deadlock

2021-03-12 Thread Damien Le Moal
On 2021/03/13 4:37, Jens Axboe wrote:
> On 3/11/21 5:11 AM, yanfei...@windriver.com wrote:
>> From: Yanfei Xu 
>>
>> bd_size_lock spinlock could be taken in block softirq, thus we should
>> disable the softirq before taking the lock.
>>
>> WARNING: inconsistent lock state
>> 5.12.0-rc2-syzkaller #0 Not tainted
>> 
>> inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-R} usage.
>> kworker/u4:0/7 [HC0[0]:SC1[1]:HE0:SE0] takes:
>> 8f87826c (&inode->i_size_seqcount){+.+-}-{0:0}, at:
>> end_bio_bh_io_sync+0x38/0x54 fs/buffer.c:3006
>> {SOFTIRQ-ON-W} state was registered at:
>>   lock_acquire.part.0+0xf0/0x41c kernel/locking/lockdep.c:5510
>>   lock_acquire+0x6c/0x74 kernel/locking/lockdep.c:5483
>>   do_write_seqcount_begin_nested include/linux/seqlock.h:520 [inline]
>>   do_write_seqcount_begin include/linux/seqlock.h:545 [inline]
>>   i_size_write include/linux/fs.h:863 [inline]
>>   set_capacity+0x13c/0x1f8 block/genhd.c:50
>>   brd_alloc+0x130/0x180 drivers/block/brd.c:401
>>   brd_init+0xcc/0x1e0 drivers/block/brd.c:500
>>   do_one_initcall+0x8c/0x59c init/main.c:1226
>>   do_initcall_level init/main.c:1299 [inline]
>>   do_initcalls init/main.c:1315 [inline]
>>   do_basic_setup init/main.c:1335 [inline]
>>   kernel_init_freeable+0x2cc/0x330 init/main.c:1537
>>   kernel_init+0x10/0x120 init/main.c:1424
>>   ret_from_fork+0x14/0x20 arch/arm/kernel/entry-common.S:158
>>   0x0
>> irq event stamp: 2783413
>> hardirqs last  enabled at (2783412): [<802011ec>]
>> __do_softirq+0xf4/0x7ac kernel/softirq.c:329
>> hardirqs last disabled at (2783413): [<8277d260>]
>> __raw_read_lock_irqsave include/linux/rwlock_api_smp.h:157 [inline]
>> hardirqs last disabled at (2783413): [<8277d260>]
>> _raw_read_lock_irqsave+0x84/0x88 kernel/locking/spinlock.c:231
>> softirqs last  enabled at (2783410): [<826b5050>] spin_unlock_bh
>> include/linux/spinlock.h:399 [inline]
>> softirqs last  enabled at (2783410): [<826b5050>]
>> batadv_nc_purge_paths+0x10c/0x148 net/batman-adv/network-coding.c:467
>> softirqs last disabled at (2783411): [<8024ddfc>] do_softirq_own_stack
>> include/asm-generic/softirq_stack.h:10 [inline]
>> softirqs last disabled at (2783411): [<8024ddfc>] do_softirq
>> kernel/softirq.c:248 [inline]
>> softirqs last disabled at (2783411): [<8024ddfc>] do_softirq+0xd8/0xe4
>> kernel/softirq.c:235
>>
>> other info that might help us debug this:
>>  Possible unsafe locking scenario:
>>
>>CPU0
>>
>>   lock(&inode->i_size_seqcount);
>>   
>> lock(&inode->i_size_seqcount);
>>
>>  *** DEADLOCK ***
>>
>> 3 locks held by kworker/u4:0/7:
>>  #0: 88c622a8 ((wq_completion)bat_events){+.+.}-{0:0}, at: set_work_data
>> kernel/workqueue.c:615 [inline]
>>  #0: 88c622a8 ((wq_completion)bat_events){+.+.}-{0:0}, at:
>> set_work_pool_and_clear_pending kernel/workqueue.c:643 [inline]
>>  #0: 88c622a8 ((wq_completion)bat_events){+.+.}-{0:0}, at:
>> process_one_work+0x214/0x998 kernel/workqueue.c:2246
>>  #1: 85147ef8
>> ((work_completion)(&(&bat_priv->nc.work)->work)){+.+.}-{0:0}, at:
>> set_work_data kernel/workqueue.c:615 [inline]
>>  #1: 85147ef8
>> ((work_completion)(&(&bat_priv->nc.work)->work)){+.+.}-{0:0}, at:
>> set_work_pool_and_clear_pending kernel/workqueue.c:643 [inline]
>>  #1: 85147ef8
>> ((work_completion)(&(&bat_priv->nc.work)->work)){+.+.}-{0:0}, at:
>> process_one_work+0x214/0x998 kernel/workqueue.c:2246
>>  #2: 8f878010 (&ni->size_lock){...-}-{2:2}, at:
>> ntfs_end_buffer_async_read+0x6c/0x558 fs/ntfs/aops.c:66
> 
> Damien? We have that revert queued up for this for 5.12, but looking
> at that, the state before that was kind of messy too.

Indeed... I was thinking about this and I think I am with Christoph on this:
drivers should not call set_capacity() from command completion context. I think
the best thing to do would be to fix the drivers that do that, but that may not
be RC material ?

Looking into the details of this case, it is slightly different though.
set_capacity() is not called here from soft IRQ context. It looks like a regular
initialization, but one that happens way too early in the boot process, when a
secondary core is being initialized with IRQs not yet enabled... I think. And
the warnings come from i_size_write() calling preempt_disable() rather than
from set_capacity()'s use of spin_lock(&bdev->bd_size_lock).

I wonder how it is possible to have brd being initialized so early.
I am not sure how to fix that. It looks like arm arch code territory.

For now, we could revert the revert, as I do not think that Yanfei's patch is
enough: completions may come from hard IRQ context too, which is not covered by
the spin_lock_bh() variants (cf. a similar problem we are facing in scsi
completion [1]). I do not have any good idea how to proceed though.

[1]
https://lore.kernel.org/linux-scsi/ph0pr04mb7416c8330459e92d8aa21a889b...@ph0pr04mb7416.namprd04.prod.outlook.com/T/#t

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 1/1] scsi: sd: use max_xfer_blocks for set rw_max if max_xfer_blocks is available

2021-01-20 Thread Damien Le Moal
 /* Some devices report a maximum block count for READ/WRITE requests. */
>> -dev_max = min_not_zero(dev_max, sdkp->max_xfer_blocks);
>> q->limits.max_dev_sectors = logical_to_sectors(sdp, dev_max);
>>
>> -if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
>> +if (sd_validate_max_xfer_size(sdkp, dev_max)) {
>> +q->limits.io_opt = 0;
>> +rw_max = logical_to_sectors(sdp, sdkp->max_xfer_blocks);
>> +q->limits.max_dev_sectors = rw_max;
>> +} else if (sd_validate_opt_xfer_size(sdkp, dev_max)) {

This does not look correct to me. This renders the device reported
opt_xfer_blocks useless.

The unmodified code sets dev_max to the min of SD_MAX_XFER_BLOCKS or
SD_DEF_XFER_BLOCKS and the device-reported max_xfer_blocks. The result is used
as the device max_dev_sectors queue limit, which in turn is used to set the
max_hw_sectors queue limit, accounting for the adapter limits too.

opt_xfer_blocks, if it is valid, will be used to set the io_opt queue limit,
which is a hint. This hint is used to optimize the "soft" max_sectors command
limit used by the block layer to limit command size, if the value of
opt_xfer_blocks is smaller than the limit initially set with max_xfer_blocks.

So if max_sectors ends up being too small for your device, it is likely because
the device itself is reporting an opt_xfer_blocks value that is too small for
its own good. The max_sectors limit can be manually increased with "echo xxx >
/sys/block/sdX/queue/max_sectors_kb". A udev rule can be used to handle this
automatically if needed.

But to get a saner default for that device, I do not think that this patch is
the right solution. Ideally, the device peculiarity should be handled with a
quirk, but that mechanism is not used in scsi. So besides the udev rule trick,
I am not sure what the right approach is here.

>> q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
>> rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
>> } else {
>> -- 
>> 2.29.0
>>
>>
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] bio: limit bio max size.

2021-01-20 Thread Damien Le Moal
 unsigned len)
>   if (bio->bi_vcnt >= bio->bi_max_vecs)
>   return true;
>  
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > bio->bi_max_size - len)
>   return true;
>  
>   return false;
> @@ -482,20 +482,13 @@ extern struct bio_vec *bvec_alloc(gfp_t, int, unsigned 
> long *, mempool_t *);
>  extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
>  extern unsigned int bvec_nr_vecs(unsigned short idx);
>  extern const char *bio_devname(struct bio *bio, char *buffer);
> -
> -#define bio_set_dev(bio, bdev)   \
> -do { \
> - if ((bio)->bi_disk != (bdev)->bd_disk)  \
> - bio_clear_flag(bio, BIO_THROTTLED);\
> - (bio)->bi_disk = (bdev)->bd_disk;   \
> - (bio)->bi_partno = (bdev)->bd_partno;   \
> - bio_associate_blkg(bio);\
> -} while (0)
> +extern void bio_set_dev(struct bio *bio, struct block_device *bdev);
>  
>  #define bio_copy_dev(dst, src)   \
>  do { \
>   (dst)->bi_disk = (src)->bi_disk;\
>   (dst)->bi_partno = (src)->bi_partno;\
> + (dst)->bi_max_size = (src)->bi_max_size;\
>   bio_clone_blkg_association(dst, src);   \
>  } while (0)
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..e5dd5b7d8fc1 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -270,6 +270,7 @@ struct bio {
>*/
>  
>   unsigned short  bi_max_vecs;/* max bvl_vecs we can hold */
> + unsigned intbi_max_size;/* max data size we can hold */
>  
>   atomic_t__bi_cnt;   /* pin count */

This modification comes at the cost of increasing the bio structure size to
simply tell the block layer "do not delay BIO splitting"...

I think there is a much simpler approach. What about:

1) Use a request queue flag to indicate "limit BIO size"
2) Modify __bio_try_merge_page() to look at that flag and disallow page merging
if the bio size exceeds blk_queue_get_max_sectors(), or more ideally a version
of it that takes into account the bio start sector.
3) Set the "limit bio size" queue flag in the drivers of the devices that
benefit from this change. Eventually, that could also be controlled through
sysfs.

With such change, you will get the same result without having to increase the
BIO structure size.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] bio: limit bio max size

2021-01-21 Thread Damien Le Moal
the general case of devices that do not care about limiting the bio size
(like now), this will add one boolean evaluation (queue flag test). That's it.
For your case, sure you now have 2 boolean evals instead of one. But that must
be put in perspective with the cost of increasing the bio size.

> 
> bool __bio_try_merge_page(struct bio *bio, struct page *page,
>   unsigned int len, unsigned int off, bool *same_page)
> {
>   ...
>   if (page_is_mergeable(bv, page, len, off, same_page)) {
>   if (bio->bi_iter.bi_size > UINT_MAX - len) {
>   *same_page = false;
>   return false;
>   }
> 
> + if (blk_queue_limit_bio_max_size(bio) &&
> +(bio->bi_iter.bi_size >
> blk_queue_get_bio_max_size(bio) - len)) {
> + *same_page = false;
> + return false;
> + }
> 
>   bv->bv_len += len;
>   bio->bi_iter.bi_size += len;
>   return true;
>   }
>   ...
> }
> 
> 
> static inline bool bio_full(struct bio *bio, unsigned len)
> {
>   ...
>   if (bio->bi_iter.bi_size > UINT_MAX - len)
>   return true;
> 
> + if (blk_queue_limit_bio_max_size(bio) &&
> +(bio->bi_iter.bi_size > blk_queue_get_bio_max_size(bio) - len))
> + return true;
>   ...
> }
> 
> 
> Page merge is CPU-bound job as you said.
> How about below with adding of bi_max_size in bio?

I am not a fan of adding a bio field that is used in only one place.
This is only my opinion and I will let others comment about this, but personally
I would rather do something like this:

#define blk_queue_limit_bio_merge_size(q) \
	test_bit(QUEUE_FLAG_LIMIT_MERGE, &(q)->queue_flags)

static inline unsigned int bio_max_merge_size(struct bio *bio)
{
	struct request_queue *q = bio->bi_disk->queue;

	if (blk_queue_limit_bio_merge_size(q))
		return blk_queue_get_max_sectors(q, bio_op(bio)) << SECTOR_SHIFT;
	return UINT_MAX;
}

and use that helper in __bio_try_merge_page(), e.g.:

	if (bio->bi_iter.bi_size > bio_max_merge_size(bio) - len) {
		*same_page = false;
		return false;
	}

No need to change the bio struct.

If you measure performance with and without this change on nullblk, you can
verify if it has any impact for regular devices. And for your use case, that
should give you the same performance.

> 
> bool __bio_try_merge_page(struct bio *bio, struct page *page,
>   unsigned int len, unsigned int off, bool *same_page)
> {
>   ...
>   if (page_is_mergeable(bv, page, len, off, same_page)) {
> - if (bio->bi_iter.bi_size > UINT_MAX - len) {
> + if (bio->bi_iter.bi_size > bio->bi_max_size - len) {
>   *same_page = false;
>   return false;
>   }
> 
>   bv->bv_len += len;
>   bio->bi_iter.bi_size += len;
>   return true;
>   }
>   ...
> }
> 
> 
> static inline bool bio_full(struct bio *bio, unsigned len)
> {
>   ...
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > bio->bi_max_size - len)
>   return true;
>   ...
> }
> 
> +void bio_set_dev(struct bio *bio, struct block_device *bdev)
> +{
> + if (bio->bi_disk != bdev->bd_disk)
> + bio_clear_flag(bio, BIO_THROTTLED);
> +
> + bio->bi_disk = bdev->bd_disk;
> + bio->bi_partno = bdev->bd_partno;
> + if (blk_queue_limit_bio_max_size(bio))
> + bio->bi_max_size = blk_queue_get_bio_max_size(bio);
> +
> + bio_associate_blkg(bio);
> +}
> +EXPORT_SYMBOL(bio_set_dev);
> 
> > -- 
> > Damien Le Moal
> > Western Digital Research
> 
> ---
> Changheun Lee
> Samsung Electronics

-- 
Damien Le Moal
Western Digital


Re: [PATCH 1/1] scsi: sd: use max_xfer_blocks for set rw_max if max_xfer_blocks is available

2021-01-21 Thread Damien Le Moal
l sd_validate_max_xfer_size(struct scsi_disk *sdkp,
>> const char *name,
>> unsigned int xfer_blocks,
>> unsigned int dev_max)
>>
>> To allow checking both opt_xfer_blocks and max_xfer_blocks ?
>>
>>>> +
>>>> /*
>>>> * Determine the device's preferred I/O size for reads and writes
>>>> * unless the reported value is unreasonably small, large, not a
>>>> @@ -3233,12 +3280,13 @@ static int sd_revalidate_disk(struct gendisk *disk)
>>>>
>>>> /* Initial block count limit based on CDB TRANSFER LENGTH field size. */
>>>> dev_max = sdp->use_16_for_rw ? SD_MAX_XFER_BLOCKS : SD_DEF_XFER_BLOCKS;
>>
>> This looks weird: no indentation. Care to resend ?
>>
>>>> -
>>>> -  /* Some devices report a maximum block count for READ/WRITE requests. */
>>>> -  dev_max = min_not_zero(dev_max, sdkp->max_xfer_blocks);
>>>> q->limits.max_dev_sectors = logical_to_sectors(sdp, dev_max);
>>>>
>>>> -  if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
>>>> +  if (sd_validate_max_xfer_size(sdkp, dev_max)) {
>>>> +  q->limits.io_opt = 0;
>>>> +  rw_max = logical_to_sectors(sdp, sdkp->max_xfer_blocks);
>>>> +  q->limits.max_dev_sectors = rw_max;
>>>> +  } else if (sd_validate_opt_xfer_size(sdkp, dev_max)) {
>>
>> This does not look correct to me. This renders the device reported
>> opt_xfer_blocks useless.
>>
>> The unmodified code sets dev_max to the min of SD_MAX_XFER_BLOCKS or
>> SD_DEF_XFER_BLOCKS and of the device reported max_xfer_blocks. The result of
>> this is used as the device max_dev_sectors queue limit, which in turn is 
>> used to
>> set the max_hw_sectors queue limit accounting for the adapter limits too.
>>
>> opt_xfer_blocks, if it is valid, will be used to set the io_opt queue limit,
>> which is a hint. This hint is used to optimize the "soft" max_sectors command
>> limit used by the block layer to limit command size if the value of
>> opt_xfer_blocks is smaller than the limit initially set with max_xfer_blocks.
>>
>> So if for your device max_sectors end up being too small, it is likely 
>> because
>> the device itself is reporting an opt_xfer_blocks value that is too small for
>> its own good. The max_sectors limit can be manually increased with "echo xxx 
>> >
>> /sys/block/sdX/queue/max_sectors_kb". A udev rule can be used to handle this
>> autmatically if needed.
>>
>> But to get a saner default for that device, I do not think that this patch is
>> the right solution. Ideally, the device peculiarity should be handled with a
>> quirk, but that is not used in scsi. So beside the udev rule trick, I am not
>> sure what the right approach is here.
>>
> 
> This approach is for using sdkp->max_xfer_blocks as a rw_max.
> There are no way to use it now when sdkp->opt_xfer_blocks is valid.
> In my case, scsi device reports both of sdkp->max_xfer_blocks, and
> sdkp->opt_xfer_blocks.
> 
> How about set larger valid value between sdkp->max_xfer_blocks,
> and sdkp->opt_xfer_blocks to rw_max?

Again, if your device reports an opt_xfer_blocks value that is too small for its
own good, that is a problem with this device. The solution for that is not to
change something that will affect *all* other storage devices, including those
with a perfectly valid opt_xfer_blocks value.

I think that the solution should be at the LLD level, for that device only. But
I am not sure how to communicate a quirk for opt_xfer_blocks back to the generic
sd driver. You should explore a solution like that. Others may have ideas about
this too. Wait for more comments.

> 
>>>> q->limits.io_opt = logical_to_bytes(sdp, sdkp->opt_xfer_blocks);
>>>> rw_max = logical_to_sectors(sdp, sdkp->opt_xfer_blocks);
>>>> } else {
>>>> -- 
>>>> 2.29.0
>>>>
>>>>
>>>
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v3 1/2] bio: limit bio max size

2021-01-26 Thread Damien Le Moal
v_bio_end_io_simple;
>   bio.bi_ioprio = iocb->ki_ioprio;
>  
> - ret = bio_iov_iter_get_pages(&bio, iter);
> + ret = bio_iov_iter_get_pages(&bio, iter, true);
>   if (unlikely(ret))
>   goto out;
>   ret = bio.bi_iter.bi_size;
> @@ -397,7 +397,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_end_io = blkdev_bio_end_io;
>   bio->bi_ioprio = iocb->ki_ioprio;
>  
> - ret = bio_iov_iter_get_pages(bio, iter);
> + ret = bio_iov_iter_get_pages(bio, iter, is_sync);
>   if (unlikely(ret)) {
>   bio->bi_status = BLK_STS_IOERR;
>   bio_endio(bio);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ea1e8f696076..5105982a9bf8 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   bio->bi_private = dio;
>   bio->bi_end_io = iomap_dio_bio_end_io;
>  
> - ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
> + ret = bio_iov_iter_get_pages(bio, dio->submit.iter,
> + is_sync_kiocb(dio->iocb));
>   if (unlikely(ret)) {
>       /*
>* We have to stop part way through an IO. We must fall
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bec47f2d074b..c95ac37f9305 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -690,7 +690,7 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, 
> struct iov_iter *from)
>   if (iocb->ki_flags & IOCB_DSYNC)
>   bio->bi_opf |= REQ_FUA;
>  
> - ret = bio_iov_iter_get_pages(bio, from);
> + ret = bio_iov_iter_get_pages(bio, from, is_sync_kiocb(iocb));
>   if (unlikely(ret))
>   goto out_release;
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 676870b2c88d..fa3a503b955c 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -472,7 +472,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
> *page,
>   unsigned int len, unsigned int off, bool *same_page);
>  void __bio_add_page(struct bio *bio, struct page *page,
>   unsigned int len, unsigned int off);
> -int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
> +int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter, bool 
> sync);
>  void bio_release_pages(struct bio *bio, bool mark_dirty);
>  extern void bio_set_pages_dirty(struct bio *bio);
>  extern void bio_check_pages_dirty(struct bio *bio);
> 
> 
> Thanks,
> Ming
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v3 1/2] bio: limit bio max size

2021-01-26 Thread Damien Le Moal
On 2021/01/26 15:07, Ming Lei wrote:
> On Tue, Jan 26, 2021 at 04:06:06AM +0000, Damien Le Moal wrote:
>> On 2021/01/26 12:58, Ming Lei wrote:
>>> On Tue, Jan 26, 2021 at 10:32:34AM +0900, Changheun Lee wrote:
>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>> but sometimes it would lead to inefficient behaviors.
>>>> in case of large chunk direct I/O, - 32MB chunk read in user space -
>>>> all pages for 32MB would be merged to a bio structure if the pages
>>>> physical addresses are contiguous. it makes some delay to submit
>>>> until merge complete. bio max size should be limited to a proper size.
>>>>
>>>> When 32MB chunk read with direct I/O option is coming from userspace,
>>>> kernel behavior is below now. it's timeline.
>>>>
>>>>  | bio merge for 32MB. total 8,192 pages are merged.
>>>>  | total elapsed time is over 2ms.
>>>>  |-- ... --->|
>>>>  | 8,192 pages merged a 
>>>> bio.
>>>>  | at this time, first bio 
>>>> submit is done.
>>>>  | 1 bio is split to 32 
>>>> read request and issue.
>>>>  |--->
>>>>   |--->
>>>>|--->
>>>>   ..
>>>>
>>>> |--->
>>>> 
>>>> |--->|
>>>>   total 19ms elapsed to complete 32MB read done 
>>>> from device. |
>>>>
>>>> If bio max size is limited with 1MB, behavior is changed below.
>>>>
>>>>  | bio merge for 1MB. 256 pages are merged for each bio.
>>>>  | total 32 bio will be made.
>>>>  | total elapsed time is over 2ms. it's same.
>>>>  | but, first bio submit timing is fast. about 100us.
>>>>  |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
>>>>   | 256 pages merged a bio.
>>>>   | at this time, first bio submit is done.
>>>>   | and 1 read request is issued for 1 bio.
>>>>   |--->
>>>>|--->
>>>> |--->
>>>>   ..
>>>>  |--->
>>>>   |--->|
>>>> total 17ms elapsed to complete 32MB read done from device. |
>>>>
>>>> As a result, read request issue timing is faster if bio max size is 
>>>> limited.
>>>> Current kernel behavior with multipage bvec, super large bio can be 
>>>> created.
>>>> And it lead to delay first I/O request issue.
>>>>
>>>> Signed-off-by: Changheun Lee 
>>>> ---
>>>>  block/bio.c| 17 -
>>>>  include/linux/bio.h|  4 +++-
>>>>  include/linux/blkdev.h |  3 +++
>>>>  3 files changed, 22 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 1f2cc1fbe283..ec0281889045 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -287,6 +287,21 @@ void bio_init(struct bio *bio, struct bio_vec *table,
>>>>  }
>>>>  EXPORT_SYMBOL(bio_init);
>>>>  
>>>> +unsigned int bio_max_size(struct bio *bio)
>>>> +{
>>>> +  struct request_queue *q;
>>>> +
>>>> +  if (!bio->bi_disk)
>>>> +  return UINT_MAX;
>>>> +
>>>> +  q = bio->bi_disk->queue;
>>>> +  if (!blk_queue_limit_bio_size(q))
>>>> +  return UINT_MAX;
>>>> +
>>>> +  return blk_queue_get_max_sectors(q, bio_op(bio)) << SECTOR_SHIFT;
>>>> +}
>>>> +EXPORT_SYMBOL(bio_max_size);
>>>> +
>>>>  /**
>>>>   * bio_reset - reinitialize a bio
>>>>   * @

Re: [PATCH v3 1/2] bio: limit bio max size

2021-01-26 Thread Damien Le Moal
bi_vcnt >= bio->bi_max_vecs)
>   return true;
>  
> - if (bio->bi_iter.bi_size > UINT_MAX - len)
> + if (bio->bi_iter.bi_size > bio_max_size(bio) - len)
>   return true;
>  
>   return false;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f94ee3089e01..3aeab9e7e97b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -621,6 +621,7 @@ struct request_queue {
>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27  /* record rq->alloc_time_ns */
>  #define QUEUE_FLAG_HCTX_ACTIVE   28  /* at least one blk-mq hctx is 
> active */
>  #define QUEUE_FLAG_NOWAIT   29   /* device supports NOWAIT */
> +#define QUEUE_FLAG_LIMIT_BIO_SIZE 30 /* limit bio size */
>  
>  #define QUEUE_FLAG_MQ_DEFAULT((1 << QUEUE_FLAG_IO_STAT) |
> \
>(1 << QUEUE_FLAG_SAME_COMP) |  \
> @@ -667,6 +668,8 @@ bool blk_queue_flag_test_and_set(unsigned int flag, 
> struct request_queue *q);
>  #define blk_queue_fua(q) test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
>  #define blk_queue_registered(q)  test_bit(QUEUE_FLAG_REGISTERED, 
> &(q)->queue_flags)
>  #define blk_queue_nowait(q)  test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
> +#define blk_queue_limit_bio_size(q)  \
> + test_bit(QUEUE_FLAG_LIMIT_BIO_SIZE, &(q)->queue_flags)
>  
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v3 1/2] bio: limit bio max size

2021-01-26 Thread Damien Le Moal
On 2021/01/27 9:36, Changheun Lee wrote:
>>> +
>>>  /**
>>>   * bio_reset - reinitialize a bio
>>>   * @bio:   bio to reset
>>> @@ -877,7 +892,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
>>> *page,
>>> struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>  
>>> if (page_is_mergeable(bv, page, len, off, same_page)) {
>>> -   if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>> +   if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>>> *same_page = false;
>>> return false;
>>> }
>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>> index 1edda614f7ce..cdb134ca7bf5 100644
>>> --- a/include/linux/bio.h
>>> +++ b/include/linux/bio.h
>>> @@ -100,6 +100,8 @@ static inline void *bio_data(struct bio *bio)
>>> return NULL;
>>>  }
>>>  
>>> +extern unsigned int bio_max_size(struct bio *);
>>
>> No need for extern.
> 
> It's just for compile warning in my test environment.
> I'll remove it too. But I think compile warning could be in the other
> .c file which includes bio.h. Is it OK?

Hmmm... not having extern should not generate a compilation warning. There are
tons of functions declared without extern in header files in the kernel. What
compiler are you using ?
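
E.g. the declaration in bio.h is strictly equivalent without the keyword:

	unsigned int bio_max_size(struct bio *bio);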


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] dm zoned: select CONFIG_CRC32

2021-01-03 Thread Damien Le Moal
On 2021/01/04 6:41, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Without crc32 support, this driver fails to link:
> 
> arm-linux-gnueabi-ld: drivers/md/dm-zoned-metadata.o: in function 
> `dmz_write_sb':
> dm-zoned-metadata.c:(.text+0xe98): undefined reference to `crc32_le'
> arm-linux-gnueabi-ld: drivers/md/dm-zoned-metadata.o: in function 
> `dmz_check_sb':
> dm-zoned-metadata.c:(.text+0x7978): undefined reference to `crc32_le'
> 
> Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/md/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
> index b7e2d914..a67b9ed3ca89 100644
> --- a/drivers/md/Kconfig
> +++ b/drivers/md/Kconfig
> @@ -622,6 +622,7 @@ config DM_ZONED
>   tristate "Drive-managed zoned block device target support"
>   depends on BLK_DEV_DM
>   depends on BLK_DEV_ZONED
> + select CRC32
>   help
> This device-mapper target takes a host-managed or host-aware zoned
> block device and exposes most of its capacity as a regular block
> 

Reviewed-by: Damien Le Moal 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] zonefs: select CONFIG_CRC32

2021-01-03 Thread Damien Le Moal
On 2021/01/04 6:44, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> When CRC32 is disabled, zonefs cannot be linked:
> 
> ld: fs/zonefs/super.o: in function `zonefs_fill_super':
> 
> Add a Kconfig 'select' statement for it.
> 
> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/zonefs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/zonefs/Kconfig b/fs/zonefs/Kconfig
> index ef2697b78820..827278f937fe 100644
> --- a/fs/zonefs/Kconfig
> +++ b/fs/zonefs/Kconfig
> @@ -3,6 +3,7 @@ config ZONEFS_FS
>   depends on BLOCK
>   depends on BLK_DEV_ZONED
>   select FS_IOMAP
> + select CRC32
>   help
> zonefs is a simple file system which exposes zones of a zoned block
> device (e.g. host-managed or host-aware SMR disk drives) as files.
> 

Applied. Thanks Arnd !

-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v4 1/3] block: export bio_map_kern()

2021-01-04 Thread Damien Le Moal
On 2021/01/04 19:48, SelvaKumar S wrote:
> Export bio_map_kern() so that copy offload emulation can use
> it to add vmalloced memory to bio.
> 
> Signed-off-by: SelvaKumar S 
> ---
>  block/blk-map.c| 3 ++-
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 21630dccac62..50d61475bb68 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -378,7 +378,7 @@ static void bio_map_kern_endio(struct bio *bio)
>   *   Map the kernel address into a bio suitable for io to a block
>   *   device. Returns an error pointer in case of error.
>   */
> -static struct bio *bio_map_kern(struct request_queue *q, void *data,
> +struct bio *bio_map_kern(struct request_queue *q, void *data,
>   unsigned int len, gfp_t gfp_mask)
>  {
>   unsigned long kaddr = (unsigned long)data;
> @@ -428,6 +428,7 @@ static struct bio *bio_map_kern(struct request_queue *q, 
> void *data,
>   bio->bi_end_io = bio_map_kern_endio;
>   return bio;
>  }
> +EXPORT_SYMBOL(bio_map_kern);

Simple copy support is block layer code, so I do not think you need this.
You only need to remove the static declaration of the function.

>  
>  static void bio_copy_kern_endio(struct bio *bio)
>  {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 070de09425ad..81f9e7bec16c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -936,6 +936,8 @@ extern int blk_rq_map_user(struct request_queue *, struct 
> request *,
>  struct rq_map_data *, void __user *, unsigned long,
>  gfp_t);
>  extern int blk_rq_unmap_user(struct bio *);
> +extern struct bio *bio_map_kern(struct request_queue *q, void *data,
> + unsigned int len, gfp_t gfp_mask);
>  extern int blk_rq_map_kern(struct request_queue *, struct request *, void *, 
> unsigned int, gfp_t);
>  extern int blk_rq_map_user_iov(struct request_queue *, struct request *,
>  struct rq_map_data *, const struct iov_iter *,
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v4 2/3] block: add simple copy support

2021-01-04 Thread Damien Le Moal
y(bdev, crange.nr_range, rlist, bdev, dest,
> + GFP_KERNEL);
> +out:
> + kfree(rlist);
> + return ret;
> +}
> +
>  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
>   unsigned long arg)
>  {
> @@ -458,6 +499,8 @@ static int blkdev_common_ioctl(struct block_device *bdev, 
> fmode_t mode,
>   case BLKSECDISCARD:
>   return blk_ioctl_discard(bdev, mode, arg,
>   BLKDEV_DISCARD_SECURE);
> + case BLKCOPY:
> + return blk_ioctl_copy(bdev, mode, arg);
>   case BLKZEROOUT:
>   return blk_ioctl_zeroout(bdev, mode, arg);
>   case BLKREPORTZONE:
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1edda614f7ce..164313bdfb35 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -71,6 +71,7 @@ static inline bool bio_has_data(struct bio *bio)
>  static inline bool bio_no_advance_iter(const struct bio *bio)
>  {
>   return bio_op(bio) == REQ_OP_DISCARD ||
> +bio_op(bio) == REQ_OP_COPY ||
>  bio_op(bio) == REQ_OP_SECURE_ERASE ||
>  bio_op(bio) == REQ_OP_WRITE_SAME ||
>  bio_op(bio) == REQ_OP_WRITE_ZEROES;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..d4d11e9ff814 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -380,6 +380,8 @@ enum req_opf {
>   REQ_OP_ZONE_RESET   = 15,
>   /* reset all the zone present on the device */
>   REQ_OP_ZONE_RESET_ALL   = 17,
> + /* copy ranges within device */
> + REQ_OP_COPY = 19,
>  
>   /* SCSI passthrough using struct scsi_request */
>   REQ_OP_SCSI_IN  = 32,
> @@ -506,6 +508,11 @@ static inline bool op_is_discard(unsigned int op)
>   return (op & REQ_OP_MASK) == REQ_OP_DISCARD;
>  }
>  
> +static inline bool op_is_copy(unsigned int op)
> +{
> + return (op & REQ_OP_MASK) == REQ_OP_COPY;
> +}
> +
>  /*
>   * Check if a bio or request operation is a zone management operation, with
>   * the exception of REQ_OP_ZONE_RESET_ALL which is treated as a special case
> @@ -565,4 +572,12 @@ struct blk_rq_stat {
>   u64 batch;
>  };
>  
> +struct blk_copy_payload {
> + sector_tdest;
> + int copy_range;
> + int copy_size;
> + int err;
> + struct  range_entry range[];
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 81f9e7bec16c..4c7e861e57e4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -340,10 +340,14 @@ struct queue_limits {
>   unsigned intmax_zone_append_sectors;
>   unsigned intdiscard_granularity;
>   unsigned intdiscard_alignment;
> + unsigned intcopy_offload;
> + unsigned intmax_copy_sectors;
>  
>   unsigned short  max_segments;
>   unsigned short  max_integrity_segments;
>   unsigned short  max_discard_segments;
> + unsigned short  max_copy_range_sectors;
> + unsigned short  max_copy_nr_ranges;
>  
>   unsigned char   misaligned;
>   unsigned char   discard_misaligned;
> @@ -625,6 +629,7 @@ struct request_queue {
>  #define QUEUE_FLAG_RQ_ALLOC_TIME 27  /* record rq->alloc_time_ns */
>  #define QUEUE_FLAG_HCTX_ACTIVE   28  /* at least one blk-mq hctx is 
> active */
>  #define QUEUE_FLAG_NOWAIT   29   /* device supports NOWAIT */
> +#define QUEUE_FLAG_COPY  30  /* supports copy */

I think this should be called QUEUE_FLAG_SIMPLE_COPY to indicate more precisely
the type of copy supported. SCSI XCOPY is more advanced...

>  
>  #define QUEUE_FLAG_MQ_DEFAULT((1 << QUEUE_FLAG_IO_STAT) |
> \
>(1 << QUEUE_FLAG_SAME_COMP) |  \
> @@ -647,6 +652,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, 
> struct request_queue *q);
>  #define blk_queue_io_stat(q) test_bit(QUEUE_FLAG_IO_STAT, &(q)->queue_flags)
>  #define blk_queue_add_random(q)  test_bit(QUEUE_FLAG_ADD_RANDOM, 
> &(q)->queue_flags)
>  #define blk_queue_discard(q) test_bit(QUEUE_FLAG_DISCARD, &(q)->queue_flags)
> +#define blk_queue_copy(q)test_bit(QUEUE_FLAG_COPY, &(q)->queue_flags)
>  #define blk_queue_zone_resetall(q)   \
>   test_bit(QUEUE_FLAG_ZONE_RESETALL, &(q)->queue_flags)
>  #define blk_queue_secure_erase(q) \
> @@ -1061,6 +1067,9 @@ static inline unsigned int 
> blk_queue_get_max_sectors(struct request_queue *q,
>   return min(q->limits.max_discard_sectors,
>  UINT_MAX >> SECTOR_SHIFT);
>  
> + if (unlikely(op == REQ_OP_COPY))
> + return q->limits.max_copy_sectors;
> +
>   if (unlikely(op == REQ_OP_WRITE_SAME))
>   return q->limits.max_write_same_sectors;
>  
> @@ -1335,6 +1344,10 @@ extern int __blkdev_issue_discard(struct block_device 
> *bdev, sector_t sector,
>   sector_t nr_sects, gfp_t gfp_mask, int flags,
>   struct bio **biop);
>  
> +extern int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
> + struct range_entry *src_rlist, struct block_device *dest_bdev,
> + sector_t dest, gfp_t gfp_mask);
> +
>  #define BLKDEV_ZERO_NOUNMAP  (1 << 0)  /* do not free blocks */
>  #define BLKDEV_ZERO_NOFALLBACK   (1 << 1)  /* don't write explicit 
> zeroes */
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index f44eb0a04afd..5cadb176317a 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -64,6 +64,18 @@ struct fstrim_range {
>   __u64 minlen;
>  };
>  
> +struct range_entry {
> + __u64 src;
> + __u64 len;
> +};
> +
> +struct copy_range {
> + __u64 dest;
> + __u64 nr_range;
> + __u64 range_list;
> + __u64 rsvd;
> +};
> +
>  /* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions 
> */
>  #define FILE_DEDUPE_RANGE_SAME   0
>  #define FILE_DEDUPE_RANGE_DIFFERS1
> @@ -184,6 +196,7 @@ struct fsxattr {
>  #define BLKSECDISCARD _IO(0x12,125)
>  #define BLKROTATIONAL _IO(0x12,126)
>  #define BLKZEROOUT _IO(0x12,127)
> +#define BLKCOPY _IOWR(0x12, 128, struct copy_range)
>  /*
>   * A jump here: 130-131 are reserved for zoned block devices
>   * (see uapi/linux/blkzoned.h)
> 
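
For what it is worth, a minimal user-space invocation of the proposed ioctl,
based only on the structures in this patch (an RFC interface, not a merged
ABI), would look like this:

	struct range_entry range = { .src = 0, .len = 1 << 20 };
	struct copy_range cr = {
		.dest = 8ULL << 20,
		.nr_range = 1,
		.range_list = (__u64)(uintptr_t)&range,
	};

	if (ioctl(fd, BLKCOPY, &cr) < 0)
		perror("BLKCOPY");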


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH 02/34] block: introduce and use bio_new

2021-01-27 Thread Damien Le Moal
On 2021/01/28 16:12, Chaitanya Kulkarni wrote:
> Introduce bio_new() helper and use it in blk-lib.c to allocate and
> initialize various non-optional or semi-optional members of the bio
> along with bio allocation done with bio_alloc(). Here we also calmp the
> max_bvecs for bio with BIO_MAX_PAGES before we pass to bio_alloc().
> 
> Signed-off-by: Chaitanya Kulkarni 
> ---
>  block/blk-lib.c |  6 +-
>  include/linux/bio.h | 25 +
>  2 files changed, 26 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index fb486a0bdb58..ec29415f00dd 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -14,17 +14,13 @@ struct bio *blk_next_bio(struct bio *bio, struct 
> block_device *bdev,
>   sector_t sect, unsigned op, unsigned opf,
>   unsigned int nr_pages, gfp_t gfp)
>  {
> - struct bio *new = bio_alloc(gfp, nr_pages);
> + struct bio *new = bio_new(bdev, sect, op, opf, gfp, nr_pages);
>  
>   if (bio) {
>   bio_chain(bio, new);
>   submit_bio(bio);
>   }
>  
> - new->bi_iter.bi_sector = sect;
> - bio_set_dev(new, bdev);
> - bio_set_op_attrs(new, op, opf);
> -
>   return new;
>  }
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index c74857cf1252..2a09ba100546 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -826,5 +826,30 @@ static inline void bio_set_polled(struct bio *bio, 
> struct kiocb *kiocb)
>   if (!is_sync_kiocb(kiocb))
>   bio->bi_opf |= REQ_NOWAIT;
>  }
> +/**
> + * bio_new - allcate and initialize new bio
> + * @bdev:blockdev to issue discard for
> + * @sector:  start sector
> + * @op:  REQ_OP_XXX from enum req_opf
> + * @op_flags:REQ_XXX from enum req_flag_bits
> + * @max_bvecs:   maximum bvec to be allocated for this bio
> + * @gfp_mask:memory allocation flags (for bio_alloc)
> + *
> + * Description:
> + *Allocates, initializes common members, and returns a new bio.
> + */
> +static inline struct bio *bio_new(struct block_device *bdev, sector_t sector,
> +   unsigned int op, unsigned int op_flags,
> +   unsigned int max_bvecs, gfp_t gfp_mask)
> +{
> + unsigned nr_bvec = clamp_t(unsigned int, max_bvecs, 0, BIO_MAX_PAGES);
> + struct bio *bio = bio_alloc(gfp_mask, nr_bvec);

I think that depending on the gfp_mask passed, bio can be NULL. So this should
be checked.
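
E.g. (sketch):

	struct bio *bio = bio_alloc(gfp_mask, nr_bvec);

	if (!bio)
		return NULL;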

> +
> + bio_set_dev(bio, bdev);
> + bio->bi_iter.bi_sector = sector;
> + bio_set_op_attrs(bio, op, op_flags);

This function is obsolete. Open code this.
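
For reference, open coding it here is a one-liner (sketch):

	bio->bi_opf = op | op_flags;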

> +
> + return bio;
> +}
>  
>  #endif /* __LINUX_BIO_H */
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH 28/34] zonefs: use bio_new

2021-01-27 Thread Damien Le Moal
On 2021/01/28 16:15, Chaitanya Kulkarni wrote:
> Signed-off-by: Chaitanya Kulkarni 
> ---
>  fs/zonefs/super.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index ab68e27bb322..620d67965a22 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -661,6 +661,7 @@ static const struct iomap_dio_ops zonefs_write_dio_ops = {
>  
>  static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter 
> *from)
>  {
> + unsigned int op = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;

I do not see the point of adding this variable since it is used only for the
bio_new() call. Pass the op value directly.
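
I.e., something like (sketch):

	bio = bio_new(bdev, zi->i_zsector,
		      REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, 0,
		      GFP_NOFS, nr_pages);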

>   struct inode *inode = file_inode(iocb->ki_filp);
>   struct zonefs_inode_info *zi = ZONEFS_I(inode);
>   struct block_device *bdev = inode->i_sb->s_bdev;
> @@ -678,15 +679,12 @@ static ssize_t zonefs_file_dio_append(struct kiocb 
> *iocb, struct iov_iter *from)
>   if (!nr_pages)
>   return 0;
>  
> - bio = bio_alloc(GFP_NOFS, nr_pages);
> + bio = bio_new(bdev, zi->i_zsector, op, 0, GFP_NOFS, nr_pages);
>   if (!bio)
>   return -ENOMEM;
>  
> - bio_set_dev(bio, bdev);
> - bio->bi_iter.bi_sector = zi->i_zsector;
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_ioprio = iocb->ki_ioprio;
> - bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
>   if (iocb->ki_flags & IOCB_DSYNC)
>   bio->bi_opf |= REQ_FUA;
>  
> 


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH 02/34] block: introduce and use bio_new

2021-01-27 Thread Damien Le Moal
On 2021/01/28 16:21, Damien Le Moal wrote:
> On 2021/01/28 16:12, Chaitanya Kulkarni wrote:
>> Introduce bio_new() helper and use it in blk-lib.c to allocate and
>> initialize various non-optional or semi-optional members of the bio
>> along with bio allocation done with bio_alloc(). Here we also calmp the
>> max_bvecs for bio with BIO_MAX_PAGES before we pass to bio_alloc().
>>
>> Signed-off-by: Chaitanya Kulkarni 
>> ---
>>  block/blk-lib.c |  6 +-
>>  include/linux/bio.h | 25 +
>>  2 files changed, 26 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>> index fb486a0bdb58..ec29415f00dd 100644
>> --- a/block/blk-lib.c
>> +++ b/block/blk-lib.c
>> @@ -14,17 +14,13 @@ struct bio *blk_next_bio(struct bio *bio, struct 
>> block_device *bdev,
>>  sector_t sect, unsigned op, unsigned opf,
>>  unsigned int nr_pages, gfp_t gfp)
>>  {
>> -struct bio *new = bio_alloc(gfp, nr_pages);
>> +struct bio *new = bio_new(bdev, sect, op, opf, gfp, nr_pages);
>>  
>>  if (bio) {
>>  bio_chain(bio, new);
>>  submit_bio(bio);
>>  }
>>  
>> -new->bi_iter.bi_sector = sect;
>> -bio_set_dev(new, bdev);
>> -bio_set_op_attrs(new, op, opf);
>> -
>>  return new;
>>  }
>>  
>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index c74857cf1252..2a09ba100546 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -826,5 +826,30 @@ static inline void bio_set_polled(struct bio *bio, 
>> struct kiocb *kiocb)
>>  if (!is_sync_kiocb(kiocb))
>>  bio->bi_opf |= REQ_NOWAIT;
>>  }
>> +/**
>> + * bio_new - allocate and initialize new bio
>> + * @bdev:      block device this bio will be issued to
>> + * @sector:    start sector
>> + * @op:        REQ_OP_XXX from enum req_opf
>> + * @op_flags:  REQ_XXX from enum req_flag_bits
>> + * @max_bvecs: maximum number of bvecs to be allocated for this bio
>> + * @gfp_mask:  memory allocation flags (for bio_alloc)
>> + *
>> + * Description:
>> + *    Allocates, initializes common members, and returns a new bio.
>> + */
>> +static inline struct bio *bio_new(struct block_device *bdev, sector_t 
>> sector,
>> +  unsigned int op, unsigned int op_flags,
>> +  unsigned int max_bvecs, gfp_t gfp_mask)
>> +{
>> +unsigned nr_bvec = clamp_t(unsigned int, max_bvecs, 0, BIO_MAX_PAGES);
>> +struct bio *bio = bio_alloc(gfp_mask, nr_bvec);
> 
> I think that depending on the gfp_mask passed, bio can be NULL. So this should
> be checked.
> 
>> +
>> +bio_set_dev(bio, bdev);
>> +bio->bi_iter.bi_sector = sector;
>> +bio_set_op_attrs(bio, op, op_flags);
> 
> This function is obsolete. Open code this.

And that also means that you could remove one argument from bio_new(): combine
op and op_flags into "unsigned int opf" (see the sketch after the quoted hunk).

> 
>> +
>> +return bio;
>> +}
>>  
>>  #endif /* __LINUX_BIO_H */
>>
> 
> 
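To illustrate, with op and op_flags combined and the NULL check added, the
helper would reduce to this (sketch only, untested):

static inline struct bio *bio_new(struct block_device *bdev, sector_t sector,
				  unsigned int opf, unsigned int max_bvecs,
				  gfp_t gfp_mask)
{
	unsigned int nr_bvec = clamp_t(unsigned int, max_bvecs, 0, BIO_MAX_PAGES);
	struct bio *bio = bio_alloc(gfp_mask, nr_bvec);

	if (!bio)
		return NULL;

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = opf;

	return bio;
}

Callers then simply pass the combined value, e.g. blk_next_bio() would call
bio_new(bdev, sect, op | opf, nr_pages, gfp).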


-- 
Damien Le Moal
Western Digital Research


Re: [RFC PATCH v4 2/3] block: add simple copy support

2021-01-05 Thread Damien Le Moal
On 2021/01/05 21:24, Selva Jove wrote:
> Thanks for the review, Damien.
> 
> On Mon, Jan 4, 2021 at 6:17 PM Damien Le Moal  wrote:
>>
>> On 2021/01/04 19:48, SelvaKumar S wrote:
>>> Add a new BLKCOPY ioctl that offloads copying of one or more source
>>> ranges to a destination in the device. It accepts a copy_ranges structure
>>> that contains the destination, the number of sources, and a pointer to the
>>> array of source ranges. Each range_entry contains the start and length of a
>>> source range (in bytes).
>>>
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
>>> to zoned device.
>>>
>>> If the device doesn't support copy or copy offload is disabled, then
>>> copy is emulated by allocating memory of total copy size. The source
>>> ranges are read into memory by chaining a bio for each source range and
>>> submitting them asynchronously; the last bio waits for completion. After data
>>> is read, it is written to the destination.
>>>
>>> bio_map_kern() is used to allocate bio and add pages of copy buffer to
>>> bio. As bio->bi_private and bio->bi_end_io are needed for chaining the
>>> bio and are overwritten, invalidate_kernel_vmap_range() for the read is called
>>> in the caller.
>>>
>>> Introduce queue limits for simple copy and other helper functions.
>>> Add device limits as sysfs entries.
>>>   - copy_offload
>>>   - max_copy_sectors
>>>   - max_copy_ranges_sectors
>>>   - max_copy_nr_ranges
>>>
>>> copy_offload(= 0) is disabled by default.
>>> max_copy_sectors = 0 indicates the device doesn't support native copy.
>>>
>>> Native copy offload is not supported for stacked devices and is done via
>>> copy emulation.
>>>
>>> Signed-off-by: SelvaKumar S 
>>> Signed-off-by: Kanchan Joshi 
>>> Signed-off-by: Nitesh Shetty 
>>> Signed-off-by: Javier González 
>>> ---
>>>  block/blk-core.c  |  94 ++--
>>>  block/blk-lib.c   | 223 ++
>>>  block/blk-merge.c |   2 +
>>>  block/blk-settings.c  |  10 ++
>>>  block/blk-sysfs.c |  50 +
>>>  block/blk-zoned.c |   1 +
>>>  block/bounce.c|   1 +
>>>  block/ioctl.c |  43 
>>>  include/linux/bio.h   |   1 +
>>>  include/linux/blk_types.h |  15 +++
>>>  include/linux/blkdev.h|  13 +++
>>>  include/uapi/linux/fs.h   |  13 +++
>>>  12 files changed, 458 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 96e5fcd7f071..4a5cd3f53cd2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
>>>  }
>>>  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
>>>
>>> +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
>>> + sector_t nr_sectors, sector_t maxsector)
>>> +{
>>> + if (nr_sectors && maxsector &&
>>> + (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
>>> + handle_bad_sector(bio, maxsector);
>>> + return -EIO;
>>> + }
>>> + return 0;
>>> +}
>>> +
>>>  /*
>>>   * Check whether this bio extends beyond the end of the device or 
>>> partition.
>>>   * This may well happen - the kernel calls bread() without checking the 
>>> size of
>>> @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, 
>>> sector_t maxsector)
>>>   return 0;
>>>  }
>>>
>>> +/*
>>> + * Check for copy limits and remap source ranges if needed.
>>> + */
>>> +static int blk_check_copy(struct bio *bio)
>>> +{
>>> + struct block_device *p = NULL;
>>> + struct request_queue *q = bio->bi_disk->queue;
>>> + struct blk_copy_payload *payload;
>>> + int i, maxsector, start_sect = 0, ret = -EIO;
>>> + unsigned short nr_range;
>>> +
>>> + rcu_read_lock();
>>> +
>>> + p = __disk_get_part(bio->bi_disk, bio->bi_partno);
>>> + if (unlikely(!p))

Re: [PATCH] drivers: block: skd: remove skd_pci_info()

2020-12-13 Thread Damien Le Moal
On Fri, 2020-12-11 at 22:11 +0530, Puranjay Mohan wrote:
> The PCI core calls __pcie_print_link_status() for every device; it prints
> both the link width and the link speed. skd_pci_info() does the same
> thing again, hence it can be removed.

Hmmm... On my box, I see this for the skd card:

[8.509243] pci :d8:00.0: [1b39:0001] type 00 class 0x018000
[8.515933] pci :d8:00.0: reg 0x10: [mem 0xfbe0-0xfbe0]
[8.521924] pci :d8:00.0: reg 0x14: [mem 0xfbe1-0xfbe10fff]
[8.527957] pci :d8:00.0: reg 0x30: [mem 0xfbd0-0xfbdf
pref]
[8.534999] pci :d8:00.0: supports D1 D2

No link speed. Checking the code, I think you need to actually call
pcie_print_link_status() (which calls __pcie_print_link_status() with
verbose = true) from the driver to see anything. Otherwise, the PCIe
core will not print anything if the driver is just probing and getting
resources for the card.

> 
> Signed-off-by: Puranjay Mohan 
> ---
>  drivers/block/skd_main.c | 31 ---
>  1 file changed, 31 deletions(-)
> 
> diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
> index a962b4551bed..da7aac5335d9 100644
> --- a/drivers/block/skd_main.c
> +++ b/drivers/block/skd_main.c

[...]
>  static int skd_pci_probe(struct pci_dev *pdev, const struct pci_device_id 
> *ent)
>  {
>   int i;
>   int rc = 0;
> - char pci_str[32];
>   struct skd_device *skdev;
> 
>   dev_dbg(&pdev->dev, "vendor=%04X device=%04x\n", pdev->vendor,
> @@ -3201,8 +3172,6 @@ static int skd_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   goto err_out_regions;
>   }
> 
> - skd_pci_info(skdev, pci_str);
> - dev_info(&pdev->dev, "%s 64bit\n", pci_str);

Replace these 2 lines with:

pcie_print_link_status(pdev);

And the link speed information will be printed.



-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v1] drivers: block: skd: remove skd_pci_info()

2020-12-14 Thread Damien Le Moal
On 2020/12/15 0:27, Puranjay Mohan wrote:
> Change the call to skd_pci_info() to pcie_print_link_status().
> pcie_print_link_status() can be used to print the link speed and
> the link width; skd_pci_info() does the same and hence it is removed.
> 
> Signed-off-by: Puranjay Mohan 
> ---
> v1 - Add call to pcie_print_link_status()
> ---
>  drivers/block/skd_main.c | 33 +
>  1 file changed, 1 insertion(+), 32 deletions(-)
> 
> diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
> index a962b4551bed..efd69f349043 100644
> --- a/drivers/block/skd_main.c
> +++ b/drivers/block/skd_main.c
> @@ -3134,40 +3134,10 @@ static const struct pci_device_id skd_pci_tbl[] = {
>  
>  MODULE_DEVICE_TABLE(pci, skd_pci_tbl);
>  
> -static char *skd_pci_info(struct skd_device *skdev, char *str)
> -{
> - int pcie_reg;
> -
> - strcpy(str, "PCIe (");
> - pcie_reg = pci_find_capability(skdev->pdev, PCI_CAP_ID_EXP);
> -
> - if (pcie_reg) {
> -
> - char lwstr[6];
> - uint16_t pcie_lstat, lspeed, lwidth;
> -
> - pcie_reg += 0x12;
> - pci_read_config_word(skdev->pdev, pcie_reg, &pcie_lstat);
> - lspeed = pcie_lstat & (0xF);
> - lwidth = (pcie_lstat & 0x3F0) >> 4;
> -
> - if (lspeed == 1)
> - strcat(str, "2.5GT/s ");
> - else if (lspeed == 2)
> - strcat(str, "5.0GT/s ");
> - else
> - strcat(str, " ");
> - snprintf(lwstr, sizeof(lwstr), "%dX)", lwidth);
> - strcat(str, lwstr);
> - }
> - return str;
> -}
> -
>  static int skd_pci_probe(struct pci_dev *pdev, const struct pci_device_id 
> *ent)
>  {
>   int i;
>   int rc = 0;
> - char pci_str[32];
>   struct skd_device *skdev;
>  
>   dev_dbg(&pdev->dev, "vendor=%04X device=%04x\n", pdev->vendor,
> @@ -3201,8 +3171,7 @@ static int skd_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   goto err_out_regions;
>   }
>  
> - skd_pci_info(skdev, pci_str);
> - dev_info(&pdev->dev, "%s 64bit\n", pci_str);
> + pcie_print_link_status(pdev);
>  
>   pci_set_master(pdev);
>   rc = pci_enable_pcie_error_reporting(pdev);
> 

Note: V1 of this patch was the one I commented on. This one should thus be V2.

In any case, this looks OK to me.

Acked-by: Damien Le Moal 


-- 
Damien Le Moal
Western Digital Research


Re: [v1 PATCH 4/4] RISC-V: Support nr_cpus command line option.

2019-03-19 Thread Damien Le Moal
On 2019/03/20 7:20, Atish Patra wrote:
> If the nr_cpus command line option is set, the maximum possible cpu count
> should be set to that value.
> 
> Signed-off-by: Atish Patra 
> ---
>  arch/riscv/kernel/smpboot.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
> index 609475c5..a8fe590c 100644
> --- a/arch/riscv/kernel/smpboot.c
> +++ b/arch/riscv/kernel/smpboot.c
> @@ -86,11 +86,19 @@ void __init setup_smp(void)
>   }
>  
>   cpuid_to_hartid_map(cpuid) = hart;
> - set_cpu_possible(cpuid, true);

This looks weird: the code being changed does not match what patch 3/4 did.

>   cpuid++;
>   }
>  
>   BUG_ON(!found_boot_cpu);
> +
> + if (cpuid > nr_cpu_ids)
> + pr_warn("Total number of cpus [%d] are greater than configured 
> via nr_cpus [%d]\n",

"The total number of cpus [%d] is greater than nr_cpus option value [%d]\n"

> + cpuid, nr_cpu_ids);
> +
> + for (cpuid = 1; cpuid < nr_cpu_ids; cpuid++) {
> + if (cpuid_to_hartid_map(cpuid) != INVALID_HARTID)
> + set_cpu_possible(cpuid, true);
> + }
>  }
>  
>  int __cpu_up(unsigned int cpu, struct task_struct *tidle)
> 


-- 
Damien Le Moal
Western Digital Research


Re: [v1 PATCH 4/4] RISC-V: Support nr_cpus command line option.

2019-03-19 Thread Damien Le Moal
On 2019/03/20 8:56, Damien Le Moal wrote:
> On 2019/03/20 7:20, Atish Patra wrote:
>> If the nr_cpus command line option is set, the maximum possible cpu count
>> should be set to that value.
>>
>> Signed-off-by: Atish Patra 
>> ---
>>  arch/riscv/kernel/smpboot.c | 10 +-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
>> index 609475c5..a8fe590c 100644
>> --- a/arch/riscv/kernel/smpboot.c
>> +++ b/arch/riscv/kernel/smpboot.c
>> @@ -86,11 +86,19 @@ void __init setup_smp(void)
>>  }
>>  
>>  cpuid_to_hartid_map(cpuid) = hart;
>> -set_cpu_possible(cpuid, true);
> 
> This looks weird: the code being changed does not match what patch 3/4 did.

Arg... Ignore this one. My bad. Morning here, I need more coffee :)

> 
>>  cpuid++;
>>  }
>>  
>>  BUG_ON(!found_boot_cpu);
>> +
>> +if (cpuid > nr_cpu_ids)
>> +pr_warn("Total number of cpus [%d] are greater than configured 
>> via nr_cpus [%d]\n",
> 
> "The total number of cpus [%d] is greater than nr_cpus option value [%d]\n"
> 
>> +cpuid, nr_cpu_ids);
>> +
>> +for (cpuid = 1; cpuid < nr_cpu_ids; cpuid++) {
>> +    if (cpuid_to_hartid_map(cpuid) != INVALID_HARTID)
>> +set_cpu_possible(cpuid, true);
>> +}
>>  }
>>  
>>  int __cpu_up(unsigned int cpu, struct task_struct *tidle)
>>
> 
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-26 Thread Damien Le Moal
On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>
>>> This looks like something that could hit every file systems, so
>>> shouldn't we fix this in common code?  We could also look into
>>> just using memalloc_nofs_save for the page cache allocation path
>>> instead of the per-mapping gfp_mask.
>>
>> I think it has to be the entire IO path - any allocation from the
>> underlying filesystem could recurse into the top level filesystem
>> and then deadlock if the memory reclaim submits IO or blocks on
>> IO completion from the upper filesystem. That's a bloody big hammer
>> for something that is only necessary when there are stacked
>> filesystems like this
> 
> Yeah that's why using memalloc_nofs_save() probably makes the most
> sense, and dm_zoned should use that before it calls into ext4.

Unfortunately, with this particular setup, that will not solve the problem.
dm-zoned submits BIOs to its backend drive in response to XFS activity. The
requests for these BIOs are passed along to the kernel tcmu HBA and end up in
that HBA command ring. The commands themselves are read from the ring and
executed by the tcmu-runner user process which executes them doing
pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
context than the dm-zoned worker thread issuing the BIO,
memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

We tried a simpler setup using loopback mount (XFS used directly in an ext4
file) and running the same workload. We failed to recreate a similar deadlock in
this case, but I am strongly suspecting that it can happen too. It is simply
much harder to hit because the IO path from XFS to ext4 is all in-kernel and
asynchronous, whereas tcmu-runner ZBC handler is a synchronous QD=1 path for IOs
which makes it relatively easy to get inter-dependent writes or read+write
queued back-to-back and create the deadlock.

So back to Dave's point, we may need the big-hammer solution in the case
of stacked file systems, while non-stacked setups do not necessarily need it
(that is for the FS to decide). But I do not see how to implement this big
hammer conditionally. How can a file system tell if it is at the top of the
stack (big hammer not needed) or lower than the top level (big hammer needed) ?

One simple hack would be an fcntl() or mount option to tell the FS to use
GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
applications or system setup is correct. So not so safe.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-28 Thread Damien Le Moal
On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this
>>>
>>> Yeah that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).

Yes, I understood your point. And I agree that it indeed would be a big hammer.
We should be able to do better than that :)

>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that 
>> the
>> applications or system setup is correct. So not so safe.
> 
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?

I do not recall such discussion. But indeed FUSE may give some hints. Good idea.
Thanks. I will check.
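
For reference, from the daemon side such an interface could look like the
sketch below (purely hypothetical, the prctl name and semantics are my
invention here):

#include <sys/prctl.h>
#include <err.h>

/* Hypothetical prctl marking the task as a memory reclaim IO worker */
#define PR_SET_IO_FLUSHER 57

int main(void)
{
	/*
	 * Mark this process so that direct reclaim entered from its
	 * allocations does not recurse into file systems, avoiding the
	 * deadlock described in this thread.
	 */
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0))
		err(1, "prctl(PR_SET_IO_FLUSHER)");

	/* ... serve the emulated device IO (e.g. tcmu-runner handler) ... */
	return 0;
}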


-- 
Damien Le Moal
Western Digital Research


[PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Damien Le Moal
In ext4_[da_]write_begin(), grab_cache_page_write_begin() is being
called without GFP_NOFS set for the context. This is considered adequate
as any eventual memory reclaim triggered by a page allocation is being
done before the transaction handle for the write operation is started.

However, with the following setup:

* fio creating and writing files on XFS
* XFS file system on top of dm-zoned target device
* dm-zoned target created on tcmu-runner emulated ZBC disk
* emulated ZBC disk backend file on ext4
* ext4 file system on regular SATA SSD

A deadlock was observed under the heavy file write workload generated
using fio. The deadlock is clearly apparent from the tcmu-runner handler
task backtrace:

tcmu-runner call Trace:
wait_for_completion+0x12c/0x170 <-- deadlock (see below text)
xfs_buf_iowait+0x39/0x1c0 [xfs]
__xfs_buf_submit+0x118/0x2e0 [xfs]  <-- XFS issues writes to
dm-zoned device, which in
turn will issue writes to
the emulated ZBC device
handled by tcmu-runner.
xfs_bwrite+0x25/0x60 [xfs]
xfs_reclaim_inode+0x303/0x330 [xfs]
xfs_reclaim_inodes_ag+0x223/0x450 [xfs]
xfs_reclaim_inodes_nr+0x31/0x40 [xfs]   <-- Absence of GFP_NOFS allows
reclaim to recurse into XFS
reclaim for the XFS file
system on the emulated
device
super_cache_scan+0x153/0x1a0
do_shrink_slab+0x17b/0x3c0
shrink_slab+0x170/0x2c0
shrink_node+0x1d6/0x4a0
do_try_to_free_pages+0xdb/0x3c0
try_to_free_pages+0x112/0x2e0   <-- Page reclaim triggers
__alloc_pages_slowpath+0x422/0x1020
__alloc_pages_nodemask+0x37f/0x400
pagecache_get_page+0xb4/0x390
grab_cache_page_write_begin+0x1d/0x40
ext4_da_write_begin+0xd6/0x530
generic_perform_write+0xc2/0x1e0
__generic_file_write_iter+0xf9/0x1d0
ext4_file_write_iter+0xc6/0x3b0
new_sync_write+0x12d/0x1d0
vfs_write+0xdb/0x1d0
ksys_pwrite64+0x65/0xa0 <-- tcmu-runner ZBC handler writes to
the backend file in response to
dm-zoned target write IO

When XFS reclaim issues write IOs to the dm-zoned target device,
dm-zoned issues write IOs to the tcmu-runner emulated device. However,
the tcmu ZBC handler is single threaded and blocked waiting for the
completion of the started pwrite() call, and so does not process the
newly issued write IOs necessary to complete page reclaim. The system
is in a deadlocked state. This problem is 100% reproducible, the
deadlock happening after fio runs for a few minutes.

A similar deadlock was also observed for read operations into the ext4
backend file on a page cache miss triggering a page allocation and page
reclaim. After switching the tcmu emulated ZBC disk backend file from an
ext4 file to an XFS file, none of these deadlocks are observed.

Fix this problem by removing __GFP_FS from ext4 inode mapping gfp_mask.
The code used for this fix is borrowed from XFS xfs_setup_inode(). The
inode mapping gfp_mask initialization is added to ext4_set_aops().

Reported-by: Masato Suzuki 
Signed-off-by: Damien Le Moal 
---
 fs/ext4/inode.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..f882929037df 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1292,8 +1292,7 @@ static int ext4_write_begin(struct file *file, struct 
address_space *mapping,
 * grab_cache_page_write_begin() can take a long time if the
 * system is thrashing due to memory pressure, or if the page
 * is being written back.  So grab it first before we start
-* the transaction handle.  This also allows us to allocate
-* the page (if needed) without using GFP_NOFS.
+* the transaction handle.
 */
 retry_grab:
page = grab_cache_page_write_begin(mapping, index, flags);
@@ -3084,8 +3083,7 @@ static int ext4_da_write_begin(struct file *file, struct 
address_space *mapping,
 * grab_cache_page_write_begin() can take a long time if the
 * system is thrashing due to memory pressure, or if the page
 * is being written back.  So grab it first before we start
-* the transaction handle.  This also allows us to allocate
-* the page (if needed) without using GFP_NOFS.
+* the transaction handle.
 */
 retry_grab:
page = grab_cache_page_write_begin(mapping, index, flags);
@@ -4003,6 +4001,8 @@ static const struct address_space_operations 
ext4_dax_aops = {
 
 void ext4_set_aops(struct inode *inode)
 {
+   gfp_t gfp_mask;
+
switch (ext4_inode_journal_mode(inode)) {
case EXT4_INODE_ORDERED_DATA_MODE:
case EXT4_INODE_WRIT

Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Damien Le Moal
On 2019/07/25 20:54, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +gfp_t gfp_mask;
>> +
>>  switch (ext4_inode_journal_mode(inode)) {
>>  case EXT4_INODE_ORDERED_DATA_MODE:
>>  case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>>  inode->i_mapping->a_ops = &ext4_da_aops;
>>  else
>>  inode->i_mapping->a_ops = &ext4_aops;
>> +
>> +/*
>> + * Ensure all page cache allocations are done from GFP_NOFS context to
>> + * prevent direct reclaim recursion back into the filesystem and blowing
>> + * stacks or deadlocking.
>> + */
>> +gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I agree. I did a quick scan and it looks to me like not all file systems are
using GFP_NOFS when grabbing pages for read or write. XFS, btrfs, f2fs, jfs,
gfs2 and ramfs seem OK, but I did not dig very deep for others where the use of
GFP_NOFS is not obvious.

I am not sure though how to approach a global fix. At the very least, it may be
good to have GFP_NOFS set by default in the inode mapping flags in
__address_space_init_once() which is called from inode_init_once(). But I am not
sure if that is enough nor if all file systems are using this function.

The other method as you suggest would be to add calls to
memalloc_nofs_save/restore() in functions like grab_cache_page_write_begin(),
but since that would cover writes only, we may want to do that at a higher level
in the various generic_xxx() and mpage_xxx() helper functions to cover more
ground. But that would still not be enough for the file systems not using these
helpers (plenty of examples for that for the write path).
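
For the write path helper, that would be something like the sketch below (my
assumption of where the calls would go, untested):

struct page *grab_cache_page_write_begin(struct address_space *mapping,
					 pgoff_t index, unsigned flags)
{
	unsigned int nofs_flag;
	struct page *page;

	/*
	 * Run the page allocation in a NOFS scope so that any direct
	 * reclaim it triggers cannot recurse into a file system.
	 */
	nofs_flag = memalloc_nofs_save();
	page = pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_WRITE | FGP_CREAT,
				  mapping_gfp_mask(mapping));
	memalloc_nofs_restore(nofs_flag);
	if (page)
		wait_for_stable_page(page);

	return page;
}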

Or do we go as high as the VFS layer to add memalloc_nofs_save/restore() calls ?
That would be a big hammer fix... I personally think this would be OK though,
but I may be missing some points here.

Thoughts on the best approach ?

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-29 Thread Damien Le Moal
Andreas,

On 2019/07/30 3:40, Andreas Dilger wrote:
> On Jul 26, 2019, at 8:59 PM, Damien Le Moal  wrote:
>>
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file systems, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this
>>>
>>> Yeah that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process which executes them doing
>> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
>> context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
> in the BIO so that the worker thread will also get the correct behaviour
> when it is processing this IO submission.

But there is no BIO to work with here: dm-zoned BIOs are transformed into
application (tcmu-runner running the ZBC file handler) calls to pread()/pwrite()
on an ext4 file. And processing of these calls triggers the direct reclaim and
FS recursion which cause the deadlock. So BIOs are not the proper vehicle to
transport the information about the NOFS allocation.

If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
cleanest approach.
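
To illustrate (all names here are hypothetical, nothing like this exists
today), such a flag would essentially wrap the entire read/write operation in
a NOFS allocation scope:

/* Hypothetical flag value, for illustration only */
#define RWF_NOFSRECLAIM	((__force __kernel_rwf_t)0x00000020)

static ssize_t do_write_iter_nofsreclaim(struct file *file, struct kiocb *iocb,
					 struct iov_iter *iter)
{
	unsigned int nofs_flag;
	ssize_t ret;

	/*
	 * Any direct reclaim triggered by allocations done under this
	 * call will not recurse into a file system.
	 */
	nofs_flag = memalloc_nofs_save();
	ret = call_write_iter(file, iocb, iter);
	memalloc_nofs_restore(nofs_flag);

	return ret;
}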

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-30 Thread Damien Le Moal
Dave,

On 2019/07/31 8:48, Dave Chinner wrote:
> On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote:
>> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
>> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be 
>> the
>> cleanest approach.
> 
> Clean, yes, but I'm not sure we want to expose kernel memory reclaim
> capabilities to userspace... It would be misleading, too, because we
> still want to allow reclaim to occur, just not have reclaim recurse
> into other filesystems

When I wrote RWF_NORECLAIM, I was really thinking of RWF_NOFSRECLAIM. That is,
suppressing direct reclaim from recursing into another FS rather than
completely disabling reclaim. Sorry for the confusion.

Would this be better ? This is still application controlled, so it is
debatable whether such control should be given. Most likely this would need
to be limited to CAP_SYS capable user processes (e.g. root processes).

I still need to check on FUSE if anything at all along these lines exists there.
I will dig.

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: lift the xfs writepage code into iomap v3

2019-07-30 Thread Damien Le Moal
Darrick,

On 2019/07/30 10:17, Darrick J. Wong wrote:
> On Mon, Jul 22, 2019 at 11:50:12AM +0200, Christoph Hellwig wrote:
>> Hi all,
>>
>> this series cleans up the xfs writepage code and then lifts it to
>> fs/iomap.c so that it could be use by other file system.  I've been
>> wanting to this for a while so that I could eventually convert gfs2
>> over to it, but I never got to it.  Now Damien has a new zonefs
>> file system for semi-raw access to zoned block devices that would
>> like to use the iomap code instead of reinventing it, so I finally
>> had to do the work.
> 
> I've posted a branch against -rc2 for us all to work from:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/log/?h=iomap-writeback
> 
> and will be emailing the patches shortly to the list for completeness.

I rebased zonefs on this branch and all tests passed; no compilation problems
either. I will send a v2 of the zonefs patch with all of Dave's comments
addressed shortly.

Thank you.

Best regards.



-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 2/4] null_blk: add zone open, close, and finish support

2019-06-21 Thread Damien Le Moal
	case BLK_ZONE_COND_EMPTY:
		break;
	case BLK_ZONE_COND_EXP_OPEN:
		if (zone->wp == zone->start) {
			zone->cond = BLK_ZONE_COND_EMPTY;
			break;
		}
		/* fallthrough */
	default:
		zone->cond = BLK_ZONE_COND_CLOSED;
	}

> + return;
> + case REQ_OP_ZONE_FINISH:
> + zone->cond = BLK_ZONE_COND_FULL;
> + zone->wp = zone->start + zone->len;
> + return;
> + default:
> + /* Invalid zone condition */
> + cmd->error = BLK_STS_IOERR;
> + return;
> + }
>  }
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Damien Le Moal
t nvme_process_cq(struct nvme_queue 
> *nvmeq, u16 *start,
>  
>   *start = nvmeq->cq_head;
>   while (nvme_cqe_pending(nvmeq)) {
> - if (tag == -1U || nvmeq->cqes[nvmeq->cq_head].command_id == tag)
> + u16 ctag = nvmeq->cqes[nvmeq->cq_head].command_id;
> +
> + ctag -= nvmeq->tag_offset;
> + if (tag == -1U || ctag == tag)
>   found++;
>   nvme_update_cq_head(nvmeq);
>   }
> @@ -1487,6 +1492,10 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
> qid, int depth)
>   nvmeq->qid = qid;
>   dev->ctrl.queue_count++;
>  
> + if (qid && (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS))
> + nvmeq->tag_offset = NVME_AQ_DEPTH;
> + else
> + nvmeq->tag_offset = 0;
>   return 0;
>  
>   free_cqdma:
> @@ -2106,6 +2115,14 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>   unsigned long size;
>  
>   nr_io_queues = max_io_queues();
> +
> + /*
> +  * If tags are shared with admin queue (Apple bug), then
> +  * make sure we only use one queue.
> +  */
> + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
> + nr_io_queues = 1;
> +
>   result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
>   if (result < 0)
>   return result;
> @@ -2300,6 +2317,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  {
>   int result = -ENOMEM;
>   struct pci_dev *pdev = to_pci_dev(dev->dev);
> + unsigned int mqes;
>  
>   if (pci_enable_device_mem(pdev))
>   return result;
> @@ -2325,8 +2343,8 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>   dev->ctrl.cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
>  
> - dev->q_depth = min_t(int, NVME_CAP_MQES(dev->ctrl.cap) + 1,
> - io_queue_depth);
> + mqes = NVME_CAP_MQES(dev->ctrl.cap);
> + dev->q_depth = min_t(int, mqes + 1, io_queue_depth);
>   dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
>   dev->dbs = dev->bar + 4096;
>  
> @@ -2340,6 +2358,23 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>   else
>   dev->io_sqes = NVME_NVM_IOSQES;
>  
> + /*
> +  * Another Apple one: If we're going to offset the IO queue tags,
> +  * we still want to make sure they are no bigger than MQES,
> +  * it *looks* like otherwise, bad things happen (I suspect some
> +  * of the tag tracking in that device is limited).
> +  */
> + if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) {
> + if (mqes <= NVME_AQ_DEPTH) {
> + dev_err(dev->ctrl.device, "Apple shared tags quirk"
> + " not compatible with device mqes %d\n", mqes);
> + result = -ENODEV;
> + goto disable;
> + }
> + dev->q_depth = min_t(int, dev->q_depth,
> +  mqes - NVME_AQ_DEPTH + 1);
> + }
> +
>   /*
>* Temporary fix for the Apple controller found in the MacBook8,1 and
>* some MacBook7,1 to avoid controller resets and data loss.
> @@ -3057,7 +3092,8 @@ static const struct pci_device_id nvme_id_table[] = {
>   { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
>   { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
>   .driver_data = NVME_QUIRK_SINGLE_VECTOR |
> - NVME_QUIRK_128_BYTES_SQES },
> + NVME_QUIRK_128_BYTES_SQES |
> + NVME_QUIRK_SHARED_TAGS },
>   { 0, }
>  };
>  MODULE_DEVICE_TABLE(pci, nvme_id_table);
> 
> 
> 
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Damien Le Moal
On 2019/07/19 13:49, Benjamin Herrenschmidt wrote:
> On Fri, 2019-07-19 at 04:43 +0000, Damien Le Moal wrote:
>> On 2019/07/19 13:37, Benjamin Herrenschmidt wrote:
>>> Another issue with the Apple T2 based 2018 controllers seem to be
>>> that they blow up (and shut the machine down) if there's a tag
>>> collision between the IO queue and the Admin queue.
>>>
>>> My suspicion is that they use our tags for their internal tracking
>>> and don't mix them with the queue id. They also seem to not like
>>> when tags go beyond the IO queue depth, ie 128 tags.
>>>
>>> This adds a quirk that offsets all the tags in the IO queue by 32
>>> to avoid those collisions. It also limits the number of IO queues
>>> to 1 since the code wouldn't otherwise make sense (the device
>>> supports only one queue anyway but better safe than sorry) and
>>> reduces the size of the IO queue
>>
>> What about keeping the IO queue QD at 128, but marking the first 32 tags as
>> "allocated" when the device is initialized ? This way, these tags numbers are
>> never used for regular IOs and you can avoid the entire tag +/- offset dance 
>> at
>> runtime. The admin queue uses tags 0-31 and the IO queue uses tags 32-127. 
>> No ?
> 
> I suppose that would work and be simpler. I honestly don't know much
> about the block layer and tag allocation so I stayed away from it :-)
> 
> I'll dig, but a hint would be welcome :)

Uuuh.. Never played with the tag management code directly myself either. A
quick look seems to indicate that blk_mq_get/put_tag() is what you should be
using. But looking further, struct blk_mq_tags has the field nr_reserved_tags
which is used as an offset starting point when searching for free tags, which
is exactly what you would need.
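
If that field is wired up through the tag set, the quirk handling may reduce
to something like this sketch (my assumption, untested):

	/*
	 * Reserve the low NVME_AQ_DEPTH tags of the IO queue tag set so
	 * that regular IO tags start at 32 and can never collide with
	 * the admin queue tags.
	 */
	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
		dev->tagset.reserved_tags = NVME_AQ_DEPTH;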


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 2/8] scsi: take the DMA max mapping size into account

2019-07-22 Thread Damien Le Moal
On 2019/07/22 15:01, Ming Lei wrote:
> On Tue, Jun 18, 2019 at 4:57 AM Bart Van Assche  wrote:
>>
>> On 6/17/19 5:19 AM, Christoph Hellwig wrote:
>>> We need to limit the devices max_sectors to what the DMA mapping
>>> implementation can support.  If not we risk running out of swiotlb
>>> buffers easily.
>>>
>>> Signed-off-by: Christoph Hellwig 
>>> ---
>>>   drivers/scsi/scsi_lib.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index d333bb6b1c59..f233bfd84cd7 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -1768,6 +1768,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, 
>>> struct request_queue *q)
>>>   blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
>>>   }
>>>
>>> + shost->max_sectors = min_t(unsigned int, shost->max_sectors,
>>> + dma_max_mapping_size(dev) << SECTOR_SHIFT);
>>>   blk_queue_max_hw_sectors(q, shost->max_sectors);
>>>   if (shost->unchecked_isa_dma)
>>>   blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
>>
>> Does dma_max_mapping_size() return a value in bytes? Is
>> shost->max_sectors a number of sectors? If so, are you sure that "<<
>> SECTOR_SHIFT" is the proper conversion? Shouldn't that be ">>
>> SECTOR_SHIFT" instead?
> 
> Now the patch has been committed, '<< SECTOR_SHIFT' needs to be fixed.
> 
> Also the following kernel oops is triggered on qemu, and looks
> device->dma_mask is NULL.

Just hit the exact same problem using tcmu-runner (ZBC file handler) on bare
metal (no QEMU). dev->dma_mask is NULL. No problem with real disks though.
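
FWIW, combining the two problems reported in this thread, the fix may look
like this (a sketch, not a tested patch): shift right to convert bytes to
sectors, and skip the clamping for devices without a dma_mask set:

	if (dev->dma_mask)
		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
				dma_max_mapping_size(dev) >> SECTOR_SHIFT);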

> 
> [5.826483] scsi host0: Virtio SCSI HBA
> [5.829302] st: Version 20160209, fixed bufsize 32768, s/g segs 256
> [5.831042] SCSI Media Changer driver v0.25
> [5.832491] 
> ==
> [5.82] BUG: KASAN: null-ptr-deref in
> dma_direct_max_mapping_size+0x30/0x94
> [5.82] Read of size 8 at addr  by task kworker/u17:0/7
> [5.835506] nvme nvme0: pci function :00:07.0
> [5.82]
> [5.82] CPU: 2 PID: 7 Comm: kworker/u17:0 Not tainted 5.3.0-rc1 #1328
> [5.836999] ahci :00:1f.2: version 3.0
> [5.82] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS ?-20180724_192412-buildhw-07.phx4
> [5.82] Workqueue: events_unbound async_run_entry_fn
> [5.82] Call Trace:
> [5.82]  dump_stack+0x6f/0x9d
> [5.82]  ? dma_direct_max_mapping_size+0x30/0x94
> [5.82]  __kasan_report+0x161/0x189
> [5.82]  ? dma_direct_max_mapping_size+0x30/0x94
> [5.82]  kasan_report+0xe/0x12
> [5.82]  dma_direct_max_mapping_size+0x30/0x94
> [5.82]  __scsi_init_queue+0xd8/0x1f3
> [5.82]  scsi_mq_alloc_queue+0x62/0x89
> [5.82]  scsi_alloc_sdev+0x38c/0x479
> [5.82]  scsi_probe_and_add_lun+0x22d/0x1093
> [5.82]  ? kobject_set_name_vargs+0xa4/0xb2
> [5.82]  ? mutex_lock+0x88/0xc4
> [5.82]  ? scsi_free_host_dev+0x4a/0x4a
> [5.82]  ? _raw_spin_lock_irqsave+0x8c/0xde
> [5.82]  ? _raw_write_unlock_irqrestore+0x23/0x23
> [5.82]  ? ata_tdev_match+0x22/0x45
> [5.82]  ? attribute_container_add_device+0x160/0x17e
> [5.82]  ? rpm_resume+0x26a/0x7c0
> [5.82]  ? kobject_get+0x12/0x43
> [5.82]  ? rpm_put_suppliers+0x7e/0x7e
> [5.82]  ? _raw_spin_lock_irqsave+0x8c/0xde
> [5.82]  ? _raw_write_unlock_irqrestore+0x23/0x23
> [5.82]  ? scsi_target_destroy+0x135/0x135
> [5.82]  __scsi_scan_target+0x14b/0x6aa
> [5.82]  ? pvclock_clocksource_read+0xc0/0x14e
> [5.82]  ? scsi_add_device+0x20/0x20
> [5.82]  ? rpm_resume+0x1ae/0x7c0
> [5.82]  ? rpm_put_suppliers+0x7e/0x7e
> [5.82]  ? _raw_spin_lock_irqsave+0x8c/0xde
> [5.82]  ? _raw_write_unlock_irqrestore+0x23/0x23
> [5.82]  ? pick_next_task_fair+0x976/0xa3d
> [5.82]  ? mutex_lock+0x88/0xc4
> [5.82]  scsi_scan_channel+0x76/0x9e
> [5.82]  scsi_scan_host_selected+0x131/0x176
> [5.82]  ? scsi_scan_host+0x241/0x241
> [5.82]  do_scan_async+0x27/0x219
> [5.82]  ? scsi_scan_host+0x241/0x241
> [5.82]  async_run_entry_fn+0xdc/0x23d
> [5.82]  process_one_work+0x327/0x539
> [5.82]  worker_thread+0x330/0x492
> [5.82]  ? rescuer_thread+0x41f/0x41f
> [5.82]  kthread+0x1c6/0x1d5
> [5.82]  ? kthread_park+0xd3/0xd3
> [5.82]  ret_from_fork+0x1f/0x30
> [5.82] 
> ==
> 
> 
> 
> Thanks,
> Ming Lei
> 


-- 
Damien Le Moal
Western Digital Research

