Hello,

Some time ago, after upgrading the kernel on our i.MX6Q-based boards to 
mainline 4.18, and now to the LTS 4.19 line, we started noticing strange 
warnings from the 'read' syscall during stress tests, triggered when the 
page_copy_sane() check fails. Typical reproducibility is up to ~4 events per 
24h. The warnings originate from various processes, mostly ones involved in 
the stress tests, but not necessarily ones touching the block devices we're 
stressing. If the warning appeared in a process related to the block device 
stress test, it was accompanied by corrupted data, as the read operation gets 
aborted.
Our tests periodically drop caches via /proc/sys/vm/drop_caches; the reason 
for doing so is to ensure that the actual block device gets accessed during 
the test without resorting to O_DIRECT.
When dropping caches was disabled in the tests, the issue also disappeared, 
at least over a single-week run.
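
For reference, the cache dropping in the tests is done roughly along these 
lines (a minimal userspace sketch, assuming a hypothetical drop_caches() 
helper - not the actual test code):

#include <fcntl.h>
#include <unistd.h>

/* Sync dirty data first, then write "3" to /proc/sys/vm/drop_caches so that
 * subsequent reads have to hit the block device again. */
static int drop_caches(void)
{
	int fd, ret = -1;

	sync();
	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, "3", 1) == 1)	/* 3 = pagecache + dentries/inodes */
		ret = 0;
	close(fd);
	return ret;
}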

Example processes caught in the last run (grepped from stacktraces):
[38128.430694] CPU: 2 PID: 32735 Comm: glmark2 Not tainted 
4.19.32-devboardimx6q+g0a64e3717985 #1
[44168.081821] CPU: 2 PID: 21555 Comm: cat Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[75879.424076] CPU: 3 PID: 1478 Comm: flashcp Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[117060.951333] CPU: 1 PID: 14150 Comm: cat Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[201668.438219] CPU: 1 PID: 14370 Comm: glmark2 Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[214332.212960] CPU: 0 PID: 25633 Comm: cp Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[283484.619058] CPU: 0 PID: 5091 Comm: glmark2 Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[330256.709124] CPU: 2 PID: 4565 Comm: sensors Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[337685.549556] CPU: 2 PID: 30132 Comm: sensors Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1

When I started debugging the issue, I noticed that in all cases we're dealing 
with highmem zero-order pages. In that case compound_head(page) == page, so 
page_address(page) should be equal to page_address(head).
However, that is not what happens: page_address(head) in each case returns 
zero, causing the value of "v" to explode and the check to fail.
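
To illustrate with the numbers from the trace quoted below (order-0 page, so 
head == page), here is a self-contained userspace sketch of the failing 
check, with the values hard-coded from the log:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
	size_t n = 4096, offset = 0;
	size_t page_addr = 2146996224UL;	/* page_address(page) from the warning */
	size_t head_addr = 0;			/* page_address(head), unexpectedly NULL */
	size_t limit = 4096;			/* PAGE_SIZE << compound_order(head), order 0 */
	size_t v = n + offset + page_addr - head_addr;

	/* prints v=2147000320 sane=0: v is far above PAGE_SIZE, so the
	 * "v <= (PAGE_SIZE << compound_order(head))" part fails and WARN fires */
	printf("v=%zu sane=%d\n", v, n <= v && v <= limit);
	return 0;
}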

At first I thought that it could be caused by a modification of the struct 
page itself, but that turned out not to be the case - sometimes the struct 
page stayed unmodified for the whole duration of the check. With lowmem pages 
this also wouldn't happen, because their page_address() is derived directly 
from the struct page pointer itself.
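
For context, this is roughly what page_address() does on 4.19 (paraphrased 
from mm/highmem.c and trimmed to the parts relevant here - not a verbatim 
quote):

void *page_address(const struct page *page)
{
	struct page_address_slot *pas;
	struct page_address_map *pam;
	unsigned long flags;
	void *ret = NULL;

	if (!PageHighMem(page))
		/* lowmem: virtual address is computed arithmetically from the
		 * struct page pointer, so it can never "disappear" */
		return lowmem_page_address(page);

	/* highmem: walk the hash bucket of currently kmap'ed pages */
	pas = page_slot(page);
	spin_lock_irqsave(&pas->lock, flags);
	list_for_each_entry(pam, &pas->lh, list) {
		if (pam->page == page) {
			ret = pam->virtual;
			break;
		}
	}
	spin_unlock_irqrestore(&pas->lock, flags);

	/* NULL simply means "not mapped at the moment" */
	return ret;
}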

Finally, after gathering a few traces with added debug logs, I decided to 
trigger a panic on the warning and capture a vmcore.
When analyzing the vmcore, I confirmed that the struct page_address_slot 
corresponding to the struct page had an empty list of mappings, which explains 
the NULL returned from page_address(head) in page_copy_sane().
Yeah, I had to calculate the hash of the struct page pointer by hand ;)
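
The lookup I had to redo by hand in the vmcore is basically this (again 
paraphrased from mm/highmem.c, 4.19): hash the struct page pointer to find 
the bucket, then check its list of mappings.

#define PA_HASH_ORDER 7

static struct page_address_slot {
	struct list_head lh;		/* list of page_address_map entries */
	spinlock_t lock;
} page_address_htable[1 << PA_HASH_ORDER];

static struct page_address_slot *page_slot(const struct page *page)
{
	return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
}

In the captured vmcore, the bucket for the offending page had an empty lh 
list, hence page_address() returning NULL.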

In the meantime I noticed a patch by Eric Dumazet ("iov_iter: optimize 
page_copy_sane()") [1]. Applying this patch silenced the warnings, at least 
for a test run lasting over a week. This is quite expected, as with this 
change 'page' isn't dereferenced at all in the common order-0 case, nor is 
page_address() called. I suspect, though, that applying this patch may only 
hide the real issue I'm facing, and that the relaxed check might miss the 
highmem case as well.
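
For reference, as I read commit 6daef95b8c91, the check essentially becomes 
the following (condensed by me, not a verbatim quote of the commit):

static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
{
	struct page *head;
	size_t v = n + offset;

	/* fast path: requests fitting one page, no need to look at the page */
	if (n <= v && v <= PAGE_SIZE)
		return true;

	/* slow path: use struct page pointer arithmetic instead of
	 * page_address(), so highmem mappings are never consulted */
	head = compound_head(page);
	v += (page - head) << PAGE_SHIFT;
	if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
		return true;
	WARN_ON(1);
	return false;
}

So even when the slow path runs, it no longer depends on the highmem mapping 
being present - which would explain why my warnings go away, but also why it 
cannot tell whether the mapping disappeared legitimately.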

An example stacktrace I captured, with added debug logs and a copy of the 
struct page taken before and after the check in page_copy_sane():

This trace shows no change to the struct page across the call to 
page_copy_sane(): the first call to page_address(page) returns a valid 
address (page_address=2146996224), while the second one returns 0 
(head_address=0). Subsequent calls to page_address() inside the WARN() 
argument list also return 0. The same was true for the occurrence I captured 
the vmcore for.

(For your reference, a patch containing my debug prints is at the end of the 
message)

[337685.344204] ------------[ cut here ]------------
[337685.356870] WARNING: CPU: 0 PID: 30132 at 
/mnt/work/git/platforms-manifest/build_devboardimx6q/tmp/work-shared/devboardimx6q/kernel-source/lib/iov_iter.c:834
 page_copy_sane+0x13c/0x178
[337685.420706] page_copy_sane: page=7721123e, offset=0, n=4096, v=2147000320, 
head=7721123e, page_address=2146996224, head_address=0, compound_order(head)=0, 
page_address(page)=0, page_address(head)=0 page_address(page_copy)=0, 
flags(page)=referenced|uptodate|lru|active|arch_1, 
flags(page_copy)=referenced|uptodate|lru|active|arch_1
[337685.494736] Modules linked in: xt_nat iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables 
usb_f_mass_storage usb_f_rndis u_ether ci_hdrc_imx ci_hdrc usbmisc_imx ulpi 
libcomposite configfs udc_core
[337685.549556] CPU: 2 PID: 30132 Comm: sensors Tainted: G        W         
4.19.32-devboardimx6q+g0a64e3717985 #1
[337685.558314] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[337685.563600] [<8010fe44>] (unwind_backtrace) from [<8010b840>] 
(show_stack+0x10/0x14)
[337685.570081] [<8010b840>] (show_stack) from [<807a5958>] 
(dump_stack+0x88/0x9c)
[337685.576033] [<807a5958>] (dump_stack) from [<8011f500>] (__warn+0xfc/0x114)
[337685.581721] [<8011f500>] (__warn) from [<8011f560>] 
(warn_slowpath_fmt+0x48/0x6c)
[337685.587936] [<8011f560>] (warn_slowpath_fmt) from [<803a0664>] 
(page_copy_sane+0x13c/0x178)
[337685.595004] [<803a0664>] (page_copy_sane) from [<803a2fe4>] 
(copy_page_to_iter+0x18/0x474)
[337685.602020] [<803a2fe4>] (copy_page_to_iter) from [<801bd9d4>] 
(generic_file_read_iter+0x2cc/0x980)
[337685.609794] [<801bd9d4>] (generic_file_read_iter) from [<8020949c>] 
(__vfs_read+0xf8/0x158)
[337685.616866] [<8020949c>] (__vfs_read) from [<80209588>] 
(vfs_read+0x8c/0x118)
[337685.622717] [<80209588>] (vfs_read) from [<80209aac>] (ksys_read+0x4c/0xac)
[337685.628396] [<80209aac>] (ksys_read) from [<80101000>] 
(ret_fast_syscall+0x0/0x54)
[337685.634668] Exception stack(0xa4e79fa8 to 0xa4e79ff0)
[337685.638427] 9fa0:                   00000008 00000000 00000003 7ed03208 
00000200 00000000
[337685.645318] 9fc0: 00000008 00000000 76fe8958 00000003 00000000 7ed03208 
76fe5640 7ed031b4
[337685.652205] 9fe0: 76fe7cf0 7ed0316c 76fbeee0 76fd1b1c
[337685.717080] ---[ end trace 660d072e57b3d168 ]---
[337685.720698] page_copy 00000000: 6c 04 00 40 84 1f f4 ef dc aa c4 80 5c e6 
9b ed  l..@........\...
[337685.720713] page_copy 00000010: 00 00 00 00 00 00 00 00 ff ff ff ff 02 00 
00 00  ................
[337685.720721] *page     00000000: 6c 04 00 40 84 1f f4 ef dc aa c4 80 5c e6 
9b ed  l..@........\...
[337685.720728] *page     00000010: 00 00 00 00 00 00 00 00 ff ff ff ff 02 00 
00 00  ................

The scenario of dropping caches via /proc/sys/vm/drop_caches seems to be 
rarely used (and tested); however, I suspect that while caches are being 
dropped, some pages might get unmapped too early, causing other 
hard-to-notice problems.

As my findings are quite worrisome, I'd be glad to hear whether my analysis 
is valid, and whether the optimization made by Eric can be considered 'the 
fix', or whether that would be too optimistic.
Also, as I'd like to investigate the issue further, I'd be grateful for any 
hints on where to continue.

[1] https://lkml.org/lkml/2019/2/26/686, merged as 6daef95b8c91 in master

Patch containing debug logs:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 8be175df3075..ded7e073434c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -815,14 +815,25 @@ bool _copy_from_iter_full_nocache(void *addr, size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(_copy_from_iter_full_nocache);
 
-static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
+static noinline bool page_copy_sane(struct page *page, size_t offset, size_t n)
 {
+       struct page page_copy = *page;
        struct page *head = compound_head(page);
-       size_t v = n + offset + page_address(page) - page_address(head);
+       size_t page_addr = (size_t) page_address(page);
+       size_t head_addr = (size_t) page_address(head);
+       size_t v = n + offset + page_addr - head_addr;
+       unsigned int order = compound_order(head);
 
-       if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
+       if (likely(n <= v && v <= (PAGE_SIZE << order)))
                return true;
-       WARN_ON(1);
+       WARN(1, "%s: page=%p, offset=%zu, n=%zu, v=%zu, head=%p, page_address=%zu,"
+               " head_address=%zu, compound_order(head)=%u, page_address(page)=%zu,"
+               " page_address(head)=%zu page_address(page_copy)=%zu, flags(page)=%pGp, flags(page_copy)=%pGp\n",
+            __func__, page, offset, n, v, head, page_addr, head_addr, order,
+           (size_t) page_address(page), (size_t) page_address(head),
+           (size_t) page_address(&page_copy), &page->flags, &page_copy.flags);
+       print_hex_dump_bytes("page_copy ", DUMP_PREFIX_OFFSET, &page_copy, sizeof(page_copy));
+       print_hex_dump_bytes("*page     ", DUMP_PREFIX_OFFSET, page, sizeof(*page));
        return false;
 }

--
Pozdrawiam / With kind regards,
Lech Perczak
Sr. Software Engineer
Camlin Technologies Poland Limited Sp. z o.o.
