On 2021-May-23, at 01:27, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-May-23, at 00:44, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> On 2021-May-21, at 17:56, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>> 
>>> Mark Millard wrote:
>>> [stuff snipped]
>>>> Well, why is it that ls -R, find, and diff -r all get file
>>>> name problems via genet0 but diff -r gets no problems
>>>> comparing the content of files that it does match up (the
>>>> vast majority)? Any clue how the problems could possibly
>>>> be unique to the handling of file names/paths? Does it
>>>> suggest anything else to look into for getting some more
>>>> potentially useful evidence?
>>> Well, all I can do is describe the most common TSO related
>>> failure:
>>> - When a read RPC reply (including NFS/RPC/TCP/IP headers)
>>> is slightly less than 64K bytes (many TSO implementations are
>>> limited to 64K or 32 discontiguous segments, think 32 2K
>>> mbuf clusters), the driver decides it is ok, but when the MAC
>>> header is added it exceeds what the hardware can handle correctly...
>>> --> This will happen when reading a regular file that is slightly less
>>> than a multiple of 64K in size.
>>> or
>>> --> This will happen when reading just about any large directory,
>>> since the directory reply for a 64K request is converted to Sun XDR
>>> format and clipped at the last full directory entry that will fit within
>>> 64K.
>>> For ports, where most files are small, I think you can tell which is more
>>> likely to happen.
>>> --> If TSO is disabled, I have no idea how this might matter, but??
>>> 
>>>> I'll note that netstat -I ue0 -d and netstat -I genet0 -d
>>>> do not report changes in Ierrs or Idrop in a before vs.
>>>> after failures comparison. (There may be better figures
>>>> to look at for all I know.)
>>>> 
>>>> I tried "ifconfig genet0 -rxcsum -txcsum -rxcsum6 -txcsum6"
>>>> and got no obvious change in behavior.
>>> All we know is that the data is getting corrupted somehow.
>>> 
>>> NFS traffic looks very different than typical TCP traffic. It is
>>> mostly small messages travelling in both directions concurrently,
>>> with some large messages thrown in the mix.
>>> All I'm saying is that testing a net interface with something like
>>> bulk data transfer in one direction doesn't verify it works for NFS
>>> traffic.
>>> 
>>> Also, the large RPC messages are a chain of about 33 mbufs of
>>> various lengths, including a mix of partial clusters and regular
>>> data mbufs, whereas a bulk send on a socket will typically
>>> result in an mbuf chain of a lot of full 2K clusters.
>>> --> As such, NFS can be good at tickling subtle bugs in the
>>> net driver related to mbuf handling.
>>> 
>>> rick
>>> 
>>>>> W.r.t. reverting r367492...the patch to replace r367492 was just
>>>>> committed to "main" by rscheff@ with a two week MFC, so it
>>>>> should be in stable/13 soon. Not sure if an errata can be done
>>>>> for it for releng13.0?
>>>> 
>>>> That update is reported to be causing "rack" related panics:
>>>> 
>>>> https://lists.freebsd.org/pipermail/dev-commits-src-main/2021-May/004440.html
>>>> 
>>>> reports (via links):
>>>> 
>>>> panic: _mtx_lock_sleep: recursed on non-recursive mutex so_snd @
>>>> /syzkaller/managers/i386/kernel/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:10632
>>>> 
>>>> Still, I have a non-debug update to main building and will
>>>> likely do a debug build as well. llvm is rebuilding, so
>>>> the builds will take a notable time.
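A side note on Rick's TSO scenario above: it can be ruled in or out from
userland before touching the driver. A minimal sketch, assuming genet0
advertises TSO at all (the -tso/-txcsum/-rxcsum flags are the stock
ifconfig(8) ones; whether this driver honors them is an assumption).
First see which offload features the driver claims:

# ifconfig genet0 | grep -i options

then turn off TSO (and, for completeness, checksum offload in both
directions) and re-run the reproducer:

# ifconfig genet0 -tso -txcsum -rxcsum -txcsum6 -rxcsum6
# mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
# diff -r /usr/ports/ /mnt/ | more

If the name corruption stops with TSO off, that points at the driver's
handling of near-64K mbuf chains rather than at NFS itself.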
>> 
>> I got the following built and installed on the two
>> machines:
>> 
>> # uname -apKU
>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
>> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
>> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
>> arm64 aarch64 1400013 1400013
>> 
>> # uname -apKU
>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
>> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
>> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
>> arm64 aarch64 1400013 1400013
>> 
>> Note that both are booted with debug builds of main.
>> 
>> Using the context with the alternate Ethernet device that has not
>> had an associated diff -r, find, or ls -R failure yet, I
>> nonetheless got a panic that looks likely to be unrelated:
>> 
>> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
>> # diff -r /usr/ports/ /mnt/ | more
>> nvme0: cpl does not map to outstanding cmd
>> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
>> panic: received completion for unknown cmd
>> cpuid = 3
>> time = 1621743752
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>> vpanic() at vpanic+0x188
>> panic() at panic+0x44
>> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
>> nvme_timeout() at nvme_timeout+0x3c
>> softclock_call_cc() at softclock_call_cc+0x124
>> softclock() at softclock+0x60
>> ithread_loop() at ithread_loop+0x2a8
>> fork_exit() at fork_exit+0x74
>> fork_trampoline() at fork_trampoline+0x14
>> KDB: enter: panic
>> [ thread pid 12 tid 100028 ]
>> Stopped at kdb_enter+0x48: undefined f904411f
>> db>
>> 
>> Based on the "nvme" references, I expect this is tied to
>> handling the Optane 480 GiByte that is in the PCIe slot
>> and is the boot/only media for the machine doing the diff.
>> 
>> "db> dump" seems to have worked.
>> 
>> After reboot, zpool scrub found no errors.
>> 
>> So, trying again . . .
>> 
>> I got some "Expensive timeout(9) function" notices:
>> 
>> Expensive timeout(9) function: 0xffff000000717b64(0) 1.210285924 s
>> Expensive timeout(9) function: 0xffff000000717b64(0) 4.001010935 s
>> 
>> 0xffff000000717b64 looks to be uma_timeout:
>> 
>> ffff000000717b60 <uma_startup3+0x118> b ffff000000717b3c <uma_startup3+0xf4>
>> ffff000000717b64 <uma_timeout> stp x29, x30, [sp, #-32]!
>> ffff000000717b68 <uma_timeout+0x4> stp x20, x19, [sp, #16]
>> . . .
>> 
>> . . . Hmm. The debug kernel test context seems to take a
>> very long time. It has not failed so far but is still
>> going.
>> 
>> So I stopped it and switched to testing with the genet0 device
>> that was involved in the earlier failures. . . .
>> 
>> It did not fail. Nor did the debug kernel report anything
>> beyond:
>> 
>> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
>> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
>> Expensive timeout(9) function: 0xffff00000050c088(0) 6.318652023 s
>> 
>> on one machine and:
>> 
>> if_delmulti_locked: detaching ifnet instance 0xffffa0000b56b800
>> 
>> on the other.
>> 
>> So I may reboot into the also-updated non-debug builds on both
>> machines and try in that context.
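An aside on the 0xffff000000717b64 lookup above: such an address can be
mapped back to a symbol without reading raw disassembly. A sketch,
assuming the booted kernel is the one at /boot/kernel/kernel and, for
file/line output, that its matching debug file is installed under
/usr/lib/debug:

# nm /boot/kernel/kernel | grep ffff000000717b64
# addr2line -f -e /usr/lib/debug/boot/kernel/kernel.debug 0xffff000000717b64

The first prints the symbol if one starts exactly at that address
(uma_timeout here); the second also reports the file and line when
debug info is present.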
>> 
> 
> The non-debug build pair of machines got the problem:
> 
> # diff -r /usr/ports/ /mnt/ | more
> Only in /mnt/devel/electron12/files:
> Only in /usr/ports/devel/electron12/files:
> patch-chrome_browser_media_webrtc_webrtc__logging__controller.cc
> Only in /usr/ports/devel/electron12/files:
> patch-components_previews_core_previews__features.cc
> Only in /mnt/devel/electron12/files: <A0><CE><C8>֏<DC>ܦ2<B2><E2><AA>^H
> Only in /mnt/www/chromium/files: patch-chrome_browser_chrome__browser
> Only in /usr/ports/www/chromium/files:
> patch-chrome_browser_chrome__browser__main__posix.cc
> 
> I'll note that it turns out that the debug build had more
> enabled than is typical: DIAGNOSTIC, BUF_TRACKING, and
> FULL_BUF_TRACKING were also enabled. I'd forgotten that
> I'd previously had a reason to add those to what my debug
> builds included (for a prior problem investigation). I'd
> not done debug builds in some time.
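For concreteness, those three knobs as they would appear in a kernel
config file (the option names are stock FreeBSD kernel options; their
presence in my GENERIC-DBG-CA72 variant was a leftover from that
earlier investigation):

options         DIAGNOSTIC
options         BUF_TRACKING
options         FULL_BUF_TRACKING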
Without DIAGNOSTIC, BUF_TRACKING, and FULL_BUF_TRACKING (so based
on a more normal debug build on both sides), the diff -r progressed
at a more normal, sustained rate. Yet . . .

# mount -onoatime 192.168.1.170:/usr/ports/ /mnt/
# diff -r /usr/ports/ /mnt/ | more
# 

In other words: no failure from the debug build. Also no reports
of anything by the debug kernel.

Multiple attempts (including some with reboots between): same
results.

So, it appears that only non-debug builds are broken, for whatever
reason.

For reference:

# uname -apKU
FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #2
main-n246854-03b0505b8fe8-dirty: Sun May 23 05:57:01 PDT 2021
root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
arm64 aarch64 1400013 1400013

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)