Re: Troubles building world on stable/13 [the little bit of evidence about the compiler failures: a jemalloc-tie/ASLR-tie?]
0 00 00 .@.U
0xaa10: 01 00 00 00 00 00 00 00 c0 0d b1 55 00 00 00 00 ...U
0xaa20: 00 00 00 00 00 00 00 00 e2 40 b2 55 00 00 00 00 .@.U
0xaa30: 01 00 00 00 00 00 00 00 c0 0d b1 55 00 00 00 00 ...U
0xaa40: 00 00 00 00 00 00 00 00 2a 41 b2 55 00 00 00 00 *A.U
0xaa50: 01 00 00 00 00 00 00 00 c0 0d b1 55 00 00 00 00 ...U
0xaa60: 00 00 00 00 00 00 00 00 72 41 b2 55 00 00 00 00 rA.U
0xaa70: 01 00 00 00 00 00 00 00 c0 0d b1 55 00 00 00 00 ...U
0xaa80: 00 00 00 00 00 00 00 00 ba 41 b2 55 00 00 00 00 .A.U

When the 0x05's show up they are instead of the 0x01's, just after the ": ". After that the pattern is different. But quickly something looks like another fp/lr pair in memory, and that, in turn, references another:

0xaa90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xaaa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xaab0: 00 00 00 00 00 00 00 00 44 c4 95 07 0e 02 46 57 D.FW
0xaac0: 10 ab ff ff ff ff 00 00 8c c6 aa 02 00 00 00 00
. . .
0xab10: 90 ac ff ff ff ff 00 00 e0 18 ab 02 00 00 00 00
. . .

But after that the following does not seem to fit the pattern:

0xac90: 00 ac ff ff ff ff 00 00 44 c4 95 07 0e 02 46 57 D.FW

and:

0xac00: 01 00 00 00 00 00 00 00 18 ae ff ff ff ff 00 00

The a5 sequences make me wonder if jemalloc assigned a memory allocation to stack space or was told to handle a stack address as if it was an assigned address for some aspects of an allocation (if that can even be requested). I wonder if there is any chance of ASLR being involved, with the stack and memory allocation possibly overlapping. But I've really no clue.

I've given up on trying to isolate what is going on for the compiler failures. I've only been able to see after the failure, not just before: debugger interactions with the compiler process at times close to the failure point in the code prevent the failure. I've not found any alternative that avoids such. This is on top of the issue that the plain runs (no debugger) vary in behavior, sometimes running to completion, sometimes stopping at similar but varying places in the source code being processed. There is still no known way to get a full reproduction of failure details each time. (Which instance of the example type of source code being compiled at the point of failure does vary.)

For reference: I've been using .sh/.cpp pairs that Bob published and a copy of the c++ from his system to investigate. The .cpp is large. Bob's RPi3* is a RAM+SWAP context of 1 GiByte + 2 GiByte and I made such a context on a RPi3* as well. But I ran his stable/13 c++ on a system with a non-debug main [so: 14] kernel and either a main world or a stable/13 chroot. From the chroot:

# uname -apKU
FreeBSD Rock64_RPi_4_3_2v1p2 14.0-CURRENT FreeBSD 14.0-CURRENT #28 main-n252475-e76c0108990b-dirty: Sat Jan 15 23:39:27 PST 2022 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA53-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA53 arm64 aarch64 1400047 1300524

# freebsd-version -ru
14.0-CURRENT
13.0-STABLE

# ~/fbsd-based-on-what-commit.sh -C /usr/13S-src/
branch: stable/13
merge-base: a5f69859956049b5153b0e1b67f8f4a99622dc6f
merge-base: CommitDate: 2022-01-15 12:55:32 +
a5f698599560 (HEAD -> stable/13, freebsd/stable/13) Ignore debugger-injected signals left after detaching

Bob's recent stable/13 context (kernel too) is more recent than mine. So the problem has been observed over a range of contexts. But, as I said, I've given up on finding a way to isolate whatever is going on.

=== Mark Millard marklmi at yahoo.com
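For anyone wanting to poke at the ASLR or jemalloc guesses above, the sort of checks I have in mind are roughly the following (example commands only; verify the knob names against the installed system, and MALLOC_CONF=junk assumes the libc jemalloc was built with fill support):

# show the 64-bit ASLR sysctl settings
sysctl kern.elf64.aslr
# run one compile with ASLR disabled for just that process
proccontrol -m aslr -s disable c++ ... (rest of the failing command line)
# run one compile with jemalloc junk filling enabled, so allocator-written
# 0xa5 (allocated) / 0x5a (freed) patterns stand out in later dumps
env MALLOC_CONF=junk:true c++ ... (rest of the failing command line)

If the failures stop with ASLR disabled for the process, that would at least narrow down where to look.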
Re: git: 4a864f624a70 - main - vm_pageout: Print a more accurate message to the console before an OOM kill [MFC in time for 13.1?]
On 2022-Jan-15, at 07:55, Mark Johnston wrote: > On Fri, Jan 14, 2022 at 09:38:56PM -0800, Mark Millard wrote: >> Thanks. This will allow me to remove part of my personal additions >> in this area --and my having to explain the misnomer when trying >> to help someone analyze why they end up with OOM activity so they >> can figure out what to do about it. >> >> There seem to be two separate sources of VM_OOM_SWAPZ. Showing >> my personal additions for them (just making them explicit in the >> sequence of messages generated): >> >> diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c >> index 01cf9233329f..280621ca51be 100644 >> --- a/sys/vm/swap_pager.c >> +++ b/sys/vm/swap_pager.c >> @@ -2091,6 +2091,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t >> pindex, daddr_t swapblk) >>0, 1)) >>printf("swap blk zone exhausted, " >>"increase kern.maxswzone\n"); >> + printf("swp_pager_meta_build: swap blk uma >> zone exhausted\n"); >>vm_pageout_oom(VM_OOM_SWAPZ); >>pause("swzonxb", 10); >>} else >> @@ -2121,6 +2122,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t >> pindex, daddr_t swapblk) >>0, 1)) >>printf("swap pctrie zone exhausted, " >>"increase kern.maxswzone\n"); >> + printf("swp_pager_meta_build: swap pctrie >> uma zone exhausted\n"); >>vm_pageout_oom(VM_OOM_SWAPZ); >>pause("swzonxp", 10); >>} else >> >> Care to comment on the distinctions and why there are two >> contexts classified as "out of swap space"? Would either >> one show the swap space as (nearly?) all used in, say, top? >> Or might one of them still end up looking like a misnomer >> from just a top (or whatever) display? > > Hmm, those cases should likely be changed from "out of swap space" to > "failed to allocate swap metadata" or something like that. The above does not seem to have happened yet in main [so: 14]. Will 13.1 get an MFC of 4a864f624a70 in time, possibly with the above change also in place to fully avoid misnomer reporting that misleads folks? 4a864f624a70 listed: MFC after: 2 weeks but it has been more than a month. > . . . > === Mark Millard marklmi at yahoo.com
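A side note for anyone trying to tell the two situations apart on a live system: the VM_OOM_SWAPZ paths above fire when the swap metadata UMA zones are exhausted, not necessarily when the swap devices are full. A rough check (assuming the zone names are still "swblk" and "swpctrie", as created in swap_pager.c):

# actual swap device usage
swapinfo -h
# swap metadata zone usage and limits
vmstat -z | egrep "ITEM|swblk|swpctrie"

If swapinfo shows plenty of free swap while those zones are at their limits, the "out of swap space" wording is the misnomer being discussed.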
Re: git: 4a864f624a70 - main - vm_pageout: Print a more accurate message to the console before an OOM kill [MFC in time for 13.1?]
On 2022-Feb-26, at 17:10, Mark Millard wrote: > On 2022-Jan-15, at 07:55, Mark Johnston wrote: > >> On Fri, Jan 14, 2022 at 09:38:56PM -0800, Mark Millard wrote: >>> Thanks. This will allow me to remove part of my personal additions >>> in this area --and my having to explain the misnomer when trying >>> to help someone analyze why they end up with OOM activity so they >>> can figure out what to do about it. >>> >>> There seem to be two separate sources of VM_OOM_SWAPZ. Showing >>> my personal additions for them (just making them explicit in the >>> sequence of messages generated): >>> >>> diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c >>> index 01cf9233329f..280621ca51be 100644 >>> --- a/sys/vm/swap_pager.c >>> +++ b/sys/vm/swap_pager.c >>> @@ -2091,6 +2091,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t >>> pindex, daddr_t swapblk) >>> 0, 1)) >>> printf("swap blk zone exhausted, " >>> "increase kern.maxswzone\n"); >>> + printf("swp_pager_meta_build: swap blk uma >>> zone exhausted\n"); >>> vm_pageout_oom(VM_OOM_SWAPZ); >>> pause("swzonxb", 10); >>> } else >>> @@ -2121,6 +2122,7 @@ swp_pager_meta_build(vm_object_t object, vm_pindex_t >>> pindex, daddr_t swapblk) >>> 0, 1)) >>> printf("swap pctrie zone exhausted, " >>> "increase kern.maxswzone\n"); >>> + printf("swp_pager_meta_build: swap pctrie >>> uma zone exhausted\n"); >>> vm_pageout_oom(VM_OOM_SWAPZ); >>> pause("swzonxp", 10); >>> } else >>> >>> Care to comment on the distinctions and why there are two >>> contexts classified as "out of swap space"? Would either >>> one show the swap space as (nearly?) all used in, say, top? >>> Or might one of them still end up looking like a misnomer >>> from just a top (or whatever) display? >> >> Hmm, those cases should likely be changed from "out of swap space" to >> "failed to allocate swap metadata" or something like that. > > The above does not seem to have happened yet in main [so: 14]. > > Will 13.1 get an MFC of 4a864f624a70 in time, possibly with the > above change also in place to fully avoid misnomer reporting > that misleads folks? > > 4a864f624a70 listed: > > MFC after:2 weeks > > but it has been more than a month. > >> . . . >> > Thanks for the stable/13 MFC as 13ba1d283676. It provides a big improvement over the prior messaging for the OOM kills. For reference, I do still view: + case VM_OOM_SWAPZ: + reason = "out of swap space"; + break; as using a confusing misnomer ("swap space") for its message. But, so far as I know, VM_OOM_SWAPZ is a rarity and possibly very difficult to produce. If so, any confusions from the message should also be rare. === Mark Millard marklmi at yahoo.com
Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
On 2022-Mar-7, at 08:45, Mark Johnston wrote: > On Mon, Mar 07, 2022 at 04:25:22PM +, Andrew Turner wrote: >> >>> On 7 Mar 2022, at 15:13, Mark Johnston wrote: >>> ... >>> A (the?) problem is that the compiler is treating "pc" as an alias >>> for x18, but the rmlock code assumes that the pcpu pointer is loaded >>> once, as it dereferences "pc" outside of the critical section. On >>> arm64, if a context switch occurs between the store at _rm_rlock+144 and >>> the load at +152, and the thread is migrated to another CPU, then we'll >>> end up using the wrong CPU ID in the rm->rm_writecpus test. >>> >>> I suspect the problem is unique to arm64 as its get_pcpu() >>> implementation is different from the others in that it doesn't use >>> volatile-qualified inline assembly. This has been the case since >>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762 >>> >>> <https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762> >>> . >>> >>> I haven't been able to reproduce any crashes running poudriere in an >>> arm64 AWS instance, though. Could you please try the patch below and >>> confirm whether it fixes your panics? I verified that the apparent >>> problem described above is gone with the patch. >> >> Alternatively (or additionally) we could do something like the following. >> There are only a few MI users of get_pcpu with the main place being in rm >> locks. >> >> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h >> index 09f6361c651c..59b890e5c2ea 100644 >> --- a/sys/arm64/include/pcpu.h >> +++ b/sys/arm64/include/pcpu.h >> @@ -58,7 +58,14 @@ struct pcpu; >> >> register struct pcpu *pcpup __asm ("x18"); >> >> -#defineget_pcpu() pcpup >> +static inline struct pcpu * >> +get_pcpu(void) >> +{ >> + struct pcpu *pcpu; >> + >> + __asm __volatile("mov %0, x18" : "=&r"(pcpu)); >> + return (pcpu); >> +} >> >> static inline struct thread * >> get_curthread(void) > > Indeed, I think this is probably the best solution. Is this just partially reverting: https://cgit.freebsd.org/src/commit/?id=63c858a04d56 If so, there might need to be comments about why the updated code is as it will be. Looks like stable/13 picked up sensitivity to the get_pcpu details in rmlock in: https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5 (a 2022-03-04 commit) and stable/13 also has the get_pcpu misdefinition in: https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56 . So an MFC would be appropriate in order for aarch64 to be reliable for any variations in get_pcpu in stable/13 (and for 13.1 to be so as well). === Mark Millard marklmi at yahoo.com
Re: https://ci.freebsd.org/job/FreeBSD-main-amd64-gcc9_build broken again after openzfs merge: multiple definitions building --- all_subdir_rescue ---
On 2022-Mar-18, at 12:32, Mark Millard wrote: > Looks like . . . > > /workspace/src/sys/contrib/openzfs/module/zstd/lib/common/error_private.h > and: > /workspace/src/sys/contrib/zstd/lib/common/error_private.h > > are both used in building in: > > /tmp/obj/workspace/src/amd64.amd64/rescue/rescue > > and each is providing various definitions that the other also does: > > multiple definition of `ZSTD_versionNumber' > multiple definition of `ZSTD_versionString'; > multiple definition of `ZSTD_isError'; > multiple definition of `ZSTD_getErrorName'; > multiple definition of `ZSTD_getErrorCode'; > multiple definition of `ZSTD_getErrorString'; > > Looks like this goes back to: > > Build #3075 (Mar 8, 2022 9:33:24 PM) > [c03c5b1c8091: "zfs: merge openzfs/zfs@a86e08941 (master) into main"] > > after Build #3074 (Mar 8, 2022 6:16:32 PM) had built fine. > FYI: I tried to build 13.1-BETA2 with a gcc9 xtoolchain and got: --- all_subdir_stand/efi/gptboot --- . . . /local/bin/x86_64-unknown-freebsd13.0-ld: gptboot.sym.full: error: PHDR segment not covered by LOAD segment collect2: error: ld returned 1 exit status So I tried continuing using WITHOUT_BOOT= and the next stopping points were: --- all_subdir_cxgbe --- /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h:45:2: error: #error "Compiler does not support __builtin_add_overflow" 45 | #error "Compiler does not support __builtin_add_overflow" | ^ /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h:62:2: error: #error "Compiler does not support __builtin_mul_overflow" 62 | #error "Compiler does not support __builtin_mul_overflow" | ^ . . . --- all_subdir_cxgbe/iw_cxgbe --- In file included from /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/slab.h:42, from /usr/13_1R-src/sys/dev/cxgbe/iw_cxgbe/ev.c:40: /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h:45:2: error: #error "Compiler does not support __builtin_add_overflow" 45 | #error "Compiler does not support __builtin_add_overflow" | ^ . . . --- device.o --- from /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/sched.h:41, from /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/kernel.h:50, from /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/kobject.h:36, from /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/module.h:43, from /usr/13_1R-src/sys/dev/cxgbe/iw_cxgbe/device.c:41: /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h: At top level: /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h:45:2: error: #error "Compiler does not support __builtin_add_overflow" 45 | #error "Compiler does not support __builtin_add_overflow" | ^ /usr/13_1R-src/sys/compat/linuxkpi/common/include/linux/overflow.h:62:2: error: #error "Compiler does not support __builtin_mul_overflow" 62 | #error "Compiler does not support __builtin_mul_overflow" | ^ With that I stopped the experiments. === Mark Millard marklmi at yahoo.com
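For anyone wanting to reproduce this sort of gcc9-xtoolchain build, the usual form of the invocation is roughly the following (a sketch with assumptions: devel/freebsd-gcc9 with the amd64 flavor is installed, registering the amd64-gcc9 cross toolchain, and the source tree is /usr/13_1R-src as in the paths above):

cd /usr/13_1R-src
make -j8 CROSS_TOOLCHAIN=amd64-gcc9 buildworld
# to get past the stand/efi/gptboot failure, add the knob that skips
# the boot loader bits:
make -j8 CROSS_TOOLCHAIN=amd64-gcc9 -DWITHOUT_BOOT buildworld

These spellings are assumptions based on the usual xtoolchain setup, not the exact command lines used here.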
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
> From: Mark Johnston > Date: Sat, 19 Mar 2022 12:42:59 -0400 > On Sat, Mar 19, 2022 at 12:40:45PM +0100, Thomas Zander wrote: > > On Sat, 19 Mar 2022 at 12:11, Rene Ladan wrote: > > > > > > On Sat, Mar 19, 2022 at 11:04:58AM +0100, Thomas Zander wrote: > > > > On Sat, 19 Mar 2022 at 09:00, Matthias Fechner > > > > wrote: > > > > > > > > > I can confirm now, the problem is definitely related to the -p8 > > > > > update. > > > > > I rolled back now to -p7 using `freebsd-update rollback`. > > > > > [...] > > > > > System is now up and running again. > > > > > This all works even if poudriere jail is using -p8. No need to > > > > > downgrade the jail/base version poudriere is using. > > > > > It is caused by the kernel so the ZFS patch seems to be broken and > > > > > -p8 should maybe not rolled out to not break more systems of users. > > > > > > > > On top of "stop rollout", there is the question how to identify the > > > > broken files for the users who have already upgraded to -p8. A `zpool > > > > scrub` presumably won't help. > > > > > > I think it also applies to 13.1-BETA2 ? > > > > > > Should we involve/CC some src committers? > > > > I have just rolled back to -p7 and run a number of test builds in > > poudriere (the jails still have the -p8 user land). I see the same as > > Matthias and Christoph, the rollback to the -p7 kernel/zfs resolved > > the build problems, there are no NUL byte files generated anymore. > > Adding markj_at_ to the discussion. Mark, the TLDR so far: > > - One of the zfs patches in -p8 seems to cause erroneous writes. > > - We noticed because of many build failures with poudriere (presumably > > highly io-loaded during build). > > - Symptom: Production of files with large runs of NUL-bytes. > > I've had zero luck reproducing this locally. I built several hundred > ports, including textproc/py-pystemmer mentioned elsewhere in the > thread, without any failures or instances of zero-filled files. Another > member of secteam also hasn't been able to trigger any build failures on > -p8. Any hints on a reproducer would be useful. > > We can simply push a -p9 which reverts EN-22:10 and :11, but of course > it would be preferable to precisely identify the problem. Anything about the types of hardware involved that is different for those getting the problem vs. those that do not get the problem? May be it would be appropriate for folks getting the problem to detail their hardware configurations, including storage hardware. === Mark Millard marklmi at yahoo.com
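In case it helps with such reports: the sort of detail I have in mind can be gathered with, for example (all base-system commands; adjust for the pool and controllers involved):

# CPU/RAM basics
sysctl hw.model hw.ncpu hw.physmem
# disks and controllers as probed
geom disk list
camcontrol devlist
nvmecontrol devlist
# pool layout and error counters
zpool status -v

Knowing whether the affected systems share, say, a particular controller or drive family might help narrow things down.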
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
On 2022-Mar-19, at 11:07, Thomas Zander wrote: > On Sat, 19 Mar 2022 at 18:32, Mark Millard wrote: >> May be report to Mark J. how to run the same test builds >> that failed for -p8 but worked for -p7? > > Sure, good point. > A build that reliably causes broken packages on p8 but not on p7 for > me is running: > > poudriere testport -o multimedia/mplayer -j <13.0-amd64-jail here> > > This caused the broken png and python packages when they were built as > dependencies. > In poudriere.conf I set this: > DISTFILES_CACHE=/vcache/distfiles > CCACHE_DIR=/vcache/ccache > ALLOW_MAKE_JOBS=yes > > The ALLOW_MAKE_JOBS should increase the number of parallel IO > operations in-flight on the pool, maybe this increases the likelihood > of triggering the issue? > The DISTFILES_CACHE and CCACHE_DIR are in the same zfs pool as > /poudriere, not sure if this is relevant. > The zfs pool is a single disk, no raid, mirror or anything fancy. On a ThreadRipper 1950X, PCIe Optane storage, 128 GiBytes of RAM, I've used bectl to boot the 13.0_RELEASE-p8 environment and have started: poudriere testport -o multimedia/mplayer -j13_0R-amd64-bulk_a where the jail had nothing built in it at the start. So: [00:00:08] Building 271 packages using up to 32 builders The primary difference is that I've never used ccache and did not try to do so here. The "zfs pool is a single disk, no raid, mirror or anything fancy" is accurate, as is the use of ALLOW_MAKE_JOBS= . That did not take long . . . It proves that ccache is not required. Also some files seem to get only small blocks of zero-bytes, others large ones. But I've not checked for the null characters being at the end instead of earlier in the file. libXcomposite-0.4.5,1.log : --- Xcomposite.lo --- /bin/sh ../libtool --tag=CC--mode=compile cc -DHAVE_CONFIG_H -I. -I.. -I../include -Wall -Wpointer-arith -Wmissing-declarations -Wformat=2 -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wbad-function-cast -Wold-style-definition -Wdeclaration-after-statement -Wunused -Wuninitialized -Wshadow -Wmissing-noreturn -Wmissing-format-attribute -Wredundant-decls -Werror=implicit -Werror=nonnull -Werror=init-self -Werror=main -Werror=missing-braces -Werror=sequence-point -Werror=return-type -Werror=trigraphs -Werror=array-bounds -Werror=write-strings -Werror=address -Werror=int-to-pointer-cast -Werror=pointer-to-int-cast -fno-strict-aliasing -I/usr/local/include -D_THREAD_SAFE -pthread -I/usr/local/include -D_THREAD_SAFE -pthread -pipe -Werror=uninitialized -g -fstack-protector-strong -fno-strict-aliasing -MT Xcomposite.lo -MD -MP -MF .deps/Xcomposite.Tpo -c -o Xcomposite.lo Xcomposite.c libtool: compile: cc -DHAVE_CONFIG_H -I. -I.. 
-I../include -Wall -Wpointer-arith -Wmissing-declarations -Wformat=2 -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wbad-function-cast -Wold-style-definition -Wdeclaration-after-statement -Wunused -Wuninitialized -Wshadow -Wmissing-noreturn -Wmissing-format-attribute -Wredundant-decls -Werror=implicit -Werror=nonnull -Werror=init-self -Werror=main -Werror=missing-braces -Werror=sequence-point -Werror=return-type -Werror=trigraphs -Werror=array-bounds -Werror=write-strings -Werror=address -Werror=int-to-pointer-cast -Werror=pointer-to-int-cast -fno-strict-aliasing -I/usr/local/include -D_THREAD_SAFE -pthread -I/usr/local/include -D_THREAD_SAFE -pthread -pipe -Werror=uninitialized -g -fstack-protector-strong -fno-strict-aliasing -MT Xcomposite.lo -MD -MP -MF .deps/Xcomposite.Tpo -c Xcomposite.c -fPIC -DPIC -o .libs/Xcomposite.o In file included from Xcomposite.c:45: In file included from ./xcompositeint.h:53: In file included from ../include/X11/extensions/Xcomposite.h:49: /usr/local/include/X11/extensions/Xfixes.h:1:1: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:2: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:3: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:4: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:5: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:6: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:7: warning: null character ignored [-Wnull-character] /usr/local/include/X11/extensions/Xfixes.h:1:8: warning: null character ignored [-Wnull-character] . . . (the list is long) . . . libXdamage-1.1.5.log . . . --- Xdamage.lo --- /bin/sh ../libtool --tag=CC--mode=compile cc -DHAVE_CONFIG_H -I. -I.. -I../include/X11/extensions -Wall -Wpointer-arith -Wmissing-declarations -Wformat=2 -Wstrict-prototypes -Wmissing-prototy pes -Wnested-externs -Wbad-function-cast -Wold-style-defini
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
The reports of corrupted files with zero-bytes seemed vaguely familiar, including my own addition to the list. Turns out I'd reported such for main [so: 14] back in 2021-Nov. My context at the time was aarch64. The following has my report sequence, including where I got past the problem vs. before that point. May be the history will help. https://lists.freebsd.org/archives/freebsd-current/2021-November/001052.html I've not had problems with the issue on main since then. === Mark Millard marklmi at yahoo.com
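For anyone wanting to scan a tree for such damage after the fact, a crude check for files containing NUL bytes can be done with just base tools (a sketch; adjust the starting path):

find /usr/local/include -type f -exec sh -c 'tr -d "\000" < "$1" | cmp -s - "$1" || echo "$1"' sh {} \;

Files listed have at least one NUL byte somewhere; the check does not distinguish trailing zero-fill from zero-fill in the middle of a file.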
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
On 2022-Mar-19, at 12:54, Mark Johnston wrote: > On Sat, Mar 19, 2022 at 12:00:20PM -0700, Mark Millard wrote: >> On 2022-Mar-19, at 11:07, Thomas Zander wrote: >> >>> On Sat, 19 Mar 2022 at 18:32, Mark Millard wrote: >>>> May be report to Mark J. how to run the same test builds >>>> that failed for -p8 but worked for -p7? >>> >>> Sure, good point. >>> A build that reliably causes broken packages on p8 but not on p7 for >>> me is running: >>> >>> poudriere testport -o multimedia/mplayer -j <13.0-amd64-jail here> >>> >>> This caused the broken png and python packages when they were built as >>> dependencies. >>> In poudriere.conf I set this: >>> DISTFILES_CACHE=/vcache/distfiles >>> CCACHE_DIR=/vcache/ccache >>> ALLOW_MAKE_JOBS=yes >>> >>> The ALLOW_MAKE_JOBS should increase the number of parallel IO >>> operations in-flight on the pool, maybe this increases the likelihood >>> of triggering the issue? >>> The DISTFILES_CACHE and CCACHE_DIR are in the same zfs pool as >>> /poudriere, not sure if this is relevant. >>> The zfs pool is a single disk, no raid, mirror or anything fancy. >> >> On a ThreadRipper 1950X, PCIe Optane storage, 128 GiBytes of >> RAM, I've used bectl to boot the 13.0_RELEASE-p8 environment >> and have started: >> >> poudriere testport -o multimedia/mplayer -j13_0R-amd64-bulk_a >> >> where the jail had nothing built in it at the start. So: >> >> [00:00:08] Building 271 packages using up to 32 builders >> >> The primary difference is that I've never used ccache and >> did not try to do so here. The "zfs pool is a single disk, >> no raid, mirror or anything fancy" is accurate, as is the >> use of ALLOW_MAKE_JOBS= . >> >> That did not take long . . . >> >> It proves that ccache is not required. Also some files >> seem to get only small blocks of zero-bytes, others >> large ones. But I've not checked for the null characters >> being at the end instead of earlier in the file. > > I still am not able to reproduce it. I think it's indeed a concurrency > problem, and I found a possible culprit. Mark or Thomas, if you're able > to build a new kernel from the releng/13.0 branch and test it, could you > please try this patch? > Sure. (I build ports in a way that allows large load averages relative to the hardware-thread count. I also have a lot of swap configured. I avoid significant use of tmpfs.) > diff --git a/sys/contrib/openzfs/module/zfs/dnode.c > b/sys/contrib/openzfs/module/zfs/dnode.c > index 8592c5f8c3a9..b69ba68ec780 100644 > --- a/sys/contrib/openzfs/module/zfs/dnode.c > +++ b/sys/contrib/openzfs/module/zfs/dnode.c > @@ -1661,7 +1661,7 @@ dnode_is_dirty(dnode_t *dn) > mutex_enter(&dn->dn_mtx); > > for (int i = 0; i < TXG_SIZE; i++) { > - if (list_head(&dn->dn_dirty_records[i]) != NULL) { > + if (multilist_link_active(&dn->dn_dirty_link[i])) { > mutex_exit(&dn->dn_mtx); > return (B_TRUE); > } > Change made. Rebuilt. Reinstalled. Rebooted into the 13_0R-amd64 be area. Bulk build started. Bulk build completed. (Took longer because I let it run to completion.) No explicit reports of null characters. The same 2 ports that failed before, not reporting zero-byte issues, failed again. Likely independent issues: [00:28:28] Failed ports: security/libgcrypt:build print/freetype2:package Overall it skipped something like 54 ports. libgcrypt-1.9.4.log . . . 
--- basic.o --- basic.c:315:16: error: inline assembly requires more registers than available asm volatile("movdqu %[data0], %%xmm0\n" ^ basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available basic.c:315:16: error: inline assembly requires more registers than available --- mpitests --- . . . --- basi
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
On 2022-Mar-19, at 14:24, Mark Millard wrote: > On 2022-Mar-19, at 12:54, Mark Johnston wrote: > >> On Sat, Mar 19, 2022 at 12:00:20PM -0700, Mark Millard wrote: >>> On 2022-Mar-19, at 11:07, Thomas Zander wrote: >>> >>>> On Sat, 19 Mar 2022 at 18:32, Mark Millard wrote: >>>>> May be report to Mark J. how to run the same test builds >>>>> that failed for -p8 but worked for -p7? >>>> >>>> Sure, good point. >>>> A build that reliably causes broken packages on p8 but not on p7 for >>>> me is running: >>>> >>>> poudriere testport -o multimedia/mplayer -j <13.0-amd64-jail here> >>>> >>>> This caused the broken png and python packages when they were built as >>>> dependencies. >>>> In poudriere.conf I set this: >>>> DISTFILES_CACHE=/vcache/distfiles >>>> CCACHE_DIR=/vcache/ccache >>>> ALLOW_MAKE_JOBS=yes >>>> >>>> The ALLOW_MAKE_JOBS should increase the number of parallel IO >>>> operations in-flight on the pool, maybe this increases the likelihood >>>> of triggering the issue? >>>> The DISTFILES_CACHE and CCACHE_DIR are in the same zfs pool as >>>> /poudriere, not sure if this is relevant. >>>> The zfs pool is a single disk, no raid, mirror or anything fancy. >>> >>> On a ThreadRipper 1950X, PCIe Optane storage, 128 GiBytes of >>> RAM, I've used bectl to boot the 13.0_RELEASE-p8 environment >>> and have started: >>> >>> poudriere testport -o multimedia/mplayer -j13_0R-amd64-bulk_a >>> >>> where the jail had nothing built in it at the start. So: >>> >>> [00:00:08] Building 271 packages using up to 32 builders >>> >>> The primary difference is that I've never used ccache and >>> did not try to do so here. The "zfs pool is a single disk, >>> no raid, mirror or anything fancy" is accurate, as is the >>> use of ALLOW_MAKE_JOBS= . >>> >>> That did not take long . . . >>> >>> It proves that ccache is not required. Also some files >>> seem to get only small blocks of zero-bytes, others >>> large ones. But I've not checked for the null characters >>> being at the end instead of earlier in the file. >> >> I still am not able to reproduce it. I think it's indeed a concurrency >> problem, and I found a possible culprit. Mark or Thomas, if you're able >> to build a new kernel from the releng/13.0 branch and test it, could you >> please try this patch? >> > > Sure. (I build ports in a way that allows large load > averages relative to the hardware-thread count. I also > have a lot of swap configured. I avoid significant use > of tmpfs.) > >> diff --git a/sys/contrib/openzfs/module/zfs/dnode.c >> b/sys/contrib/openzfs/module/zfs/dnode.c >> index 8592c5f8c3a9..b69ba68ec780 100644 >> --- a/sys/contrib/openzfs/module/zfs/dnode.c >> +++ b/sys/contrib/openzfs/module/zfs/dnode.c >> @@ -1661,7 +1661,7 @@ dnode_is_dirty(dnode_t *dn) >> mutex_enter(&dn->dn_mtx); >> >> for (int i = 0; i < TXG_SIZE; i++) { >> -if (list_head(&dn->dn_dirty_records[i]) != NULL) { >> +if (multilist_link_active(&dn->dn_dirty_link[i])) { >> mutex_exit(&dn->dn_mtx); >> return (B_TRUE); >> } >> > > Change made. > Rebuilt. > Reinstalled. > Rebooted into the 13_0R-amd64 be area. > Bulk build started. > Bulk build completed. > (Took longer because I let it run to completion.) > > No explicit reports of null characters. The same 2 ports that > failed before, not reporting zero-byte issues, failed again. > Likely independent issues: > > [00:28:28] Failed ports: security/libgcrypt:build print/freetype2:package These are what happens for WITH_DEBUG= style builds. 
Turns out that the *make.conf files from my last bulk -a experiment were still in place and were causing WITH_DEBUG= builds. (Not my normal context.) I'll disable that and rerun the bulk from scratch. > Overall it skipped something like 54 ports. > > libgcrypt-1.9.4.log . . . > > --- basic.o --- > basic.c:315:16: error: inline assembly requires more registers than available > asm volatile("movdqu %[data0], %%xmm0\n" > ^ > basic.c:315:16: error: inline assembly requires more registers than available > basic.c:315:16: error: inline assembly requires more registers than av
Re: git: 43741377b143 - main - security/openssl: Security update to 1.1.1n
On 2022-Mar-19, at 14:34, Mark Millard wrote: > On 2022-Mar-19, at 14:24, Mark Millard wrote: > >> On 2022-Mar-19, at 12:54, Mark Johnston wrote: >> >>> On Sat, Mar 19, 2022 at 12:00:20PM -0700, Mark Millard wrote: >>>> On 2022-Mar-19, at 11:07, Thomas Zander wrote: >>>> >>>>> On Sat, 19 Mar 2022 at 18:32, Mark Millard wrote: >>>>>> May be report to Mark J. how to run the same test builds >>>>>> that failed for -p8 but worked for -p7? >>>>> >>>>> Sure, good point. >>>>> A build that reliably causes broken packages on p8 but not on p7 for >>>>> me is running: >>>>> >>>>> poudriere testport -o multimedia/mplayer -j <13.0-amd64-jail here> >>>>> >>>>> This caused the broken png and python packages when they were built as >>>>> dependencies. >>>>> In poudriere.conf I set this: >>>>> DISTFILES_CACHE=/vcache/distfiles >>>>> CCACHE_DIR=/vcache/ccache >>>>> ALLOW_MAKE_JOBS=yes >>>>> >>>>> The ALLOW_MAKE_JOBS should increase the number of parallel IO >>>>> operations in-flight on the pool, maybe this increases the likelihood >>>>> of triggering the issue? >>>>> The DISTFILES_CACHE and CCACHE_DIR are in the same zfs pool as >>>>> /poudriere, not sure if this is relevant. >>>>> The zfs pool is a single disk, no raid, mirror or anything fancy. >>>> >>>> On a ThreadRipper 1950X, PCIe Optane storage, 128 GiBytes of >>>> RAM, I've used bectl to boot the 13.0_RELEASE-p8 environment >>>> and have started: >>>> >>>> poudriere testport -o multimedia/mplayer -j13_0R-amd64-bulk_a >>>> >>>> where the jail had nothing built in it at the start. So: >>>> >>>> [00:00:08] Building 271 packages using up to 32 builders >>>> >>>> The primary difference is that I've never used ccache and >>>> did not try to do so here. The "zfs pool is a single disk, >>>> no raid, mirror or anything fancy" is accurate, as is the >>>> use of ALLOW_MAKE_JOBS= . >>>> >>>> That did not take long . . . >>>> >>>> It proves that ccache is not required. Also some files >>>> seem to get only small blocks of zero-bytes, others >>>> large ones. But I've not checked for the null characters >>>> being at the end instead of earlier in the file. >>> >>> I still am not able to reproduce it. I think it's indeed a concurrency >>> problem, and I found a possible culprit. Mark or Thomas, if you're able >>> to build a new kernel from the releng/13.0 branch and test it, could you >>> please try this patch? >>> >> >> Sure. (I build ports in a way that allows large load >> averages relative to the hardware-thread count. I also >> have a lot of swap configured. I avoid significant use >> of tmpfs.) >> >>> diff --git a/sys/contrib/openzfs/module/zfs/dnode.c >>> b/sys/contrib/openzfs/module/zfs/dnode.c >>> index 8592c5f8c3a9..b69ba68ec780 100644 >>> --- a/sys/contrib/openzfs/module/zfs/dnode.c >>> +++ b/sys/contrib/openzfs/module/zfs/dnode.c >>> @@ -1661,7 +1661,7 @@ dnode_is_dirty(dnode_t *dn) >>> mutex_enter(&dn->dn_mtx); >>> >>> for (int i = 0; i < TXG_SIZE; i++) { >>> - if (list_head(&dn->dn_dirty_records[i]) != NULL) { >>> + if (multilist_link_active(&dn->dn_dirty_link[i])) { >>> mutex_exit(&dn->dn_mtx); >>> return (B_TRUE); >>> } >>> >> >> Change made. >> Rebuilt. >> Reinstalled. >> Rebooted into the 13_0R-amd64 be area. >> Bulk build started. >> Bulk build completed. >> (Took longer because I let it run to completion.) >> >> No explicit reports of null characters. The same 2 ports that >> failed before, not reporting zero-byte issues, failed again. 
>> Likely independent issues: >> >> [00:28:28] Failed ports: security/libgcrypt:build print/freetype2:package > > These are what happens for WITH_DEBUG= style builds. Turns > out that the *make.conf files from my last bulk -a experiment > were still in place and were causing WITH_DEBUG= builds. (Not > my normal context.) > > I'll disable that and rerun the bulk fr
Re: FreeBSD Errata Notice FreeBSD-EN-22:13.zfs
This went through with no change to releng/13.0 's sys/conf/newvers.sh , so it still has:

BRANCH="RELEASE-p8"

from releng/13.0 c3540b3a2bdf . Similarly, UPDATING still has just:

20220315:
	13.0-RELEASE-p8
	FreeBSD-EN-22:10.zfs
	FreeBSD-EN-22:11.zfs
	FreeBSD-EN-22:12.zfs
	FreeBSD-SA-22:02.wifi
	FreeBSD-SA-22:03.openssl
	. . .

from the same commit. This might make it more difficult for some to verify what status they have for the zfs problem.

=== Mark Millard marklmi at yahoo.com
A possible unintended difference in 13.1-RELEASE vs., for example, 13.1-RELEASE-p3
I downloaded and looked at: FreeBSD-13.1-RELEASE-arm64-aarch64-RPI.img # mdconfig -u md0 -f FreeBSD-13.1-RELEASE-arm64-aarch64-RPI.img # mount -onoatime /dev/md0s2a /mnt # strings /mnt/boot/kern*/kernel | grep 13.1-RELEASE @(#)FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC 13.1-RELEASE Note the: releng/13.1-n250148-fc952ac2212 Looking at the live system after the freebsd-update to -p3 : # strings /boot/kernel/kernel | grep 13.1-RELEASE @(#)FreeBSD 13.1-RELEASE-p3 GENERIC FreeBSD 13.1-RELEASE-p3 GENERIC 13.1-RELEASE-p3 No text analogous to: releng/13.1-n250148-fc952ac2212 I'll note that the actual 13.1-RELEASE-p3 for the binary release build appears to have been a build of at: QUOTE author Mark Johnston2022-11-01 20:54:33 + committer Mark Johnston2022-11-01 20:55:10 + commit c3c13035ef270dcf0d24d2d847dd590edc535ed0 (patch) treef6582d69009a70d8ae8b52e00da4cabe6d159fb7 parent e81b1bd17fb4e83865d60461c2554d90f72cd395 (diff) downloadsrc-c3c13035ef270dcf0d24d2d847dd590edc535ed0.tar.gz src-c3c13035ef270dcf0d24d2d847dd590edc535ed0.zip zfs: Fix an improperly resolved merge conflict releng/13.1 Approved by:so Fixes: 8838c650cb59 ("Fix use-after-free in btree code") Diffstat -rw-r--r-- sys/contrib/openzfs/module/zfs/btree.c 1 1 files changed, 0 insertions, 1 deletions diff --git a/sys/contrib/openzfs/module/zfs/btree.c b/sys/contrib/openzfs/module/zfs/btree.c index 77cb2543e93d..09625bc92f92 100644 --- a/sys/contrib/openzfs/module/zfs/btree.c +++ b/sys/contrib/openzfs/module/zfs/btree.c @@ -1766,7 +1766,6 @@ zfs_btree_remove_idx(zfs_btree_t *tree, zfs_btree_index_t *where) zfs_btree_poison_node_at(tree, keep_hdr, keep_hdr->bth_count); rm_hdr->bth_count = 0; - zfs_btree_node_destroy(tree, rm_hdr); /* Remove the emptied node from the parent. */ zfs_btree_remove_from_node(tree, parent, rm_hdr); zfs_btree_node_destroy(tree, rm_hdr); END QUOTE I'll also note that none of the FreeBSD-EN-22:* notices lists an exact match to what was actually built for the binary update. That would have been true even without the merge conflict fix, in that, without such involved, the final build would normally be based on a "Add UPDATING entries and bump version" type of commit after the last of the FreeBSD-EN-* commits reported. In other words, nothing seems to record and show anything identifying the actual commit used for the binary update. That could also be of interest to folks that want to build by starting with the exact same source code vintage as the binary update did. In this case: https://lists.freebsd.org/archives/freebsd-announce/2022-November/48.html looks like it needs an update because the reference: releng/13.1/8838c650cb59 releng/13.1-n250167 is to before the "zfs: Fix an improperly resolved merge conflict". That update will identify the commit built for the binary update. === Mark Millard marklmi at yahoo.com
Re: A possible unintended difference in 13.1-RELEASE vs., for example, 13.1-RELEASE-p3
On 2022-Nov-4, at 11:37, Paul Mather wrote: > On Nov 3, 2022, at 11:50 PM, Mark Millard wrote: > >> I downloaded and looked at: >> >> FreeBSD-13.1-RELEASE-arm64-aarch64-RPI.img >> >> # mdconfig -u md0 -f FreeBSD-13.1-RELEASE-arm64-aarch64-RPI.img >> # mount -onoatime /dev/md0s2a /mnt >> # strings /mnt/boot/kern*/kernel | grep 13.1-RELEASE >> @(#)FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC >> FreeBSD 13.1-RELEASE releng/13.1-n250148-fc952ac2212 GENERIC >> 13.1-RELEASE >> >> Note the: releng/13.1-n250148-fc952ac2212 >> >> Looking at the live system after the freebsd-update to >> -p3 : >> >> # strings /boot/kernel/kernel | grep 13.1-RELEASE >> @(#)FreeBSD 13.1-RELEASE-p3 GENERIC >> FreeBSD 13.1-RELEASE-p3 GENERIC >> 13.1-RELEASE-p3 >> >> No text analogous to: releng/13.1-n250148-fc952ac2212 > > > I'm just wondering, but could this have anything to reproducible builds? > It's my understanding that setting is standard for -RELEASE branches. Note > this entry in /usr/src/UPDATING: > > = > 20180913: >Reproducible build mode is now on by default, in preparation for >FreeBSD 12.0. This eliminates build metadata such as the user, >host, and time from the kernel (and uname), unless the working tree >corresponds to a modified checkout from a version control system. >The previous behavior can be obtained by setting the /etc/src.conf >knob WITHOUT_REPRODUCIBLE_BUILD. > = > Could be, but text like releng/13.1-n250148-fc952ac2212 is reproducible for use of the same commit to do mulitple builds. Use of different commits across builds should be detectable even for reproducible build style activity, or so I would expect. The releng/13.1-n250148-fc952ac2212 type of text does have an issue if incremental style builds are sometimes used instead of from-scratch builds: It is for/from the kernel build and if the kernel is not built but is left at an older build and, say, only part of world is built, the identification would then be out of date (inaccurate) overall. I was not really trying to claim that the releng/13.1-n250148-fc952ac2212 text I referenced was the best place to have the identification of the exact commit used for binary updates. It is just that right now there is no place and some manual inspection/analysis is required to (hopefully) identify the right commit. (The mismerge-fix being an example of something that would need to be noticed.) Glen, Warner, etc. may well determine that the current status relative to the build that produced the binary update is sufficient overall. I've primarily identified related questions for consideration. === Mark Millard marklmi at yahoo.com
RE: LLVM error when building www/firefox 107.0_2,2
Hiroo Ono (小野寛生) wrote on Date: Sat, 19 Nov 2022 13:10:15 UTC :

> while building www/firefox with poudriere, following error occurred.
> Should I report it to LLVM project as the error message says? Or is
> this just a bug in firefox or the combination of options I chose?
>
> The system is:
> FreeBSD 13.1-STABLE #7 stable/13-af3ccd7b6d: Thu Nov 10 08:00:43 JST 2022
> and the llvm suite is LLVM 13.0.1 from ports.
> . . .
> LLVM ERROR: Type mismatch in constant table!
> . . .

That looks like a report of an internally detected error, something that should never happen and that firefox code or build options should not result in.

As a somewhat confirming note . . .

http://beefy18.nyi.freebsd.org/data/main-amd64-default/pf5ce9b7ee067_s183088934a/logs/firefox-107.0_2,2.log

is a log from a successful build of 107.0_2,2 on/for main [so: 14] --thus using a more recent clang++ vintage as well. Thus, it suggests that "LLVM 13.0.1 from ports" has a problem that firefox's build ran into.

=== Mark Millard marklmi at yahoo.com
RE: Unusual errors on recent stable/13 22-Dec-2022 (a related problem report on freebsd-ports?)
Jonathan Chen wrote on Date: Thu, 22 Dec 2022 19:21:37 UTC :

> I recently updated my package builder machine to the
> stable/13-n253297-fc15d5bf1109of (22-Dec-2022); and appear to be having
> some unusual issues when building with a high number of jobs. My package
> builder uses synth (which uses nullfs on ZFS), and I have had failures
> with missing files, as well as what appears to be sequencing issues with
> Makefiles. If I re-run the build, these errors usually do not reoccur.
>
> I'm puzzled as to what is happening. Is this just happening to me? It
> appears that the ZFS code has been updated recently, and I'm wondering
> whether a regression has crept in. [Or it could just be my hardware?]

https://lists.freebsd.org/archives/freebsd-ports/2022-December/003153.html

indicates a problem with tmpfs use such that using USE_TMPFS=no avoids a problem for poudriere bulk builds on 13.1 amd64. (Unclear if the note is for stable/13 vs. releng/13.1 vs. both.)

I'll not repeat the material here but it might be worth a look.

=== Mark Millard marklmi at yahoo.com
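For poudriere that workaround is just a poudriere.conf knob, for example:

# in /usr/local/etc/poudriere.conf
USE_TMPFS=no

(USE_TMPFS also accepts data, wrkdir, localbase, all, or combinations.) synth keeps its tmpfs settings in its own profile configuration, so the spelling there differs.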
Re: Unusual errors on recent stable/13 22-Dec-2022 (a related problem report on freebsd-ports?)
On Dec 23, 2022, at 10:14, Mark Millard wrote: > Jonathan Chen wrote on > Date: Thu, 22 Dec 2022 19:21:37 UTC : > >> I recently updated my package builder machine to the >> stable/13-n253297-fc15d5bf1109of (22-Dec-2022); and appear to be having >> some unusual issues when building with a high number of jobs. My package >> builder uses synth (which uses nullfs on ZFS), and I have had failures >> with missing files, as well as what appears to be sequencing issues with >> Makefiles. If I re-run the build, these errors usually do not reoccur. >> >> I'm puzzled as to what is happening. Is this just happening to me? It >> appears that the ZFS code has been updated recently, and I'm wondering >> whether a regression has crept in. [Or it could just be my hardware?] > > > https://lists.freebsd.org/archives/freebsd-ports/2022-December/003153.html > > indicates a problem with tmpfs use such that using USE_TMPFS=no > avoids a problem for poudriere bulk builds on 13.1 amd64. > (Unclear if the note is for stable/13 vs. releng/13.1 vs. both.) A note on Discord indicates: stable/13 as a context with the devel/nasm example build problem. > I'll note repeat the material here but it might be worth a look. === Mark Millard marklmi at yahoo.com
Re: Unusual errors on recent stable/13 22-Dec-2022 (a related problem report on freebsd-ports?)
Jonathan Chen wrote on Date: Fri, 23 Dec 2022 18:40:27 UTC : > On 24/12/22 07:14, Mark Millard wrote: > > Jonathan Chen wrote on > > Date: Thu, 22 Dec 2022 19:21:37 UTC : > > > >> I recently updated my package builder machine to the > >> stable/13-n253297-fc15d5bf1109of (22-Dec-2022); and appear to be having > >> some unusual issues when building with a high number of jobs. My package > >> builder uses synth (which uses nullfs on ZFS), and I have had failures > >> with missing files, as well as what appears to be sequencing issues with > >> Makefiles. If I re-run the build, these errors usually do not reoccur. > >> > >> I'm puzzled as to what is happening. Is this just happening to me? It > >> appears that the ZFS code has been updated recently, and I'm wondering > >> whether a regression has crept in. [Or it could just be my hardware?] > > > > > > https://lists.freebsd.org/archives/freebsd-ports/2022-December/003153.html > > > > indicates a problem with tmpfs use such that using USE_TMPFS=no > > avoids a problem for poudriere bulk builds on 13.1 amd64. > > (Unclear if the note is for stable/13 vs. releng/13.1 vs. both.) > > I'll try disabling tmpfs on synth. > FYI . . . The following is about the tmpfs issue referenced in freebsd-ports/2022-December/003153.html . Here is what is going on (manually entered commands, not a script). First under a tmpfs: # df -m Filesystem 1M-blocks Used Avail Capacity Mounted on /dev/ufs/rootfs 221683 97879 106068 48% / devfs 0 0 0 100% /dev /dev/msdosfs/MSDOSBOOT 49 31 18 62% /boot/msdos tmpfs 7716 0 7716 0% /tmp # cd /tmp # : > mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 08:56:53 2022 mmjnk.test # : > mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 08:56:53 2022 mmjnk.test (no time change). The makefile involved is using ": > NAME" notation to try to update timestamps on deliberately empty files. Vs. under (for example) UFS: # cd ~/ # : > mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:00:45 2022 mmjnk.test # : > mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:00:54 2022 mmjnk.test (time changed). Back in tmpfs land . . . Part of this is that the file is already of size zero and continues to be so. By contrast, starting with a file with 15 bytes in it: # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 15 Mar 9 09:07:38 2022 mmjnk.test # : > mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:07:49 2022 mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:07:49 2022 mmjnk.test The lack of a timestamp change when the file already has size zero looks like an example of a bug to me. truncate for tmpfs files behaves similarly (showing just the lack of timestamp change context): # truncate -s 0 mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:11:31 2022 mmjnk.test # truncate -s 0 mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:11:31 2022 mmjnk.test (UFS got a timestamp update from such a sequence.) I'll note that touch does not get this tmpfs behavior: # touch mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:11:26 2022 mmjnk.test # touch mmjnk.test # ls -Tld mmjnk.test -rw-r--r-- 1 root wheel 0 Mar 9 09:11:31 2022 mmjnk.test (But it would not force size zero on its down.) 
I did these tests on:

# uname -apKU
FreeBSD generic 13.1-STABLE FreeBSD 13.1-STABLE #0 stable/13-n253133-b51ee7ac252c: Wed Nov 23 03:36:16 UTC 2022 r...@releng3.nyi.freebsd.org:/usr/obj/usr/src/arm64.aarch64/sys/GENERIC arm64 aarch64 1301509 1301509

However, I previously did a devel/nasm bulk test with USE_TMPFS=all on:

# uname -apKU
FreeBSD amd64_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #55 main-n259064-f83db6441a2f-dirty: Sun Nov 6 16:31:55 PST 2022 root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1400073 1400073

and it got the problem. (I normally use USE_TMPFS=data , which does not get the problem because the files in question end up not on a tmpfs.)

So: not specific to amd64 , not specific to stable/13 , existed in early November in main. This may have been around for some time for tmpfs.

=== Mark Millard marklmi at yahoo.com
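A self-contained form of the above check, for quickly testing a given kernel (a sketch using a throwaway tmpfs mount; run as root, and adjust the mount point to taste):

#!/bin/sh
# demonstrate the missing mtime update when re-truncating an
# already-empty file on tmpfs
mkdir -p /mnt/tmpfs_test
mount -t tmpfs tmpfs /mnt/tmpfs_test
cd /mnt/tmpfs_test
: > mmjnk.test
ls -Tld mmjnk.test
sleep 2
: > mmjnk.test   # on an affected kernel the timestamp does not change
ls -Tld mmjnk.test
cd /
umount /mnt/tmpfs_test
rmdir /mnt/tmpfs_test

On UFS (or using touch on tmpfs) the second timestamp advances; on an affected tmpfs it stays the same.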
RE: ofw_pci: Fix incorrectly sized softc causing pci(4) out-of-bounds reads (Should it have been MFC'd?)
Should the following have been MFC'd? (I ran into this while looking to see why I see a boot message oddity on 13.* that I do not see on main [so: 14]. There was a time when main also produced the odd messages. But I'm not claiming that this is what makes the difference. The oddity was observed on aarch64 RPi4B's.) author Jessica Clarke 2022-01-15 19:03:53 + committer Jessica Clarke 2022-01-15 19:03:53 + commit 4e3a43905e3ff7b9fcf228022f05d636f79c4b42 (patch) tree b6be66e54604bb2c1fbdfde27bf8a6644e04fd05 parent 3266a0c5d5abe8dd14de8478edec3e878e4a1c0b (diff) download src-4e3a43905e3ff7b9fcf228022f05d636f79c4b42.tar.gz src-4e3a43905e3ff7b9fcf228022f05d636f79c4b42.zip ofw_pci: Fix incorrectly sized softc causing pci(4) out-of-bounds reads We do not include sys/rman.h and so machine/resource.h ends up not being included by the time pci_private.h is included. This means PCI_RES_BUS is never defined, and so the sc_bus member of pci_softc is not present when compiling ofw_pci, resulting in the wrong softc size being passed to DEFINE_CLASS_1 and thus any attempts by pci(4) to access that member are out-of-bounds reads or writes. This is pretty fragile; arguably pci_private.h should be including sys/rman.h, but this is the minimal needed change to fix the bug whilst maintaining the status quo. Found by: CHERI Reported by: andrew Diffstat -rw-r--r-- sys/dev/ofw/ofw_pci.c 1 1 files changed, 1 insertions, 0 deletions diff --git a/sys/dev/ofw/ofw_pci.c b/sys/dev/ofw/ofw_pci.c index 7f7aad379ddc..4bd6ccd64420 100644 --- a/sys/dev/ofw/ofw_pci.c +++ b/sys/dev/ofw/ofw_pci.c @@ -33,6 +33,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include (Note: leading whitespace might not be preserved.) === Mark Millard marklmi at yahoo.com
Re: ofw_pci: Fix incorrectly sized softc causing pci(4) out-of-bounds reads (Should it have been MFC'd?)
On Dec 26, 2022, at 19:54, Mark Millard wrote: > Should the following have been MFC'd? (I ran into this while > looking to see why I see a boot message oddity on 13.* that > I do not see on main [so: 14]. There was a time when main > also produced the odd messages. But I'm not claiming that > this is what makes the difference. The oddity was observed > on aarch64 RPi4B's.) > Never mind. I got myself confused over the history. 13.* does not have the file at all. > author Jessica Clarke 2022-01-15 19:03:53 + > committer Jessica Clarke 2022-01-15 19:03:53 + > commit 4e3a43905e3ff7b9fcf228022f05d636f79c4b42 (patch) > tree b6be66e54604bb2c1fbdfde27bf8a6644e04fd05 > parent 3266a0c5d5abe8dd14de8478edec3e878e4a1c0b (diff) > download src-4e3a43905e3ff7b9fcf228022f05d636f79c4b42.tar.gz > src-4e3a43905e3ff7b9fcf228022f05d636f79c4b42.zip > > ofw_pci: Fix incorrectly sized softc causing pci(4) out-of-bounds reads > > We do not include sys/rman.h and so machine/resource.h ends up not being > included by the time pci_private.h is included. This means PCI_RES_BUS is > never defined, and so the sc_bus member of pci_softc is not present when > compiling ofw_pci, resulting in the wrong softc size being passed to > DEFINE_CLASS_1 and thus any attempts by pci(4) to access that member are > out-of-bounds reads or writes. > > This is pretty fragile; arguably pci_private.h should be including > sys/rman.h, but this is the minimal needed change to fix the bug whilst > maintaining the status quo. > > Found by: CHERI > Reported by: andrew > > > Diffstat > -rw-r--r-- sys/dev/ofw/ofw_pci.c 1 > 1 files changed, 1 insertions, 0 deletions > > diff --git a/sys/dev/ofw/ofw_pci.c b/sys/dev/ofw/ofw_pci.c > index 7f7aad379ddc..4bd6ccd64420 100644 > --- a/sys/dev/ofw/ofw_pci.c > +++ b/sys/dev/ofw/ofw_pci.c > @@ -33,6 +33,7 @@ __FBSDID("$FreeBSD$"); > #include > #include > #include > +#include > > #include > #include > > > > > (Note: leading whitespace might not be preserved.) === Mark Millard marklmi at yahoo.com
stable/13 snapshot's /etc/rc.d/machine_id has use of main's startmsg from /etc/rc.subr so it reports 2 "eval: startmsg: not found"
When I booted a new stable/13 snapshot install, the messaging included:

. . .
Updating motd:.
eval: startmsg: not found
eval: startmsg: not found
Clearing /tmp (X related).
. . .

It looks like the "eval: startmsg: not found" lines are from:

# grep -r "\<startmsg\>" /etc/
/etc/rc.d/machine_id: startmsg -n "Creating ${machine_id_file} "
/etc/rc.d/machine_id: startmsg 'done.'

(No more matches found.)

The following was not found on stable/13:

/etc/rc.subr:# startmsg
/etc/rc.subr:startmsg()
/etc/rc.subr: startmsg "Starting ${name}."

=== Mark Millard marklmi at yahoo.com
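Until an MFC of the rc.subr startmsg bits lands, one crude local workaround (a sketch, not the committed fix) is to give /etc/rc.d/machine_id a fallback definition after it sources /etc/rc.subr:

# fallback for an rc.subr that predates startmsg
if ! type startmsg >/dev/null 2>&1; then
        startmsg() { echo "$@"; }
fi

That only quiets the "eval: startmsg: not found" noise; the proper fix is the rc.subr update itself.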
FYI: upcoming 13.2-RELEASE vs. 8 GiByte RPi4B's based on U-Boot 2023.01 recently in use, given UEFI style booting
This is an FYI about 8 GiByte RPi4B coverage by 13.2-RELEASE. (The existing snapshots and such show the issue now --but I'm noting the 13.2-RELEASE consequences for things as they are.)

When sysutils/u-boot-rpi-arm64 and sysutils/u-boot-rpi4 recently changed to be based on U-Boot 2023.01, the U-Boot produced no longer boots 8 GiByte RPi4B's for FreeBSD: U-Boot increased the number of U-Boot "lmb" regions it uses for RPi4B's for UEFI booting --without adjusting the bound imposed on the number that can be in use. The 8 GiByte RPi4B's end up over the limit during part of the activity and U-Boot aborts the UEFI-boot attempt.

The U-Boot message about this is misleading. The middle line of:

Found EFI removable media binary efi/boot/bootaa64.efi
** Reading file would overwrite reserved memory **
Failed to load 'efi/boot/bootaa64.efi'

is actually caused by the rejection of adding another lmb range, having nothing to do with potentially overwriting reserved memory. (The message is generated far from the code that did the rejection and no rejection reason is propagated.)

A FreeBSD bugzilla for this issue is:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=269181

I'd not be surprised if U-Boot 2023.04 has things working by default again. But, until such, either an older U-Boot, such as 2022.10, or a patched 2023.01 U-Boot, would be needed for 8 GiByte RPi4B's to end up being directly bootable by 13.2-RELEASE as-built.

I'm not aware of any other type of FreeBSD aarch64 context broken via the use of 2023.01 U-Boot.

=== Mark Millard marklmi at yahoo.com
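For anyone experimenting with patching 2023.01 rather than dropping back to 2022.10: the bound involved appears to be U-Boot's Kconfig-controlled lmb region count. The option names and value below are assumptions to be verified against the 2023.01 sources (lib/lmb.c and the related Kconfig entries), not a tested recipe:

# candidate tweak in the U-Boot build configuration for the RPi4 target
CONFIG_LMB_USE_MAX_REGIONS=y
CONFIG_LMB_MAX_REGIONS=16

Raising that count is the sort of change the ports' U-Boot patches could carry until an upstream release adjusts the default.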
RE: 13.2 BETA2: how do debug META_MODE?
Peter wrote on Date: Tue, 21 Feb 2023 03:45:12 UTC :

> on /some/ of my nodes, META_MODE seems not being honored anymore:
> I had to build them another time, and the lengthy lib/clang gets
> built all over again (tried two times).
> This is so since 13.2 (BETA2). It did work in 13.1 (RELENG), at least
> according to the timing from the logfiles.
>
> Now I'm trying to figure out the difference, because I have some
> nodes where it appears to more-or-less work (have seen buildworld
> take 5 minutes), and others where it doesn't (take an hour to build).
> The thing is scripted, so it is not so very likely an operator error
> (while not impossible either).
>
> But it seems difficult to figure out details: "make -n" seems to not
> care about META_MODE, while META_MODE suppresses all useful output from
> make. And the docs say there are *.meta files (yes there are), but no
> info about how to verify their content, or how to get make tell what
> it is going to do and why (and the buildworld is not the most easy
> to understand target)...
>
> So, some inspiration would be welcome...

One thing to check on is if filemon.ko is loaded and operational. META_MODE greatly depends on it.

Another thing to know is that the following are very different for what all is built for the "(again #0)" line vs. the other two "again" lines, using buildworld as an example context. Imagine here that the first buildworld rebuilds llvm/clang materials.

# cd /usr/src/
# env WITH_META_MODE=yes make buildworld
# env WITH_META_MODE=yes make installworld
# env WITH_META_MODE=yes make buildworld (again #0)
## no more rebuilds below?
# env WITH_META_MODE=yes make buildworld (again #1)
# env WITH_META_MODE=yes make buildworld (again #2)

Unfortunately, some of the install activity registers as activity that is to cause later rebuild activity: updated file dates. There are also issues of sort of a feedback loop: rm ends up updated (deleted and replaced) by install but rm was also listed as part of the sequence of replacing some other files. Result? The rm removal/replacement ends up meaning the files are to be regenerated, not just recopied. There is a long list of such commands, not just rm.

"again #0" will rebuild llvm/clang. The other two "again"s will not.

See:

https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078488.html

and:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257616 .

=== Mark Millard marklmi at yahoo.com
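As a concrete form of the filemon check (example commands; kldload -n just avoids an error if the module is already loaded):

# kldstat | grep filemon
# kldload -n filemon

Having filemon_load="YES" in /boot/loader.conf on the build nodes avoids depending on anything else loading it.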
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 04:55, Peter wrote: > On Mon, Feb 20, 2023 at 08:44:59PM -0800, Mark Millard wrote: > ! Peter wrote on > ! Date: Tue, 21 Feb 2023 03:45:12 UTC : > ! > ! > on /some/ of my nodes, META_MODE seems not being honored anymore: > ! > I had to build them another time, and the lengthy lib/clang gets > ! > built all over again (tried two times). > ! > This is so since 13.2 (BETA2). It did work in 13.1 (RELENG), at least > ! > according to the timing from the logfiles. > ! > > ! > Now I'm trying to figure out the difference, because I have some > ! > nodes where it appears to more-or-less work (have seen buildworld > ! > take 5 minutes), and others where it doesn't (take an hour to build). > ! > The thing is scripted, so it is not so very likely an operator error > ! > (while not impossible either). > ! > > ! > But it seems difficult to figure out details: "make -n" seems to not > ! > care about META_MODE, while META_MODE suppresses all useful output from > ! > make. And the docs say there are *.meta files (yes there are), but no > ! > info about how to verify their content, or how to get make tell what > ! > it is going to do and why (and the buildworld is not the most easy > ! > to understand target)... > ! > > ! > So, some inspiration would be welcome... > ! > ! On thing to check on is if filemon.ko is loaded and operational. > ! META_MODE greatly depends on it. > > That should be the case - 'kldstat' shows it (and I've seen warnings > where it didn't). > > ! Another thing to know is that the following are very different > ! for what all is built for the "(again #0)" line vs. the other > ! two "again" lines, using buildworld as an example context. > ! Imagine here the the first buildworld rebuilds llvm/clang > ! materials. > ! > ! # cd /usr/src/ > ! # env WITH_META_MODE=yes make buildworld > ! # env WITH_META_MODE=yes make installworld > ! # env WITH_META_MODE=yes make buildworld (again #0) > ! ## no more rebuilds below? > ! # env WITH_META_MODE=yes make buildworld (again #1) > ! # env WITH_META_MODE=yes make buildworld (again #2) > > But what is the difference between #0 and #1?

awk, cp, ln, rm, sed, and many more from . . ./tmp/legacy/usr/sbin/ have new dates for rebuilds after installworld (that targets the running system). Not true for #1 and #2.

The dates on these tools being more recent than the files that they were involved in producing leads to rebuilding those files. That in turn leads to other files being rebuilt.

make with -dM reports the likes of:

file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target...

explicitly as it goes. As I remember, tmp/legacy/usr/sbin/ was always part of the path for what I found.

One still has to trace back to where a rebuild is not due to something rebuilt earlier in the same build. Noting that tmp/legacy/usr/sbin/awk is reported as newer than its target leaves the question of how it ended up being newer: earlier in the same build vs. before the build activity? It too must be traced back to something based on just material from prior to the build in question.

Note that the above make sequence was only intended for showing the dependency, not as instructions for a normal update sequence.

> . . . > > ! See: > ! > ! https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078488.html

This (and later messages in the thread) are about the "awk, cp, ln, rm, sed, and many more" that make with -dM explicitly reports (likely from tmp/legacy/usr/sbin/ ).
If you trust the make date comparisons, they are the easiest way to find out what has "is newer than the target" status that leads to starting a rebuild sequence. (Other dependent things then rebuild based on this rebuild. One still has to trace back to where things start.)

I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk ended up being newer than such a target and, so, caused a rebuild of that target. I was going in the direction that the tool being newer really is unlikely to justify rebuilding the target(s) in question. The other direction, how it got to be newer, is also relevant.

> ! > ! and: > ! > ! https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257616 . > > Thank You, that's exactly the inspiration I was looking for! > Diving back in...
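For reference, one way to capture and summarize that -dM output (a minimal sketch; the log file name is a placeholder and the logged run takes as long as the buildworld itself):

# log a META_MODE buildworld with make's -dM meta/date debugging enabled
cd /usr/src
script /tmp/meta-dM.log env WITH_META_MODE=yes make -dM buildworld
# summarize which files were reported as newer than their targets
grep "is newer than the target" /tmp/meta-dM.log | sort | uniq -c | sort -rn | head

=== Mark Millard marklmi at yahoo.com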
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 11:56, Mark Millard wrote: > On Feb 21, 2023, at 04:55, Peter wrote: > >> On Mon, Feb 20, 2023 at 08:44:59PM -0800, Mark Millard wrote: >> ! Peter wrote on >> ! Date: Tue, 21 Feb 2023 03:45:12 UTC : >> ! >> ! > on /some/ of my nodes, META_MODE seems not being honored anymore: >> ! > I had to build them another time, and the lengthy lib/clang gets >> ! > built all over again (tried two times). >> ! > This is so since 13.2 (BETA2). It did work in 13.1 (RELENG), at least >> ! > according to the timing from the logfiles. >> ! > >> ! > Now I'm trying to figure out the difference, because I have some >> ! > nodes where it appears to more-or-less work (have seen buildworld >> ! > take 5 minutes), and others where it doesn't (take an hour to build). >> ! > The thing is scripted, so it is not so very likely an operator error >> ! > (while not impossible either). >> ! > >> ! > But it seems difficult to figure out details: "make -n" seems to not >> ! > care about META_MODE, while META_MODE suppresses all useful output from >> ! > make. And the docs say there are *.meta files (yes there are), but no >> ! > info about how to verify their content, or how to get make tell what >> ! > it is going to do and why (and the buildworld is not the most easy >> ! > to understand target)... >> ! > >> ! > So, some inspiration would be welcome... >> ! >> ! On thing to check on is if filemon.ko is loaded and operational. >> ! META_MODE greatly depends on it. >> >> That should be the case - 'kldstat' shows it (and I've seen warnings >> where it didn't). >> >> ! Another thing to know is that the following are very different >> ! for what all is built for the "(again #0)" line vs. the other >> ! two "again" lines, using buildworld as an example context. >> ! Imagine here the the first buildworld rebuilds llvm/clang >> ! materials. >> ! >> ! # cd /usr/src/ >> ! # env WITH_META_MODE=yes make buildworld >> ! # env WITH_META_MODE=yes make installworld >> ! # env WITH_META_MODE=yes make buildworld (again #0) >> ! ## no more rebuilds below? >> ! # env WITH_META_MODE=yes make buildworld (again #1) >> ! # env WITH_META_MODE=yes make buildworld (again #2) >> >> But what is the difference between #0 and #1? > > awk, cp, ln, rm, sed, and many more from > . . ./tmp/legacy/usr/sbin/have new dates > for rebuilds after installworld (that targets > the running system). Not true for #1 and #2. > > The dates on these tools being more recent than > the files that they were involved in producing > leads to rebuilding those files. That in turn > leads to other files being rebuilt. > > make with -dM reports the likes of: > > file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... > > explicitly as it goes. As I remember tmp/legacy/usr/sbin/ > was always part of the path for what I found. > > One still has to trace back to were rebuild a rebuild > is not due to something rebuilt in earlier in the same > build. Noting that tmp/legacy/usr/sbin/awk is reported > as newer than its target, leaves the question of how > it ended up being newer: earlier in same build vs. > before build activity? It too must be traced back > to something based on just material from prior to > the build in question. > > Note that the above make sequence was only intended > for showing the dependency, not as instructions for a > normal update sequence. > >> . . . >> >> ! See: >> ! >> ! 
>> https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078488.html > > This (and later messages in the thread) are about the > "awk, cp, ln, rm, sed, and many more" that make with -dM > explicitly reports (likely from tmp/legacy/usr/sbin/ ). > If you trust the make date comparisons, it is the easiest > way to find out what has "is newer than the target" status > that leads to starting a rebuild sequence. (Other dependent > things then rebuild based on this rebuild. One still has > to trace back to where things start.) > > I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk > ended up being newer than such a target and, so, causing a > rebuild of that target. I was going the direction: that > it is newer really is unlikely to justify the rebuild for > the target(s) in question. The other direction about how > it got to be newer is also relevant. > >> ! >> ! and: >> ! >> ! https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=257616 . >> >> Thank You, that's exactly the inspiration I was looking for! >> Diving back in... >

I had forgotten about Simon J. Gerraty's notes in his reply: https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078628.html It is about telling META_MODE to ignore things that would otherwise cause rebuild activity. Had I remembered, I would have also listed it explicitly, not just listing the start of the thread.
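.MAKE.META.IGNORE_PATHS is the knob those notes center on. A minimal sketch of the idea for make.conf or src-env.conf (the object-tree prefix below is an assumption and has to match your actual build tree; a fuller worked example appears later in this thread):

# tell META_MODE not to treat these bootstrap tools' dates as reasons to rebuild
.MAKE.META.IGNORE_PATHS+= /usr/obj/usr/src/amd64.amd64/tmp/legacy/usr/sbin/awk
.MAKE.META.IGNORE_PATHS+= /usr/obj/usr/src/amd64.amd64/tmp/legacy/usr/sbin/sed

=== Mark Millard marklmi at yahoo.com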
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 11:56, Mark Millard wrote: > On Feb 21, 2023, at 04:55, Peter wrote: > >> On Mon, Feb 20, 2023 at 08:44:59PM -0800, Mark Millard wrote: >> ! Peter wrote on >> ! Date: Tue, 21 Feb 2023 03:45:12 UTC : >> ! >> ! > on /some/ of my nodes, META_MODE seems not being honored anymore: >> ! > I had to build them another time, and the lengthy lib/clang gets >> ! > built all over again (tried two times). >> ! > This is so since 13.2 (BETA2). It did work in 13.1 (RELENG), at least >> ! > according to the timing from the logfiles. >> ! > >> ! > Now I'm trying to figure out the difference, because I have some >> ! > nodes where it appears to more-or-less work (have seen buildworld >> ! > take 5 minutes), and others where it doesn't (take an hour to build). >> ! > The thing is scripted, so it is not so very likely an operator error >> ! > (while not impossible either). >> ! > >> ! > But it seems difficult to figure out details: "make -n" seems to not >> ! > care about META_MODE, while META_MODE suppresses all useful output from >> ! > make. And the docs say there are *.meta files (yes there are), but no >> ! > info about how to verify their content, or how to get make tell what >> ! > it is going to do and why (and the buildworld is not the most easy >> ! > to understand target)... >> ! > >> ! > So, some inspiration would be welcome... >> ! >> ! On thing to check on is if filemon.ko is loaded and operational. >> ! META_MODE greatly depends on it. >> >> That should be the case - 'kldstat' shows it (and I've seen warnings >> where it didn't). >> >> ! Another thing to know is that the following are very different >> ! for what all is built for the "(again #0)" line vs. the other >> ! two "again" lines, using buildworld as an example context. >> ! Imagine here the the first buildworld rebuilds llvm/clang >> ! materials. >> ! >> ! # cd /usr/src/ >> ! # env WITH_META_MODE=yes make buildworld >> ! # env WITH_META_MODE=yes make installworld >> ! # env WITH_META_MODE=yes make buildworld (again #0) >> ! ## no more rebuilds below? >> ! # env WITH_META_MODE=yes make buildworld (again #1) >> ! # env WITH_META_MODE=yes make buildworld (again #2) >> >> But what is the difference between #0 and #1? > > awk, cp, ln, rm, sed, and many more from > . . ./tmp/legacy/usr/sbin/have new dates > for rebuilds after installworld (that targets > the running system). Not true for #1 and #2. > > The dates on these tools being more recent than > the files that they were involved in producing > leads to rebuilding those files. That in turn > leads to other files being rebuilt. > > make with -dM reports the likes of: > > file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... > > explicitly as it goes. As I remember tmp/legacy/usr/sbin/ > was always part of the path for what I found. > > One still has to trace back to were rebuild a rebuild > is not due to something rebuilt in earlier in the same > build. Noting that tmp/legacy/usr/sbin/awk is reported > as newer than its target, leaves the question of how > it ended up being newer: earlier in same build vs. > before build activity? It too must be traced back > to something based on just material from prior to > the build in question. > > Note that the above make sequence was only intended > for showing the dependency, not as instructions for a > normal update sequence. > >> . . . >> >> ! See: >> ! >> ! 
>> https://lists.freebsd.org/pipermail/freebsd-current/2021-January/078488.html > > This (and later messages in the thread) are about the > "awk, cp, ln, rm, sed, and many more" that make with -dM > explicitly reports (likely from tmp/legacy/usr/sbin/ ). > If you trust the make date comparisons, it is the easiest > way to find out what has "is newer than the target" status > that leads to starting a rebuild sequence. (Other dependent > things then rebuild based on this rebuild. One still has > to trace back to where things start.) > > I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk > ended up being newer than such a target and, so, causing a > rebuild of that target. I was going the direction: that > it is newer really is unlikely to justify the rebuild for > the target(s) in question. The other direction about how > it got to be newer is also relevant. Using awk as an example, for the (re)build of awk in: /usr/obj/BUILDs/main-amd
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 18:10, Peter wrote: > On Tue, Feb 21, 2023 at 11:56:13AM -0800, Mark Millard wrote: > ! On Feb 21, 2023, at 04:55, Peter wrote: > ! > ! > ! # cd /usr/src/ > ! > ! # env WITH_META_MODE=yes make buildworld > ! > ! # env WITH_META_MODE=yes make installworld > ! > ! # env WITH_META_MODE=yes make buildworld (again #0) > ! > ! ## no more rebuilds below? > ! > ! # env WITH_META_MODE=yes make buildworld (again #1) > ! > ! # env WITH_META_MODE=yes make buildworld (again #2) > ! > > ! > But what is the difference between #0 and #1? > ! > ! awk, cp, ln, rm, sed, and many more from > ! . . ./tmp/legacy/usr/sbin/have new dates > ! for rebuilds after installworld (that targets > ! the running system). Not true for #1 and #2. > ! > ! The dates on these tools being more recent than > ! the files that they were involved in producing > ! leads to rebuilding those files. That in turn > ! leads to other files being rebuilt. > ! > ! make with -dM reports the likes of: > ! > !file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... > ! > ! explicitly as it goes. As I remember tmp/legacy/usr/sbin/ > ! was always part of the path for what I found. > > Mark, thanks a lot for the proper input at the right time! > > This put me on the right track and I mananged to analyze and > understand what is actually happening. > > It looks like my issue does resolve itself somehow, and things > start to behave as expected again after four builds.

Interesting.

> ! I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk > ! ended up being newer than such a target and, so, causing a > ! rebuild of that target. I was going the direction: that > ! it is newer really is unlikely to justify the rebuild for > ! the target(s) in question. The other direction about how > ! it got to be newer is also relevant. > > I have now analyzed some parts of it. META_MODE typically finds some > build-tools to rebuild, but then if the result is not different > from what was there before, then "install" will not copy it to the > bin-dir, and so the avalanche gets usually avoided. >

The implication is that "install -C" is in use, quoting the man page:

-C      Copy the file. If the target file already exists and the files
        are the same, then do not change the modification time of the
        target. If the target's file flags and mode need not to be
        changed, the target's inode change time is also unchanged.

-c      Copy the file. This is actually the default. The -c option is
        only included for backwards compatibility.

-C might have more of an effect in a reproducible-build style build process than on a non-reproducible-build style one.
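A quick way to see the -C behavior in isolation (a hypothetical two-file demonstration, not part of the build; the file names are placeholders):

# first copy creates dst; the second copy finds identical content
echo example > src
install -C src dst
ls -lT dst
sleep 2
install -C src dst
ls -lT dst   # mtime unchanged because the content matched

=== Mark Millard marklmi at yahoo.com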
Re: [analysis] 13.2 BETA2: how do debug META_MODE?
> 230217231308.admn.pass1.jail.sst:real 9346.90 jail w/ compiler
> 230217231308.admn.pass2.jail.sst:real 5460.16
> 230217231308.data.pass1.jail.sst:real 4094.04 jail w/o compiler
> 230217231308.data.pass2.jail.sst:real 143.39
> 230217231308.iamk.pass1.jail.sst:real 8050.27 jail w/ compiler
> 230217231308.iamk.pass2.jail.sst:real 5226.32
> 230217231308.oper.pass1.jail.sst:real 2910.28 jail w/o compiler
> 230217231308.oper.pass2.jail.sst:real 92.05
> 230217231308.rail.pass1.jail.sst:real 3236.29 jail w/o compiler
> 230217231308.rail.pass2.jail.sst:real 99.49
> 230217231308.tele.pass1.jail.sst:real 3170.34 jail w/o compiler
> 230217231308.tele.pass2.jail.sst:real 180.65
>
> pass3
> (10 vcore)
> 230222000242.base.std.sst: real 1162.80 base w/ kernels
> 230222000242.admn.std.jail.sst:real 1759.15 jail w/ compiler
> 230222000242.data.std.jail.sst:real 155.54 jail w/o compiler
> 230222000242.iamk.std.jail.sst:real 1715.07 jail w/ compiler
> 230222000242.oper.std.jail.sst:real 149.51 jail w/o compiler
> 230222000242.rail.std.jail.sst:real 151.73 jail w/o compiler
> 230222000242.tele.std.jail.sst:real 150.52 jail w/o compiler
>
> pass4
> (10 vcore)
> 230222021535.edge.std.sst: real 1018.79 base w/ kernels
> 230222021535.admn.std.jail.sst:real 101.61 jail w/ compiler
> 230222021535.data.std.jail.sst:real 67.47 jail w/o compiler
> 230222021535.iamk.std.jail.sst:real 100.91 jail w/ compiler
> 230222021535.oper.std.jail.sst:real 66.52 jail w/o compiler
> 230222021535.rail.std.jail.sst:real 68.00 jail w/o compiler
> 230222021535.tele.std.jail.sst:real 66.54 jail w/o compiler

=== Mark Millard marklmi at yahoo.com
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 19:11, Peter wrote: > On Tue, Feb 21, 2023 at 06:44:09PM -0800, Mark Millard wrote: > ! On Feb 21, 2023, at 18:10, Peter wrote: > ! > ! > On Tue, Feb 21, 2023 at 11:56:13AM -0800, Mark Millard wrote: > ! > ! On Feb 21, 2023, at 04:55, Peter wrote: > ! > ! > ! > ! > ! # cd /usr/src/ > ! > ! > ! # env WITH_META_MODE=yes make buildworld > ! > ! > ! # env WITH_META_MODE=yes make installworld > ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #0) > ! > ! > ! ## no more rebuilds below? > ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #1) > ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #2) > ! > ! > > ! > ! > But what is the difference between #0 and #1? > ! > ! > ! > ! awk, cp, ln, rm, sed, and many more from > ! > ! . . ./tmp/legacy/usr/sbin/have new dates > ! > ! for rebuilds after installworld (that targets > ! > ! the running system). Not true for #1 and #2. > ! > ! > ! > ! The dates on these tools being more recent than > ! > ! the files that they were involved in producing > ! > ! leads to rebuilding those files. That in turn > ! > ! leads to other files being rebuilt. > ! > ! > ! > ! make with -dM reports the likes of: > ! > ! > ! > !file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... > ! > ! > ! > ! explicitly as it goes. As I remember tmp/legacy/usr/sbin/ > ! > ! was always part of the path for what I found. > ! > > ! > Mark, thanks a lot for the proper input at the right time! > ! > > ! > This put me on the right track and I mananged to analyze and > ! > understand what is actually happening. > ! > > ! > It looks like my issue does resolve itself somehow, and things > ! > start to behave as expected again after four builds. > ! > ! Intersting. > ! > ! > ! I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk > ! > ! ended up being newer than such a target and, so, causing a > ! > ! rebuild of that target. I was going the direction: that > ! > ! it is newer really is unlikely to justify the rebuild for > ! > ! the target(s) in question. The other direction about how > ! > ! it got to be newer is also relevant. > ! > > ! > I have now analyzed some parts of it. META_MODE typically finds some > ! > build-tools to rebuild, but then if the result is not different > ! > from what was there before, then "install" will not copy it to the > ! > bin-dir, and so the avalanche gets usually avoided. > ! > > ! > ! The implication is that "install -C" is in use, quoting the > ! man page: > ! > ! -C Copy the file. If the target file already exists and the files > ! are the same, then do not change the modification time of the > ! target. If the target's file flags and mode need not to be > ! changed, the target's inode change time is also unchanged. > ! > ! -c Copy the file. This is actually the default. The -c option is > ! only included for backwards compatibility. > ! > ! -C might have more of an effect in a reproducible-build > ! style build process than on a non-reproducible-build > ! style one. > > Yepp. "install -p" is used, see /usr/src/tools/install.sh > The code for the _bootstap_tools_links uses "cp -pf", not install, to establish part of . . ./tmp/legacy/bin/ . (Note: . . ./tmp/legacy/sbin -> ../bin so is a via a symbolic link.) Before the "cp -pf" there is a "rm -f" deleting the target file before the copy: the prior file in . . ./tmp/legacy/bin/ is never directly preserved. (The new copy might still be identical to the old one: the source path one might happen to be identical as well.) 
# Link the tools that we need for building but don't need to bootstrap because # the host version is known to be compatible into ${WORLDTMP}/legacy # We do this before building any of the bootstrap tools in case they depend on # the presence of any of the links (e.g. as m4/lex/awk) ${_bt}-links: .PHONY .for _tool in ${_bootstrap_tools_links} ${_bt}-link-${_tool}: .PHONY @rm -f "${WORLDTMP}/legacy/bin/${_tool}"; \ source_path=`which ${_tool}`; \ if [ ! -e "$${source_path}" ] ; then \ echo "Cannot find host tool '${_tool}'"; false; \ fi; \ cp -pf "$${source_path}" "${WORLDTMP}/legacy/bin/${_tool}" ${_bt}-links: ${_bt}-link-${_tool} .endfor Note: This is for the !defined(BOOTSTRAP_ALL_TOOLS) case. Note: the code uses the abbreviation: _bt=_bootst
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 20:51, Mark Millard wrote: > On Feb 21, 2023, at 19:11, Peter wrote: > >> On Tue, Feb 21, 2023 at 06:44:09PM -0800, Mark Millard wrote: >> ! On Feb 21, 2023, at 18:10, Peter wrote: >> ! >> ! > On Tue, Feb 21, 2023 at 11:56:13AM -0800, Mark Millard wrote: >> ! > ! On Feb 21, 2023, at 04:55, Peter wrote: >> ! > ! >> ! > ! > ! # cd /usr/src/ >> ! > ! > ! # env WITH_META_MODE=yes make buildworld >> ! > ! > ! # env WITH_META_MODE=yes make installworld >> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #0) >> ! > ! > ! ## no more rebuilds below? >> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #1) >> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #2) >> ! > ! > >> ! > ! > But what is the difference between #0 and #1? >> ! > ! >> ! > ! awk, cp, ln, rm, sed, and many more from >> ! > ! . . ./tmp/legacy/usr/sbin/have new dates >> ! > ! for rebuilds after installworld (that targets >> ! > ! the running system). Not true for #1 and #2. >> ! > ! >> ! > ! The dates on these tools being more recent than >> ! > ! the files that they were involved in producing >> ! > ! leads to rebuilding those files. That in turn >> ! > ! leads to other files being rebuilt. >> ! > ! >> ! > ! make with -dM reports the likes of: >> ! > ! >> ! > !file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... >> ! > ! >> ! > ! explicitly as it goes. As I remember tmp/legacy/usr/sbin/ >> ! > ! was always part of the path for what I found. >> ! > >> ! > Mark, thanks a lot for the proper input at the right time! >> ! > >> ! > This put me on the right track and I mananged to analyze and >> ! > understand what is actually happening. >> ! > >> ! > It looks like my issue does resolve itself somehow, and things >> ! > start to behave as expected again after four builds. >> ! >> ! Intersting. >> ! >> ! > ! I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk >> ! > ! ended up being newer than such a target and, so, causing a >> ! > ! rebuild of that target. I was going the direction: that >> ! > ! it is newer really is unlikely to justify the rebuild for >> ! > ! the target(s) in question. The other direction about how >> ! > ! it got to be newer is also relevant. >> ! > >> ! > I have now analyzed some parts of it. META_MODE typically finds some >> ! > build-tools to rebuild, but then if the result is not different >> ! > from what was there before, then "install" will not copy it to the >> ! > bin-dir, and so the avalanche gets usually avoided. >> ! > >> ! >> ! The implication is that "install -C" is in use, quoting the >> ! man page: >> ! >> ! -C Copy the file. If the target file already exists and the >> files >> ! are the same, then do not change the modification time of the >> ! target. If the target's file flags and mode need not to be >> ! changed, the target's inode change time is also unchanged. >> ! >> ! -c Copy the file. This is actually the default. The -c option >> is >> ! only included for backwards compatibility. >> ! >> ! -C might have more of an effect in a reproducible-build >> ! style build process than on a non-reproducible-build >> ! style one. >> >> Yepp. "install -p" is used, see /usr/src/tools/install.sh That may be incorrect about what is happening for _bootstap_tools_links and other things. Why do I say that? Several points . . . I do not see "tools" in any PATH= so far, making implicit use unlikely. /usr/main-src/share/mk/sys.mk:INSTALL ?= ${INSTALL_CMD:Uinstall} /usr/main-src/share/mk/src.tools.mk:INSTALL_CMD?= install vs. 
/usr/main-src/Makefile: INSTALL="sh ${.CURDIR}/tools/install.sh" /usr/main-src/Makefile.inc1:BMAKEENV= INSTALL="sh ${.CURDIR}/tools/install.sh" \ /usr/main-src/Makefile.inc1:KTMAKEENV= INSTALL="sh ${.CURDIR}/tools/install.sh" \ Also: # kernel-tools stage KTMAKEENV= INSTALL="sh ${.CURDIR}/tools/install.sh" \ vs. # world stage WMAKEENV= ${CROSSENV} \ INSTALL="${INSTALL_CMD} -U" \ and: .if defined(DB_FROM_SRC) || defined(NO_ROOT) IMAKE_INSTALL= INSTALL="${INSTALL_CMD} ${INSTALLFLAGS}" So: explicitly varying styles for various
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 21, 2023, at 21:53, Mark Millard wrote: > On Feb 21, 2023, at 20:51, Mark Millard wrote: > >> On Feb 21, 2023, at 19:11, Peter wrote: >> >>> On Tue, Feb 21, 2023 at 06:44:09PM -0800, Mark Millard wrote: >>> ! On Feb 21, 2023, at 18:10, Peter wrote: >>> ! >>> ! > On Tue, Feb 21, 2023 at 11:56:13AM -0800, Mark Millard wrote: >>> ! > ! On Feb 21, 2023, at 04:55, Peter wrote: >>> ! > ! >>> ! > ! > ! # cd /usr/src/ >>> ! > ! > ! # env WITH_META_MODE=yes make buildworld >>> ! > ! > ! # env WITH_META_MODE=yes make installworld >>> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #0) >>> ! > ! > ! ## no more rebuilds below? >>> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #1) >>> ! > ! > ! # env WITH_META_MODE=yes make buildworld (again #2) >>> ! > ! > >>> ! > ! > But what is the difference between #0 and #1? >>> ! > ! >>> ! > ! awk, cp, ln, rm, sed, and many more from >>> ! > ! . . ./tmp/legacy/usr/sbin/have new dates >>> ! > ! for rebuilds after installworld (that targets >>> ! > ! the running system). Not true for #1 and #2. >>> ! > ! >>> ! > ! The dates on these tools being more recent than >>> ! > ! the files that they were involved in producing >>> ! > ! leads to rebuilding those files. That in turn >>> ! > ! leads to other files being rebuilt. >>> ! > ! >>> ! > ! make with -dM reports the likes of: >>> ! > ! >>> ! > !file '. . ./tmp/legacy/usr/sbin/awk' is newer than the target... >>> ! > ! >>> ! > ! explicitly as it goes. As I remember tmp/legacy/usr/sbin/ >>> ! > ! was always part of the path for what I found. >>> ! > >>> ! > Mark, thanks a lot for the proper input at the right time! >>> ! > >>> ! > This put me on the right track and I mananged to analyze and >>> ! > understand what is actually happening. >>> ! > >>> ! > It looks like my issue does resolve itself somehow, and things >>> ! > start to behave as expected again after four builds. >>> ! >>> ! Intersting. >>> ! >>> ! > ! I did not do the analysis of how (e.g.) tmp/legacy/usr/sbin/awk >>> ! > ! ended up being newer than such a target and, so, causing a >>> ! > ! rebuild of that target. I was going the direction: that >>> ! > ! it is newer really is unlikely to justify the rebuild for >>> ! > ! the target(s) in question. The other direction about how >>> ! > ! it got to be newer is also relevant. >>> ! > >>> ! > I have now analyzed some parts of it. META_MODE typically finds some >>> ! > build-tools to rebuild, but then if the result is not different >>> ! > from what was there before, then "install" will not copy it to the >>> ! > bin-dir, and so the avalanche gets usually avoided. >>> ! > >>> ! >>> ! The implication is that "install -C" is in use, quoting the >>> ! man page: >>> ! >>> ! -C Copy the file. If the target file already exists and the >>> files >>> ! are the same, then do not change the modification time of the >>> ! target. If the target's file flags and mode need not to be >>> ! changed, the target's inode change time is also unchanged. >>> ! >>> ! -c Copy the file. This is actually the default. The -c option >>> is >>> ! only included for backwards compatibility. >>> ! >>> ! -C might have more of an effect in a reproducible-build >>> ! style build process than on a non-reproducible-build >>> ! style one. >>> >>> Yepp. "install -p" is used, see /usr/src/tools/install.sh > > That may be incorrect about what is happening for > _bootstap_tools_links and other things. Why do I > say that? Several points . . . 
I missed looking for an obvious type of evidence:

-rw-r--r-- 1 root wheel 2355 Apr 28 15:20:53 2021 /usr/main-src/tools/install.sh

The script is not executable, which explains the use of an explicit sh in:

INSTALL="sh ${.CURDIR}/tools/install.sh"

This tends to nail down that the likes of:

install -o root -g wheel -m 555 . . .

in the output is not an example of using the script.

> I do not see "tools" in any PATH= so far, making implicit
> use unlike
Re: 13.2 BETA2: how do debug META_MODE?
d64.amd64/tmp/usr/bin/nm' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/touch' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/jot' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/egrep' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/crunchgen' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/cap_mkdb' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/basename' is newer than the target... As for lines unique to the 2nd run (diff output line starts with "+"): # diff -u /usr/obj/BUILDs/main-amd64-nodbg-clang/sys-typescripts/typescript-make-amd64-nodbg-clang-amd64-host-2023-02-22:12:* | grep "^[+].*is newer than the target" | sed -e "s@^.*: file '@file '@" | sort | uniq -c | sort -rn | more 2155 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/realpath' is newer than the target... 1466 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/Scrt1.o' is newer than the target... 878 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/ln' is newer than the target... 444 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/crti.o' is newer than the target... 235 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/crti.o' is newer than the target... 67 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/crt1.o' is newer than the target... 41 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libgcc_s.so' is newer than the target... 40 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libgcc_s.so' is newer than the target... 3 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/sh' is newer than the target... 2 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/ncurses/tinfo/./ncurses_dll.h' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libcxxrt.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libcrypto.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libc.so.7' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/Scrt1.o' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libssl.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libcxxrt.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libctf.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libcrypto.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/lib/libc.so.7' is newer than the target... 
(The list is already short, so no subset listed.) === Mark Millard marklmi at yahoo.com
Re: 13.2 BETA2: how do debug META_MODE?
On Feb 22, 2023, at 14:07, Mark Millard wrote: > This is just an FYI about an experiment. The experiment had a flaw: I did not do the builds as -j1 (or no -j at all) but the parallel activities changes the sequencing of the lines from run to run. Thus the "diff -u" part finds more differences than I was intending. I may try to make a better experiment at some point. > After having done installworld installkernel for other reasons, > I did two -dM buildworld buildkernel sequences in a row (no > source changes, no cleaning activity), producing script files > logging the output. The below provides some comparison/contrast > between the two log files. > > Below I report first on the frequencies of the file paths > reported in the "is newer than the target" lines that were > unique to the first run (diff output line starts with "-"). > > I only show a prefix the full list: > > # diff -u > /usr/obj/BUILDs/main-amd64-nodbg-clang/sys-typescripts/typescript-make-amd64-nodbg-clang-amd64-host-2023-02-22:12:* > | grep "^-.*is newer than the target" | sed -e "s@^.*: file '@file '@" | > sort | uniq -c | sort -rn | more > 4432 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/gzip' > is newer than the target... > 2692 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/awk' > is newer than the target... > 2155 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/realpath' > is newer than the target... > 1395 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/Scrt1.o' > is newer than the target... > 1381 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/secure/lib/libcrypto/openssl/opensslconf.h' > is newer than the target... > 1318 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/obj-lib32/secure/lib/libcrypto/openssl/opensslconf.h' > is newer than the target... > 1000 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/cat' > is newer than the target... > 962 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/rm' > is newer than the target... > 928 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/sh' > is newer than the target... > 878 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/ln' > is newer than the target... > 624 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/clang/libllvm/llvm/IR/Attributes.inc' > is newer than the target... > 553 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/sed' > is newer than the target... > 437 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/mv' > is newer than the target... > 417 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/mkcsmapper' > is newer than the target... > 398 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/clang/libclang/clang/Basic/DiagnosticCommonKinds.inc' > is newer than the target... > 351 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/grep' > is newer than the target... > 281 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/clang/libllvm/llvm/IR/IntrinsicEnums.inc' > is newer than the target... 
> 177 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/crti.o' > is newer than the target... > 161 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/clang/libclang/clang/AST/DeclNodes.inc' > is newer than the target... > 115 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/llvm-tblgen' > is newer than the target... > 98 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/crunchide' > is newer than the target... > 86 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/lib/clang/libclang/clang/StaticAnalyzer/Checkers/Checkers.inc' > is newer than the target... > 75 file > '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/legacy/usr/sbin/uudecode' > is newer than the target... >
Re: 13.2 BETA2: how do debug META_MODE?
r/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libcrypto.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/libc.so.7' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib32/Scrt1.o' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libssl.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libcxxrt.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libctf.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libcrypto.so' is newer than the target... 1 file '/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/lib/libc.so.7' is newer than the target... So the two end up very similar for what activity happens. I've only tried this on the amd64 context that I have access to. I'll set up the aarch64 context as well and see how it goes over time. (That context builds for aarch64 and for armv7 .) Similarly, I've only tried main but will be adding the changes to my releng/13.0 , releng/13.1 , and stable/13 contexts and seeing how it goes. (Not that I'm likely to rebuild releng/13.0 at this point.) For now, I've no plans for investigations related to any of the *.o , *.h , *.so* "is newer than" activity listed above. For reference: # git -C /usr/main-src/ diff share/mk/src.sys.obj.mk diff --git a/share/mk/src.sys.obj.mk b/share/mk/src.sys.obj.mk index 3b48fc3c5514..3c7e570dbdbd 100644 --- a/share/mk/src.sys.obj.mk +++ b/share/mk/src.sys.obj.mk @@ -67,6 +67,9 @@ SB_OBJROOT?= ${SB}/obj/ OBJROOT?= ${SB_OBJROOT} .endif OBJROOT?= ${_default_makeobjdirprefix}${SRCTOP}/ +# save the value before we mess with it +_OBJROOT:= ${OBJROOT:tA} +.export _OBJROOT .if ${OBJROOT:M*/} != "" OBJROOT:= ${OBJROOT:H:tA}/ .else (The change is not specific to main .) The content for the special make.conf has the following block of lines for having META MODE avoid specific . . ./tmp/legacy/usr/sbin/* programs (and 3 tmp/usr/bin/* ones) from causing rebuild activity based on the dates on the programs: # _OBJROOT is an addition to share/mk/src.sys.obj.mk # provided by Simon J. 
Gerraty for my experimentation
# with this avoidance of some unnecessary build
# activity in META MODE:
#
# OBJROOT?= ${_default_makeobjdirprefix}${SRCTOP}/
# +# save the value before we mess with it
# +_OBJROOT:= ${OBJROOT:tA}
# +.export _OBJROOT
#
# TARGET.TARGET_ARCH for amd64 stays as amd64.amd64 for obj-lib32 (correct for the purpose)
# MACHINE.MACHINE_ARCH for amd64 turns into i386.i386 for obj-lib32 (wrong for the purpose)
#
IGNORELEGACY_NOSYMLINKPREFIX= ${_OBJROOT}/${TARGET}.${TARGET_ARCH}/tmp/legacy/usr
IGNOREOTHER_NOSYMLINKPREFIX= ${_OBJROOT}/${TARGET}.${TARGET_ARCH}/tmp/usr/bin
#
.for ignore_legacy_tool in awk basename cap_mkdb cat chmod cmp cp crunchgen crunchide cut date dd dirname echo egrep env expr fgrep file2c find gencat grep gzip head hostname jot lex lb ln ls m4 make mkcsmapper mkdir mktemp mtree mv nawk patch realpath rm sed sh sort touch tr truncate uudecode uuencode wc xargs
.MAKE.META.IGNORE_PATHS+= ${IGNORELEGACY_NOSYMLINKPREFIX}/sbin/${ignore_legacy_tool}
.endfor
#
.for ignore_other_tool in ctfconvert objcopy nm
.MAKE.META.IGNORE_PATHS+= ${IGNOREOTHER_NOSYMLINKPREFIX}/${ignore_other_tool}
.endfor
#
.MAKE.META.IGNORE_PATHS:= ${.MAKE.META.IGNORE_PATHS}

The . . ./tmp/usr/bin/* ones ( ctfconvert objcopy nm ) may be more questionable than the . . ./tmp/legacy/usr/sbin/* ones. This likely will not prevent the likes of a system with clang14 -> system with clang15 transition having clang15 rebuild itself once the clang15 system is running and another buildworld is started.
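To confirm the ignore list is taking effect, the same "is newer than the target" summarizing trick from earlier in the thread works (a sketch; the log file name is a placeholder for whatever typescript the -dM build produced):

# after a build with the block above in place, the ignored tools should stop
# showing up among the "is newer than the target" reports
grep "is newer than the target" typescript-make.log | grep "tmp/legacy/usr/sbin" | sort | uniq -c | sort -rn

=== Mark Millard marklmi at yahoo.com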
Re: git: a28ccb32bf56 - main - machine-id: generate a compact version of the uuid
Mike Karels wrote on Date: Fri, 03 Mar 2023 16:12:50 UTC : > On 3 Mar 2023, at 9:40, Tijl Coosemans wrote: > > > On Wed, 1 Mar 2023 18:18:33 GMT Baptiste Daroussin wrote: > >> The branch main has been updated by bapt: > >> > >> URL: > >> https://cgit.FreeBSD.org/src/commit/?id=a28ccb32bf5678fc401f1602865ee9b37ca4c990 > >> > >> commit a28ccb32bf5678fc401f1602865ee9b37ca4c990 > >> Author: Baptiste Daroussin > >> AuthorDate: 2023-02-28 10:31:06 + > >> Commit: Baptiste Daroussin > >> CommitDate: 2023-03-01 18:16:25 + > >> > >> machine-id: generate a compact version of the uuid > >> > >> dbus and other actually expect an uuid without hyphens > >> > >> Reported by: tijl > >> MFC After: 3 days > >> --- > >> libexec/rc/rc.d/machine_id | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/libexec/rc/rc.d/machine_id b/libexec/rc/rc.d/machine_id > >> index 7cfd7b2d92f8..8bf3e41d0603 100644 > >> --- a/libexec/rc/rc.d/machine_id > >> +++ b/libexec/rc/rc.d/machine_id > >> @@ -23,7 +23,7 @@ machine_id_start() > >> if [ ! -f ${machine_id_file} ] ; then > >> startmsg -n "Creating ${machine_id_file} " > >> t=$(mktemp -t machine-id) > >> - /bin/uuidgen -r -o $t > >> + /bin/uuidgen -r -c -o $t > >> install -C -o root -g wheel -m ${machine_id_perms} "$t" > >> "${machine_id_file}" > >> rm -f "$t" > >> startmsg 'done.' > > > > I really think this file should be defined to contain the same UUID as > > /etc/hostid such that there's one and only one UUID per machine. Having > > two different IDs needlessly complicates things if they end up in logs > > etc. > > > > It also looks like on Linux virtual machines this file contains the > > SMBIOS UUID just like our /etc/hostid. If /etc/machine-id is supposed > > to be a portable way to obtain that UUID it should be the same as > > /etc/hostid. > > I agree. I had the same reaction when the machine-id was added, but > thought the requirements were different (in particular, the UUID version). > If at all possible, the two should be the same except for hyphens. > > > Please have another look at https://reviews.freebsd.org/D38811. This > > file is supposed to remain constant across updates. If we get this > > wrong in 13.2, applications may have to deal with the complications for > > a very long time. > > This should be resolved for 13.2 if at all possible. What are the properties for the content of /etc/hostid in FreeBSD? Where are they documented? /etc/machine-id has strong property guarnatee requirements in linux and dbus (which linux indicates it has adopted requirements from): https://man7.org/linux/man-pages/man5/machine-id.5.html reports: QUOTE The machine ID does not change based on local or network configuration or when hardware is replaced. Due to this and its greater length, it is a more useful replacement for the gethostid(3) call that POSIX specifies. This machine ID adheres to the same format and logic as the D-Bus machine ID. END QUOTE https://dbus.freedesktop.org/doc/dbus-uuidgen.1.html reports: ( used via dbus-uuidgen --ensure=/etc/machine-id as one way to get a linux-comaptibile /etc/machine-id for at least some types of contexts ) QUOTE The important properties of the machine UUID are that 1) it remains unchanged until the next reboot and 2) it is different for any two running instances of the OS kernel. 
That is, if two processes see the same UUID, they should also see the same shared memory, UNIX domain sockets, local X displays, localhost.localdomain resolution, process IDs, and so forth
END QUOTE

Does /etc/hostid, generated the normal way in FreeBSD, have such properties? (How do I look that up?)

Returning to: https://man7.org/linux/man-pages/man5/machine-id.5.html

QUOTE
This ID uniquely identifies the host. It should be considered "confidential", and must not be exposed in untrusted environments, in particular on the network. If a stable unique identifier that is tied to the machine is needed for some application, the machine ID or any part of it must not be used directly. Instead the machine ID should be hashed with a cryptographic, keyed hash function, using a fixed, application-specific key. That way the ID will be properly unique, and derived in a constant way from the machine ID but there will be no way to retrieve the original machine ID from the application-specific one.
END QUOTE

Is that at least recommended for handling FreeBSD's /etc/hostid content?

Is FreeBSD going to document /etc/machine-id content properties in a similar manner?

If FreeBSD ends up with a /etc/machine-id that does not have the properties and recommended principles of use, it would appear that the /etc/machine-id path would be highly misleading and, so, inappropriate.
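As an aside, the hashing recommendation quoted above amounts to a derivation like the following (a sketch; the key string is a placeholder for an application-specific key):

# derive an application-specific ID instead of exposing the machine ID itself
openssl dgst -sha256 -hmac "example-app-key" /etc/machine-id

=== Mark Millard marklmi at yahoo.com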
Re: git: a28ccb32bf56 - main - machine-id: generate a compact version of the uuid
On Mar 4, 2023, at 06:32, Tijl Coosemans wrote: > > On Fri, 3 Mar 2023 10:36:20 -0800 Mark Millard wrote: >> What are the properties for the content of /etc/hostid >> in FreeBSD? Where are they documented? >> >> /etc/machine-id has strong property guarnatee >> requirements in linux and dbus (which linux indicates >> it has adopted requirements from): >> >> https://man7.org/linux/man-pages/man5/machine-id.5.html >> >> reports: >> >> QUOTE >> The machine ID does not change based on local or network >> configuration or when hardware is replaced. Due to this and its >> greater length, it is a more useful replacement for the >> gethostid(3) call that POSIX specifies. >> >> This machine ID adheres to the same format and logic as the D-Bus >> machine ID. >> END QUOTE > > /etc/hostid is written once. It does not change with network or > hardware changes. > >> https://dbus.freedesktop.org/doc/dbus-uuidgen.1.html reports: >> ( used via dbus-uuidgen --ensure=/etc/machine-id as one way >> to get a linux-comaptibile /etc/machine-id for at least >> some types of contexts ) >> >> QUOTE >> The important properties of the machine UUID are that 1) it remains >> unchanged until the next reboot and 2) it is different for any two >> running instances of the OS kernel. That is, if two processes see >> the same UUID, they should also see the same shared memory, UNIX >> domain sockets, local X displays, localhost.localdomain resolution, >> process IDs, and so forth >> END QUOTE >> >> >> Does /etc/hostid generated the normal way in FreeBSD have such >> properties? (How do I look that up?) > > Yes. It's `kenv smbios.system.uuid` if that's available and generated > by uuidgen otherwise. The code is in /etc/rc.d/hostid and > /etc/rc.d/hostid_save. I probably also should have quoted the below for completeness: QUOTE Also, don't make it the same on two different systems; it needs to be different anytime there are two different kernels running. END QUOTE There are implications for some virtual environments. >> Returning to: >> >> https://man7.org/linux/man-pages/man5/machine-id.5.html >> >> QUOTE >> This ID uniquely identifies the host. It should be considered >> "confidential", and must not be exposed in untrusted >> environments, in particular on the network. If a stable unique >> identifier that is tied to the machine is needed for some >> application, the machine ID or any part of it must not be used >> directly. Instead the machine ID should be hashed with a >> cryptographic, keyed hash function, using a fixed, >> application-specific key. That way the ID will be properly >> unique, and derived in a constant way from the machine ID but >> there will be no way to retrieve the original machine ID from the >> application-specific one. >> END QUOTE >> >> Is that at least recommended for handling FreeBSD's /etc/hostid >> content? > > No, the file is not documented at all, but this is a recommendation on > how to use the file not a restriction on the content like the other > quotes so this isn't an impediment to using the same ID in > /etc/machine-id. That presumes that what FreeBSD does with /etc/hostid content keeps the content confidential by default, such as using hashing to avoid there being a way to "retrieve the original machine ID". (It may well, but that is not documented.) Otherwise following the recommendation would be an impossibility for /etc/hostid content. >> Is FreeBSD going to document /etc/machine-id content properties >> in a similar manor? 
>> >> >> If FreeBSD ends up with a /etc/machine-id that does not have >> the properties and recommended principles of use, it would >> appear that the /etc/machine-id path would be highly misleading >> and, so, inappropriate. Thanks for the notes. === Mark Millard marklmi at yahoo.com
SYSDECODE_ABI_FREEBSD32 for #include : armv7 for aarch64?
https://man.freebsd.org/cgi/man.cgi?query=sysdecode&apropos=0&sektion=3&manpath=FreeBSD+13.2-STABLE&arch=default&format=html reports:

SYSDECODE_ABI_FREEBSD32    32-bit FreeBSD binaries. Supported on amd64 and powerpc64.

But what of contexts with:

# sysctl kern.supported_archs
kern.supported_archs: aarch64 armv7
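For reference, one concrete consumer of these SYSDECODE_ABI_* values is kdump, which uses libsysdecode when decoding a traced process's syscalls. A sketch of the sort of use the question is about (the armv7 program name is a placeholder):

# trace an armv7 binary on the aarch64 host, then decode the trace
ktrace -f armv7-prog.ktrace ./armv7-prog
kdump -f armv7-prog.ktrace | head

=== Mark Millard marklmi at yahoo.com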
I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
# cat /etc/hostid /etc/machine-id /var/db/machine-id
a4f7fbeb-f668-11de-b280-ebb65474e619
a4f7fbebf66811deb280ebb65474e619
7227cd89727a462186e3ba680d0ee142

(I'll not be keeping these values for the example system.)

# ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id
-rw-r--r-- 1 root wheel 37 Dec 31 16:00:18 2009 /etc/hostid
-rw-r--r-- 1 root wheel 33 Mar 16 15:16:18 2023 /etc/machine-id
-r--r--r-- 1 root wheel 33 Mar 3 23:03:25 2023 /var/db/machine-id

I observed the delete-old-files deleting /etc/machine-id during the upgrade. It did nothing with /var/db/machine-id .

Also, modern hostid generation was switched to random to avoid an exposure. But the update kept the old hostid and propagated it (not "-"s) into /etc/machine-id . So /etc/machine-id now has the same exposure.

Later I'll see if stable/13 also got such behavior for its upgrade.

I've not been dealing with releng/13.2 but upgrades from releng/13.1 and before likely have the same questions for what the handling should be vs. what it might actually be. Different ways of upgrading might not be in agreement, for all I know.

=== Mark Millard marklmi at yahoo.com
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 16, 2023, at 15:55, Mark Millard wrote: > # cat /etc/hostid /etc/machine-id /var/db/machine-id > a4f7fbeb-f668-11de-b280-ebb65474e619 > a4f7fbebf66811deb280ebb65474e619 > 7227cd89727a462186e3ba680d0ee142 > > (I'll not be keeping these values for the example system.) > > # ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id > -rw-r--r-- 1 root wheel 37 Dec 31 16:00:18 2009 /etc/hostid > -rw-r--r-- 1 root wheel 33 Mar 16 15:16:18 2023 /etc/machine-id > -r--r--r-- 1 root wheel 33 Mar 3 23:03:25 2023 /var/db/machine-id > > I observed the delete-old-files deleting > /etc/machine-id during the upgrade. It did > nothing with /var/db/machine-id . > > Also, modern hostid generation was switched to > random to avoid an exposure. But the update kept > the old hostid and propogated it (not "-"s) into > /etc/machine-id . So /etc/machine-id now has the > same exposure. > > Later I'll see if stable/13 also got such behavior > for its upgrade. > > I've not been dealing with releng/13.2 but upgrades > from releng/13.1 and before likely have the same > questions for what the handling should be vs. what it > might actually be. Different ways of upgrading might > not be in agreement, for all I know. > stable/13 was updated to be stable/13-n254805-4e4e299b0950 based. It got the same type of results. (I'll not list the actual id's for this context.) # ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id -rw-r--r-- 1 root wheel 37 Jul 5 20:08:03 2022 /etc/hostid -rw-r--r-- 1 root wheel 33 Mar 16 13:32:49 2023 /etc/machine-id -r--r--r-- 1 root wheel 33 Mar 3 23:07:55 2023 /var/db/machine-id (I'm not sure of the intent on the permissions.) === Mark Millard marklmi at yahoo.com
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 16, 2023, at 16:48, Colin Percival wrote: > I think the current situation should be sorted out aside from potential issues > for people who upgraded to a "broken" version before updating to the latest > code -- CCing bapt and tijl just in case since they're more familiar with this > than I am. A question may be if past dbus port related activity might have established a /var/db/machine-id independent of the recent FreeBSD activity. That might not be able to be classified as a "broken version": Before upgrade: /etc/hostid (old style) /var/db/machine-id (via port) After binary or source upgrade to releng/13.2 . . . ? For other source(!) upgrades: Similarly but to a stable/13 (jumping over the middle)? Similarly but to a main [so: 14] (jumping over the middle)? To some extent the "broken" context is somewhat analogous other possible prior history sequences with /var/db/machine-id and /etc/hostid ( but not /etc/machine-id ). > Colin Percival > > On 3/16/23 15:55, Mark Millard wrote: >> # cat /etc/hostid /etc/machine-id /var/db/machine-id >> a4f7fbeb-f668-11de-b280-ebb65474e619 >> a4f7fbebf66811deb280ebb65474e619 >> 7227cd89727a462186e3ba680d0ee142 >> (I'll not be keeping these values for the example system.) >> # ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id >> -rw-r--r-- 1 root wheel 37 Dec 31 16:00:18 2009 /etc/hostid >> -rw-r--r-- 1 root wheel 33 Mar 16 15:16:18 2023 /etc/machine-id >> -r--r--r-- 1 root wheel 33 Mar 3 23:03:25 2023 /var/db/machine-id >> I observed the delete-old-files deleting >> /etc/machine-id during the upgrade. It did >> nothing with /var/db/machine-id . >> Also, modern hostid generation was switched to >> random to avoid an exposure. But the update kept >> the old hostid and propogated it (not "-"s) into >> /etc/machine-id . So /etc/machine-id now has the >> same exposure. >> Later I'll see if stable/13 also got such behavior >> for its upgrade. >> I've not been dealing with releng/13.2 but upgrades >> from releng/13.1 and before likely have the same >> questions for what the handling should be vs. what it >> might actually be. Different ways of upgrading might >> not be in agreement, for all I know. === Mark Millard marklmi at yahoo.com
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 16, 2023, at 17:27, Mark Millard wrote: > On Mar 16, 2023, at 16:48, Colin Percival wrote: > >> I think the current situation should be sorted out aside from potential >> issues >> for people who upgraded to a "broken" version before updating to the latest >> code -- CCing bapt and tijl just in case since they're more familiar with >> this >> than I am. > > A question may be if past dbus port related activity might > have established a /var/db/machine-id independent of the > recent FreeBSD activity. That might not be able to be > classified as a "broken version": > > Before upgrade: > /etc/hostid (old style) > /var/db/machine-id (via port) Looks like var/db/machine-id is not a dbus default place: # find /var -name machine-id -print | more # dbus-uuidgen --ensure # find /var -name machine-id -print | more /var/lib/dbus/machine-id So the path in my analogy may not be the right one for overall question. > After binary or source upgrade to releng/13.2 . . . ? > > For other source(!) upgrades: > Similarly but to a stable/13 (jumping over the middle)? > Similarly but to a main [so: 14] (jumping over the middle)? > > To some extent the "broken" context is > somewhat analogous other possible prior > history sequences with /var/db/machine-id > and /etc/hostid ( but not /etc/machine-id ). > >> Colin Percival >> >> On 3/16/23 15:55, Mark Millard wrote: >>> # cat /etc/hostid /etc/machine-id /var/db/machine-id >>> a4f7fbeb-f668-11de-b280-ebb65474e619 >>> a4f7fbebf66811deb280ebb65474e619 >>> 7227cd89727a462186e3ba680d0ee142 >>> (I'll not be keeping these values for the example system.) >>> # ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id >>> -rw-r--r-- 1 root wheel 37 Dec 31 16:00:18 2009 /etc/hostid >>> -rw-r--r-- 1 root wheel 33 Mar 16 15:16:18 2023 /etc/machine-id >>> -r--r--r-- 1 root wheel 33 Mar 3 23:03:25 2023 /var/db/machine-id >>> I observed the delete-old-files deleting >>> /etc/machine-id during the upgrade. The above is wrong: it was etcupdate activity, not delete-old-files activity, that did the delete ("D") and did nothing with /var/???/machine-id . >>> It did >>> nothing with /var/db/machine-id . >>> Also, modern hostid generation was switched to >>> random to avoid an exposure. But the update kept >>> the old hostid and propogated it (not "-"s) into >>> /etc/machine-id . So /etc/machine-id now has the >>> same exposure. >>> Later I'll see if stable/13 also got such behavior >>> for its upgrade. >>> I've not been dealing with releng/13.2 but upgrades >>> from releng/13.1 and before likely have the same >>> questions for what the handling should be vs. what it >>> might actually be. Different ways of upgrading might >>> not be in agreement, for all I know. > It might just be that there should be notes someplace about checking and possibly fixing the various machine-id related file relationships, especially if "dbus-uuidgen --ensure" (default path) was part of the prior context. === Mark Millard marklmi at yahoo.com
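For reference, a small sh sketch of the sort of check such notes might suggest. The loop itself is mine; the paths are just the ones that have come up in this thread, including the dbus-uuidgen default location:

# A sketch only: list whichever machine-id style files a system has
# accumulated, with sizes and timestamps, so a prior dbus origin can be
# told apart from the base-system handling discussed above.
for f in /etc/hostid /etc/machine-id /var/db/machine-id /var/lib/dbus/machine-id
do
    [ -e "$f" ] && ls -Tld "$f"
done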
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 17, 2023, at 10:15, Tijl Coosemans wrote: > On Thu, 16 Mar 2023 16:48:40 -0700 Colin Percival > wrote: >> I think the current situation should be sorted out aside from potential >> issues >> for people who upgraded to a "broken" version before updating to the latest >> code -- CCing bapt and tijl just in case since they're more familiar with >> this >> than I am. >> >> Colin Percival >> >> On 3/16/23 15:55, Mark Millard wrote: >>> # cat /etc/hostid /etc/machine-id /var/db/machine-id >>> a4f7fbeb-f668-11de-b280-ebb65474e619 >>> a4f7fbebf66811deb280ebb65474e619 >>> 7227cd89727a462186e3ba680d0ee142 >>> >>> (I'll not be keeping these values for the example system.) >>> >>> # ls -Tld /etc/hostid /etc/machine-id /var/db/machine-id >>> -rw-r--r-- 1 root wheel 37 Dec 31 16:00:18 2009 /etc/hostid >>> -rw-r--r-- 1 root wheel 33 Mar 16 15:16:18 2023 /etc/machine-id >>> -r--r--r-- 1 root wheel 33 Mar 3 23:03:25 2023 /var/db/machine-id >>> >>> I observed the delete-old-files deleting >>> /etc/machine-id during the upgrade. It did >>> nothing with /var/db/machine-id . > > delete-old deletes /etc/rc.d/machine-id, etcupdate deletes > /etc/machine-id. I suppose delete-old could also delete > /var/db/machine-id but the file is harmless so I don't think this is > important for 13.2. Good to know. I'll remove the /var/db/machine-id that the machines happen to have around. >>> Also, modern hostid generation was switched to >>> random to avoid an exposure. But the update kept >>> the old hostid and propogated it (not "-"s) into >>> /etc/machine-id . So /etc/machine-id now has the >>> same exposure. > > These files are meant to remain constant across reboots, so the update > process cannot change an existing /etc/hostid. For example, it is used > by NFS servers to restore state when a client crashes and reboots. Good to know. Absent man page(s) describing the principles for handling the hostid and machine-id file(s) (and why), what to report vs. not was unclear. So, for example, the historical hostid value takes default precedence over a potential adjustment to be random-based instead. That was not obvious to me prior to the explanation. I'm not aware of any place to find that in the man pages or other documentation. > If nothing relies on the old ID you can generate a new one by running > "uuidgen -r > /etc/hostid" and rebooting the machine. Yea, in my context, it appears that I can freely update the files. >>> Later I'll see if stable/13 also got such behavior >>> for its upgrade. >>> >>> I've not been dealing with releng/13.2 but upgrades >>> from releng/13.1 and before likely have the same >>> questions for what the handling should be vs. what it >>> might actually be. Different ways of upgrading might >>> not be in agreement, for all I know. Thanks for the notes. === Mark Millard marklmi at yahoo.com
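For reference, a minimal sh sketch of the cleanup for a context like mine where nothing relies on the old IDs. Only the uuidgen step is from Tijl's note; removing /etc/machine-id so that the next boot regenerates it is my own assumption, based on the dash-stripped hostid relationship visible in the example values and on a missing /etc/machine-id being established during a reboot:

# Only do this if nothing relies on the old hostid (e.g. NFS server state):
uuidgen -r > /etc/hostid    # per Tijl's note
# Assumption on my part: drop the old machine-id and let the next boot
# regenerate it from the new hostid.
rm -f /etc/machine-id
shutdown -r now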
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
The 13.1-RELEASE (snapshot) to 13.2-RC3 freebsd-update upgrade sequence did not go well as far as prompting me to do the right thing to establish /etc/machine-id . After the last reboot (kernel upgrade, presumably) it had me continue with. . . # /usr/sbin/freebsd-update install src component not installed, skipped ZFS filesystem version: 5 ZFS storage pool version: features support (5000) Installing updates... install: ///var/db/etcupdate/current/etc/rc.d/growfs_fstab: No such file or directory install: ///var/db/etcupdate/current/etc/rc.d/var_run: No such file or directory install: ///var/db/etcupdate/current/etc/rc.d/zpoolreguid: No such file or directory Scanning //usr/share/certs/blacklisted for certificates... Scanning //usr/share/certs/trusted for certificates... rmdir: ///usr/tests/usr.bin/timeout: Directory not empty done. root@generic:~ # cat /etc/hostid /etc/mach* cat: No match. It did not indicate the need for another reboot to end up with a /etc/machine-id file. I tried "shutdown -r now" anyway. It did establish an /etc/machine-id file during the reboot: # ls -Tld /etc/hostid /etc/machine-id -rw-r--r-- 1 root wheel 37 May 12 08:46:21 2022 /etc/hostid -rw-r--r-- 1 root wheel 33 May 13 09:46:56 2022 /etc/machine-id So the basic implementation is operational but just lacks an indication of the need to reboot again. The date/time is because it is a RPi4B context (no time of its own) and time is not automatically being established via ntp, apparently. (I did not make such adjustments to the snapshot before starting the upgrade.) I do not know if any of the "install: ///var/db/etcupdate/ . . . " lines or the rmdir line are important. It earlier indicated 5708 patches were fetched and that 377 files were as well. === Mark Millard marklmi at yahoo.com
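For reference, a tiny sh sketch of the check I effectively did by hand at the end of the upgrade. The conditional wrapper is mine, not something freebsd-update itself suggests:

# After the final "freebsd-update install" pass: /etc/machine-id is only
# generated at boot time, so reboot once more if it is still missing.
if [ ! -e /etc/machine-id ]; then
    shutdown -r now
fi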
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 17, 2023, at 18:24, Mark Millard wrote: > The 13.1-RELEASE (snapshot) to 13.2-RC3 freebsd-update's > upgrade sequence did not go well relative to my being > prompted to do the right thing to establish /etc/machine-id . > After the last reboot (kernel upgrade, presumably) it had me > continue with. . . > > # /usr/sbin/freebsd-update install > src component not installed, skipped > ZFS filesystem version: 5 > ZFS storage pool version: features support (5000) > Installing updates... > install: ///var/db/etcupdate/current/etc/rc.d/growfs_fstab: No such file or > directory > install: ///var/db/etcupdate/current/etc/rc.d/var_run: No such file or > directory > install: ///var/db/etcupdate/current/etc/rc.d/zpoolreguid: No such file or > directory > Scanning //usr/share/certs/blacklisted for certificates... > Scanning //usr/share/certs/trusted for certificates... > rmdir: ///usr/tests/usr.bin/timeout: Directory not empty > done. > root@generic:~ # cat /etc/hostid /etc/mach* > cat: No match. > > It did not indicate the need for another reboot to > end up with a /etc/machine-id file. > > I tried "shutdown -r now" anyway. It did establish > an /etc/machine-id file during the reboot: > > # ls -Tld /etc/hostid /etc/machine-id > -rw-r--r-- 1 root wheel 37 May 12 08:46:21 2022 /etc/hostid > -rw-r--r-- 1 root wheel 33 May 13 09:46:56 2022 /etc/machine-id > > So the basic implementation is operational but just > lacks an indication of the need to reboot again. > > The date/time is because it is a RPi4B context (no > time of its own) and time is not automatically being > established via ntp, apparently. (I did not make such > adjustments to the snapshot before starting the > upgrade.) > > I do not know if any of the "install: ///var/db/etcupdate/ . . . " > lines or the rmdir line are important. > > It earlier indicated 5708 patches were fetched and that 377 > files were as well. Using the likes of: http://ftp3.freebsd.org/pub/FreeBSD/releases/ISO-IMAGES/13.2/FreeBSD-13.2-RC3-arm64-aarch64-RPI.img.xz directly seems to produce installations with a constant: kenv -q smbios.system.uuid 30303031-3030-3030-3265-373238346338 that ends up being what is used for /etc/hostid . It looks like this traces back to the U-Boot involvement in the boot sequence: # kenv | grep smbios hint.smbios.0.mem="0x39c2b000" smbios.bios.reldate="10/01/2022" smbios.bios.revision="22.10" smbios.bios.vendor="U-Boot" smbios.bios.version="2022.10" smbios.chassis.maker="Unknown" smbios.chassis.type="Desktop" smbios.planar.maker="Unknown" smbios.planar.product="Unknown Product" smbios.socket.enabled="1" smbios.system.maker="Unknown" smbios.system.product="Unknown Product" smbios.system.serial="REDACTED" smbios.system.uuid="30303031-3030-3030-3265-373238346338" smbios.version="3.0" === Mark Millard marklmi at yahoo.com
Re: I just updated to main-n261544-cee09bda03c8 based (via source) and now /etc/machine-id and /var/db/machine-id disagree ; more
On Mar 17, 2023, at 19:04, Mark Millard wrote: > On Mar 17, 2023, at 18:24, Mark Millard wrote: > >> The 13.1-RELEASE (snapshot) to 13.2-RC3 freebsd-update's >> upgrade sequence did not go well relative to my being >> prompted to do the right thing to establish /etc/machine-id . >> After the last reboot (kernel upgrade, presumably) it had me >> continue with. . . >> >> # /usr/sbin/freebsd-update install >> src component not installed, skipped >> ZFS filesystem version: 5 >> ZFS storage pool version: features support (5000) >> Installing updates... >> install: ///var/db/etcupdate/current/etc/rc.d/growfs_fstab: No such file or >> directory >> install: ///var/db/etcupdate/current/etc/rc.d/var_run: No such file or >> directory >> install: ///var/db/etcupdate/current/etc/rc.d/zpoolreguid: No such file or >> directory >> Scanning //usr/share/certs/blacklisted for certificates... >> Scanning //usr/share/certs/trusted for certificates... >> rmdir: ///usr/tests/usr.bin/timeout: Directory not empty >> done. >> root@generic:~ # cat /etc/hostid /etc/mach* >> cat: No match. >> >> It did not indicate the need for another reboot to >> end up with a /etc/machine-id file. >> >> I tried "shutdown -r now" anyway. It did establish >> an /etc/machine-id file during the reboot: >> >> # ls -Tld /etc/hostid /etc/machine-id >> -rw-r--r-- 1 root wheel 37 May 12 08:46:21 2022 /etc/hostid >> -rw-r--r-- 1 root wheel 33 May 13 09:46:56 2022 /etc/machine-id >> >> So the basic implementation is operational but just >> lacks an indication of the need to reboot again. >> >> The date/time is because it is a RPi4B context (no >> time of its own) and time is not automatically being >> established via ntp, apparently. (I did not make such >> adjustments to the snapshot before starting the >> upgrade.) >> >> I do not know if any of the "install: ///var/db/etcupdate/ . . . " >> lines or the rmdir line are important. >> >> It earlier indicated 5708 patches were fetched and that 377 >> files were as well. > > Using the likes of: > > http://ftp3.freebsd.org/pub/FreeBSD/releases/ISO-IMAGES/13.2/FreeBSD-13.2-RC3-arm64-aarch64-RPI.img.xz > > directly seems to produce installations with a constant: > > kenv -q smbios.system.uuid > 30303031-3030-3030-3265-373238346338 > > that ends up being what is used for /etc/hostid . > > It looks like this traces back to the U-Boot > involvement in the boot sequence: > > # kenv | grep smbios > hint.smbios.0.mem="0x39c2b000" > smbios.bios.reldate="10/01/2022" > smbios.bios.revision="22.10" > smbios.bios.vendor="U-Boot" > smbios.bios.version="2022.10" > smbios.chassis.maker="Unknown" > smbios.chassis.type="Desktop" > smbios.planar.maker="Unknown" > smbios.planar.product="Unknown Product" > smbios.socket.enabled="1" > smbios.system.maker="Unknown" > smbios.system.product="Unknown Product" > smbios.system.serial="REDACTED" > smbios.system.uuid="30303031-3030-3030-3265-373238346338" > smbios.version="3.0" > Looks like if U-Boot ends up with a system serial number, it uses that as the basis for the system uuid: https://github.com/u-boot/u-boot/blob/master/lib/smbios.c char *serial_str = env_get("serial#"); . . . 
if (serial_str) { t->serial_number = smbios_add_string(ctx, serial_str); strncpy((char *)t->uuid, serial_str, sizeof(t->uuid)); } else { t->serial_number = smbios_add_prop(ctx, "serial"); } For example (some byte reordering also involved someplace): smbios.system.serial="10002e7284c8" smbios.system.uuid="30303031-3030-3030-3265-373238346338" #0 0 0 1- 0 0- 0 0- 2 e- 7 2 8 4 c 8 This explains my seeing the same uuid from 13.1-RELEASE installation as I later saw from an independent 13.2-RC3 installation (not upgrade): I reused the same RPi4B. All media produced on the same RPi4B will get the same hostid and machine-id files by default, given how U-Boot works and that smbios.system.uuid "wins" when present. This may all be fine. But it still leaves me expecting that there should be man page(s) covering these hostid and machine-id files and how they should be handled to match the usages to which they are put, such as the nfs use that was referenced. A note/reminder to look up that material could also be relevant. === Mark Millard marklmi at yahoo.com
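For reference, a quick sh check (my own, not from U-Boot) showing that the uuid bytes are just the ASCII codes of the serial number string that gets strncpy()'d into the uuid field:

# A sketch only: dump the ASCII byte values of the example serial number.
echo -n "10002e7284c8" | od -An -tx1
# shows: 31 30 30 30 32 65 37 32 38 34 63 38
# Compare smbios.system.uuid="30303031-3030-3030-3265-373238346338":
# the leading 4-byte uuid field is displayed byte-reversed (presumably
# the little-endian field encoding), the remaining bytes match in order.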
releng/13.1 amd64 atomic_fcmpset_long parameter order and dst,expect,src (source) vs. src,dst,expect (crash dump report)
Anyone know what to make of the below mismatch between the source and what crash log is reporting about the atomic_fcmpset_long parameter order? A releng/13.1 sys/amd64/include/atomic.h has the likes of: int atomic_fcmpset_long(volatile u_long *dst, u_long *expect, u_long src); Note the order: dst, expect, src. Later it has the implementation: /* * Atomic compare and set, used by the mutex functions. * * cmpset: * if (*dst == expect) * *dst = src * * fcmpset: * if (*dst == *expect) * *dst = src * else * *expect = *dst * * Returns 0 on failure, non-zero on success. */ #define ATOMIC_CMPSET(TYPE) \ static __inline int \ atomic_cmpset_##TYPE(volatile u_##TYPE *dst, u_##TYPE expect, u_##TYPE src) \ { \ u_char res; \ \ __asm __volatile( \ " lock; cmpxchg %3,%1 ; " \ "# atomic_cmpset_" #TYPE " " \ : "=@cce" (res),/* 0 */ \ "+m" (*dst), /* 1 */ \ "+a" (expect) /* 2 */ \ : "r" (src) /* 3 */ \ : "memory", "cc"); \ return (res); \ } \ \ static __inline int \ atomic_fcmpset_##TYPE(volatile u_##TYPE *dst, u_##TYPE *expect, u_##TYPE src) \ { \ u_char res; \ \ __asm __volatile( \ " lock; cmpxchg %3,%1 ; " \ "# atomic_fcmpset_" #TYPE " " \ : "=@cce" (res),/* 0 */ \ "+m" (*dst), /* 1 */ \ "+a" (*expect)/* 2 */ \ : "r" (src) /* 3 */ \ : "memory", "cc"); \ return (res); \ } ATOMIC_CMPSET(char); ATOMIC_CMPSET(short); ATOMIC_CMPSET(int); ATOMIC_CMPSET(long); which still shows dst,expect,src for the order. But a releng/13.1 crash dump log shows the name order: src, dst, expect (in #7 below): #4 0x80c1ba63 in panic (fmt=) at /usr/src/sys/kern/kern_shutdown.c:844 #5 0x810addf5 in trap_fatal (frame=0xfe00b555dae0, eva=0) at /usr/src/sys/amd64/amd64/trap.c:944 #6 #7 0x80c895cb in atomic_fcmpset_long (src=18446741877726026240, dst=, expect=) at /usr/src/sys/amd64/include/atomic.h:225 The atomic_fcmpset_long (from a mtx_lock(?) use) got a: Fatal trap 9: general protection fault while in kernel mode crash. The code was inside nfsd. ( Note: 18446741877726026240 == 0xfe00b52e9a00 ) The crash is not mine. It is a new type of example from an ongoing crash-evidence gathering session. See: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028#c147 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028#c148 === Mark Millard marklmi at yahoo.com
stable/13 missing 2 Windows Dev Kit 2023 related updates that main [so: 14] has
I had to substitute a FreeBSD EFI loader from main [so: 14] to boot the Windows Dev Kit 2023 via USB3 and a dd'd: http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/13.2/FreeBSD-13.2-STABLE-arm64-aarch64-ROCK64-20230504-7dea7445ba44-255298.img.xz When I looked as a result, the following commits that mention the Windows Dev Kit 2023 were not in stable/13 : Commit message (Expand) Author Age Files Lines * arm64: Disable PAC when booting on a Windows Dev Kit 2023 Mark Johnston 2023-04-23 1 -1/+30 * Add the fixed memory type to the pci ecam driver Andrew Turner 2023-01-18 1 -3/+20 By contrast, I did find: Commit message (Expand) Author Age Files Lines * loader.efi: make sure kernel image is executable Robert Clausecker 2023-01-23 1 -4/+4 * Add Windows Dev Kit 2023 support to if_ure Andrew Turner 2023-01-23 2 -0/+2 * Check for more XHCI ACPI IDs Andrew Turner 2023-01-23 1 -4/+7 in stable/13 . (I have not checked the correspondence of what I found missing vs. the status of the loader. There could be more involved than what I've found --and some or all of what I found missing may not be involved in the EFI loader issue.) === Mark Millard marklmi at yahoo.com
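For reference, a small sh/git sketch of doing the same sort of check from the command line instead of the cgit web listing. It is my own, assumes a src git checkout with the upstream remote named freebsd and both branches fetched, and hard-codes no commit hashes since the listing above does not show them:

# A sketch only: list commits whose messages mention the Windows Dev Kit
# 2023 and that are in main but not yet in stable/13.
git -C /usr/src log --oneline -i --grep='windows dev kit 2023' \
    freebsd/stable/13..freebsd/main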
Re: Possible regression in main causing poor performance
On Aug 18, 2023, at 19:09, Mark Millard wrote: > Glen Barber wrote on > Date: Sat, 19 Aug 2023 00:10:59 UTC : > >> I am somewhat inclined to look in the direction of ZFS here, as two >> things changed: >> >> 1) the build machine in question was recently (as in a week and a half >> ago) upgraded to the tip of main in order to ease the transition from >> this machine from building 14.x to building 15.x; >> 2) there is the recent addition of building ZFS-backed virtual machine >> and cloud images. >> >> . . . >> The first machine runs: >> # uname -a >> FreeBSD releng1.nyi.freebsd.org 14.0-CURRENT FreeBSD 14.0-CURRENT \ >> amd64 1400093 #5 main-n264224-c84617e87a70: Wed Jul 19 19:10:38 UTC 2023 > > I'm confused: > > "the build machine in question was recently (as in a week and a half > ago) upgraded to the tip of main in order to ease the transition from > this machine from building 14.x to building 15.x"? But the above > kernel is from mid July? (-aKU was not used to also get some clue > about world from the pair of 140009? that would show.) > >> Last week's snapshot builds were completed in a reasonable amount of >> time: >> >> r...@releng1.nyi:/releng/scripts-snapshot/scripts # ./thermite.sh -c >> ./builds-14.conf ; echo ^G >> 20230811-00:03:11 INFO: Creating /releng/scripts-snapshot/logs >> 20230811-00:03:11 INFO: Creating /releng/scripts-snapshot/chroots >> 20230811-00:03:12 INFO: Creating /releng/scripts-snapshot/release >> 20230811-00:03:12 INFO: Creating /releng/scripts-snapshot/ports >> 20230811-00:03:12 INFO: Creating /releng/scripts-snapshot/doc >> 20230811-00:03:13 INFO: Checking out https://git.FreeBSD.org//src.git (main) >> to /releng/scripts-snapshot/release >> [...] >> 20230811-15:11:13 INFO: Staging for ftp: 14-i386-GENERIC-snap >> 20230811-16:27:28 INFO: Staging for ftp: 14-amd64-GENERIC-snap >> 20230811-16:33:43 INFO: Staging for ftp: 14-aarch64-GENERIC-snap >> >> Overall, 17 hours, including the time to upload EC2, Vagrant, and GCE. >> >> With no changes to the system, no stale ZFS datasets laying around from >> last week (everything is a pristine environment, etc.), this week's >> builds are taking forever: > > My confusion may extend to this "no changes" status vs. the uname > output identifying the kernel is from mid July. > >> r...@releng1.nyi:/releng/scripts-snapshot/scripts # ./thermite.sh -c >> ./builds-14.conf ; echo ^G >> 20230818-00:15:44 INFO: Creating /releng/scripts-snapshot/logs >> 20230818-00:15:44 INFO: Creating /releng/scripts-snapshot/chroots >> 20230818-00:15:45 INFO: Creating /releng/scripts-snapshot/release >> 20230818-00:15:45 INFO: Creating /releng/scripts-snapshot/ports >> 20230818-00:15:45 INFO: Creating /releng/scripts-snapshot/doc >> 20230818-00:15:46 INFO: Checking out https://git.FreeBSD.org//src.git (main) >> to /releng/scripts-snapshot/release >> [...] >> 20230818-18:46:22 INFO: Staging for ftp: 14-aarch64-ROCKPRO64-snap >> 20230818-20:41:02 INFO: Staging for ftp: 14-riscv64-GENERIC-snap >> 20230818-22:54:49 INFO: Staging for ftp: 14-amd64-GENERIC-snap >> >> Note, it is just about 4 minutes past 00:00 UTC as of this writing, so >> we are about to cross well over the 24-hour mark, and cloud provider >> images have not yet even started. >> >> . . . > > In: > > https://lists.freebsd.org/archives/freebsd-current/2023-August/004314.html > ("HEADS UP: $FreeBSD$ Removed from main", Wed, 16 Aug 2023) > > Warner wrote: > > QUOTE > . . . , but there's no incremental building > with this change, . . . 
> Also: expect long build times, git fetch times, etc > after this. > END QUOTE > > Might this be contributing? How long did those two > "Checking out . . ." take? Similar time frames? > The build process and information is not available. So I looked at something I thought might have a chance of being somewhat invariant and have a limited range of types of (parallel) activity: time differences for the CHECKSUM files that have timestamps after the last *.img* timestamp, as seen via: http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/14.0/?C=M&O=D (so: most recent to oldest as displayed) First today's: CHECKSUM.SHA256-FreeBSD-14.0-ALPHA2-arm64-aarch64-20230819-77013f29d048-264841 1232 2023-Aug-19 00:26 CHECKSUM.SHA512-FreeBSD-14.0-ALPHA2-arm64-aarch64-20230819-77013f29d048-264841 1744 2023-Aug-19 00:25 CHECKSUM.SHA256-FreeBSD-14.0-ALPHA2-amd64-20230818-77013f29d048-264841 1168 2023-Aug-18 22:59 CHECKSUM.SHA512-FreeBSD-14.0-ALPHA2-amd64-20230818-7
Re: Possible regression in main causing poor performance
Has any more been learned about this? Is it still an issue? === Mark Millard marklmi at yahoo.com
Re: Possible regression in main causing poor performance
On Sep 5, 2023, at 08:58, Cy Schubert wrote: > In message <20230830204406.24fd...@slippy.cwsent.com>, Cy Schubert writes: >> In message <20230830184426.gm1...@freebsd.org>, Glen Barber writes: >>> >>> >>> On Mon, Aug 28, 2023 at 06:06:09PM -0700, Mark Millard wrote: >>>> Has any more been learned about this? Is it still an issue? >>>> =20 >>> >>> I rebooted the machine before the ALPHA3 builds with no other changes, >>> and the overall times for 14.x builds went back to normal. I do not >>> like to experiment with builders during a release cycle, but as we are >>> going to have 15.x snapshots available moving forward, I will not reboot >>> that machine next week in hopes to get some useful data. >>> >>> If my memory serves correctly, mm@ has a pending ZFS import from >>> upstream for both main and stable/14 pending. Whether or not that will >>> resolve any issue here, I do not know. >> >> Two of my poudriere builder machines have experienced different panics >> since the ZFS import two days ago. The problems have been documented on the >> -current list. > > Just an update. > > The three pull requests amotin@ pointed to did resolve all my problems. A > subsequent update which included the latest ZFS commits worked just as > well, without any new regressions. AFAIAC this problem has been resolved. > > The random email corruptions have also been resolved. > > > -- > Cheers, > Cy Schubert > FreeBSD UNIX: Web: https://FreeBSD.org > NTP: Web: https://nwtime.org > > e^(i*pi)+1=0 > > > > > 9O8 The just-above quoted line looks like a corruption to me. Otherwise, I'm just reporting more evidence from separate testing on amd64 . . . I will say that my separate-install/boot environment 10hr, 6366 port->package poudriere bulk -a prefix test of: # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #118 main-n265152-f49d6f583e9d-dirty: Mon Sep 4 14:26:56 PDT 2023 root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 150 150 did not show any deadlocks. The only oddity that I've noticed is the 1 extra message shown in: . . . [00:03:25] [32] [00:00:00] Builder starting [00:03:43] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success [00:03:43] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1 [00:05:20] [01] [00:01:37] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success 23/.p/cleaning/rdeps/gettext-runtime-0.22_1/chemtool-1.6.14_4 copy: open failed: No such file or directory [00:05:23] [01] [00:00:00] Building devel/gmake | gmake-4.3_2 [00:05:55] [02] [00:02:30] Builder started . . . I'm comfortable moving my normal environments forward to include this latest import of openzfs. The effort established a separate environment set up for doing testing of jumping to/past an openzfs import(s) in main. Too many recent imports have dangerous-to-the-file-system and/or had deadlocking issues for me to simply update to include them without first testing on separate media that does not have to stay operational. === Mark Millard marklmi at yahoo.com
main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
805d6107380, outoffp=0x811e6eb7, outoffp@entry=0xf819860a2c78, lenp=0x0, lenp@entry=0xfe0352758d50, flags=flags@entry=0, incred=0xf80e32335200, outcred=0xf80e32335200, fsize_td=0xfe03586c0720) at /usr/main-src/sys/kern/vfs_vnops.c:3085 #21 0x80c6b998 in kern_copy_file_range ( td=td@entry=0xfe03586c0720, infd=, inoffp=0xf81910c3c7c8, inoffp@entry=0x0, outfd=, outoffp=0xf819860a2c78, outoffp@entry=0x0, len=9223372036854775807, flags=0) at /usr/main-src/sys/kern/vfs_syscalls.c:4971 #22 0x80c6bab8 in sys_copy_file_range (td=0xfe03586c0720, uap=0xfe03586c0b20) at /usr/main-src/sys/kern/vfs_syscalls.c:5009 #23 0x8104bab9 in syscallenter (td=0xfe03586c0720) at /usr/main-src/sys/amd64/amd64/../../kern/subr_syscall.c:187 #24 amd64_syscall (td=0xfe03586c0720, traced=0) at /usr/main-src/sys/amd64/amd64/trap.c:1197 #25 #26 0x1ce4506d155a in ?? () Backtrace stopped: Cannot access memory at address 0x1ce44ec71e88 (kgdb) Context details follow. Absent a openzfs-2.2 in: ls -C1 /usr/share/zfs/compatibility.d/openzfs-2.* /usr/share/zfs/compatibility.d/openzfs-2.0-freebsd /usr/share/zfs/compatibility.d/openzfs-2.0-linux /usr/share/zfs/compatibility.d/openzfs-2.1-freebsd /usr/share/zfs/compatibility.d/openzfs-2.1-linux I have copied: /usr/main-src/sys/contrib/openzfs/cmd/zpool/compatibility.d/openzfs-2.2 over to: # ls -C1 /etc/zfs/compatibility.d/* /etc/zfs/compatibility.d/openzfs-2.2 and used it: # zpool get compatibility zamd64 NAMEPROPERTY VALUE SOURCE zamd64 compatibility openzfs-2.2local For reference: # zpool upgrade This system supports ZFS pool feature flags. All pools are formatted using feature flags. Some supported features are not enabled on the following pools. Once a feature is enabled the pool may become incompatible with software that does not support the feature. See zpool-features(7) for details. Note that the pool 'compatibility' feature can be used to inhibit feature upgrades. POOL FEATURE --- zamd64 redaction_list_spill which agrees with openzfs-2.2 . I did: # sysctl vfs.zfs.bclone_enabled=1 vfs.zfs.bclone_enabled: 0 -> 1 I also made a snapshot: zamd64@before-bclone-test and I then made a checkpoint. These were establshed just after the above enable. I then did a: zpool trim -w zamd64 The poudriere bulk command was: poudriere bulk -jmain-amd64-bulk_a -a where main-amd64-bulk_a has nothing prebuilt. USE_TMPFS=no is in use. No form of ALLOW_MAKE_JOBS is in use. It is a 32 builder context (32 hardware threads). For reference: # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #118 main-n265152-f49d6f583e9d-dirty: Mon Sep 4 14:26:56 PDT 2023 root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 150 150 I'll note that with openzfs-2.1-freebsd compatibility I'd previously let such a bulk -a run for about 10 hr and it had reached 6366 port->package builds. Prior to that I'd done shorter experiments with default zpool features (no explicit compatibility constraint) but vfs.zfs.bclone_enabled=0 and I'd had no problems. (I have a separate M.2 boot media just for such experiments and can reconstruct its content at will.) All these have been based on the same personal main-n265152-f49d6f583e9d-dirty system build. Unfortunately, no appropriate snapshot of main was available to avoid my personal context being involved for the system build used. Similarly, the snapshot(s) of stable/14 predate: Sun, 03 Sep 2023 . . . 
git: f789381671a3 - stable/14 - zfs: merge openzfs/zfs@32949f256 (zfs-2.2-release) into stable/14 that has required fixes for other issues. === Mark Millard marklmi at yahoo.com
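For reference, a condensed sh sketch of the test setup sequence described above. The pool name zamd64 and the file paths are specific to my context, and the zpool set command is my presumption of how the compatibility property shown by zpool get ended up with SOURCE "local":

# A condensed sketch of the steps above (adjust names for another system):
cp /usr/main-src/sys/contrib/openzfs/cmd/zpool/compatibility.d/openzfs-2.2 \
   /etc/zfs/compatibility.d/
zpool set compatibility=openzfs-2.2 zamd64
sysctl vfs.zfs.bclone_enabled=1
zfs snapshot zamd64@before-bclone-test   # the original may have been recursive
zpool checkpoint zamd64
zpool trim -w zamd64
poudriere bulk -jmain-amd64-bulk_a -a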
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
[Drat, the request to rerun my tests did not not mention the more recent change: vfs: copy_file_range() between multiple mountpoints of the same fs type and I'd not noticed on my own and ran the test without updating.] On Sep 7, 2023, at 11:02, Mark Millard wrote: > I was requested to do a test with vfs.zfs.bclone_enabled=1 and > the bulk -a build paniced (having stored 128 *.pkg files in > .building/ first): Unfortunately, rerunning my tests with this set was testing a context predating: Wed, 06 Sep 2023 . . . • git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type Martin Matuska So the information might be out of date for main and for stable/14 : I've no clue how good of a test it was. May be some of those I've cc'd would know. When I next have time, should I retry based on a more recent vintage of main that includes 969071be938c ? > # more /var/crash/core.txt.3 > . . . > Unread portion of the kernel message buffer: > panic: Solaris(panic): zfs: accessing past end of object 422/1108c16 > (size=2560 access=2560+2560) > cpuid = 15 > time = 1694103674 > KDB: stack backtrace: > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe0352758590 > vpanic() at vpanic+0x132/frame 0xfe03527586c0 > panic() at panic+0x43/frame 0xfe0352758720 > vcmn_err() at vcmn_err+0xeb/frame 0xfe0352758850 > zfs_panic_recover() at zfs_panic_recover+0x59/frame 0xfe03527588b0 > dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x97/frame > 0xfe0352758960 > dmu_brt_clone() at dmu_brt_clone+0x61/frame 0xfe03527589f0 > zfs_clone_range() at zfs_clone_range+0xa6a/frame 0xfe0352758bc0 > zfs_freebsd_copy_file_range() at zfs_freebsd_copy_file_range+0x1ae/frame > 0xfe0352758c40 > vn_copy_file_range() at vn_copy_file_range+0x11e/frame 0xfe0352758ce0 > kern_copy_file_range() at kern_copy_file_range+0x338/frame 0xfe0352758db0 > sys_copy_file_range() at sys_copy_file_range+0x78/frame 0xfe0352758e00 > amd64_syscall() at amd64_syscall+0x109/frame 0xfe0352758f30 > fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe0352758f30 > --- syscall (569, FreeBSD ELF64, copy_file_range), rip = 0x1ce4506d155a, rsp > = 0x1ce44ec71e88, rbp = 0x1ce44ec72320 --- > KDB: enter: panic > > __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 > 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct > pcpu, > (kgdb) #0 __curthread () at /usr/main-src/sys/amd64/include/pcpu_aux.h:57 > #1 doadump (textdump=textdump@entry=0) > at /usr/main-src/sys/kern/kern_shutdown.c:405 > #2 0x804a442a in db_dump (dummy=, > dummy2=, dummy3=, dummy4=) > at /usr/main-src/sys/ddb/db_command.c:591 > #3 0x804a422d in db_command (last_cmdp=, > cmd_table=, dopager=true) > at /usr/main-src/sys/ddb/db_command.c:504 > #4 0x804a3eed in db_command_loop () > at /usr/main-src/sys/ddb/db_command.c:551 > #5 0x804a7876 in db_trap (type=, code=) > at /usr/main-src/sys/ddb/db_main.c:268 > #6 0x80bb9e57 in kdb_trap (type=type@entry=3, code=code@entry=0, > tf=tf@entry=0xfe03527584d0) at /usr/main-src/sys/kern/subr_kdb.c:790 > #7 0x8104ad3d in trap (frame=0xfe03527584d0) > at /usr/main-src/sys/amd64/amd64/trap.c:608 > #8 > #9 kdb_enter (why=, msg=) > at /usr/main-src/sys/kern/subr_kdb.c:556 > #10 0x80b6aab3 in vpanic (fmt=0x82be52d6 "%s%s", > ap=ap@entry=0xfe0352758700) > at /usr/main-src/sys/kern/kern_shutdown.c:958 > #11 0x80b6a943 in panic ( > fmt=0x820aa2e8 "\312C$\201\377\377\377\377") > at /usr/main-src/sys/kern/kern_shutdown.c:894 > #12 0x82993c5b in vcmn_err (ce=, > fmt=0x82bfdd1f "zfs: 
accessing past end of object %llx/%llx (size=%u > access=%llu+%llu)", adx=0xfe0352758890) > at /usr/main-src/sys/contrib/openzfs/module/os/freebsd/spl/spl_cmn_err.c:60 > #13 0x82a84d69 in zfs_panic_recover ( > fmt=0x12 ) > at /usr/main-src/sys/contrib/openzfs/module/zfs/spa_misc.c:1594 > #14 0x829f8e27 in dmu_buf_hold_array_by_dnode (dn=0xf813dfc48978, > offset=offset@entry=2560, length=length@entry=2560, read=read@entry=0, >tag=0x82bd8175, numbufsp=numbufsp@entry=0xfe03527589bc, > dbpp=0xfe03527589c0, flags=0) > at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:543 > #15 0x829fc6a1 in dmu_buf_hold_array (os=, > object=, read=0, numbufsp=0xfe03527589bc, > dbpp=0xfe03527589c0, offset=, length=, > tag=) > at /usr/main-src/sys/contrib/openzfs/module/zfs/dmu.c:6
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 7, 2023, at 11:48, Glen Barber wrote: > On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: >> When I next have time, should I retry based on a more recent >> vintage of main that includes 969071be938c ? >> > > Yes, please, if you can. As stands, I rebooted that machine into my normal enviroment, so the after-crash-with-dump-info context is preserved. I'll presume lack of a need to preserve that context unless I hear otherwise. (But I'll work on this until later today.) Even my normal environment predates the commit in question by a few commits. So I'll end up doing a more general round of updates overall. Someone can let me know if there is a preference for debug over non-debug for the next test run. Looking at "git: 969071be938c - main", the relevant part seems to be just (white space possibly not preserved accurately): diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c index 9fb5aee6a023..4e4161ef1a7f 100644 --- a/sys/kern/vfs_vnops.c +++ b/sys/kern/vfs_vnops.c @@ -3076,12 +3076,14 @@ vn_copy_file_range(struct vnode *invp, off_t *inoffp, struct vnode *outvp, goto out; /* -* If the two vnode are for the same file system, call +* If the two vnodes are for the same file system type, call * VOP_COPY_FILE_RANGE(), otherwise call vn_generic_copy_file_range() -* which can handle copies across multiple file systems. +* which can handle copies across multiple file system types. */ *lenp = len; - if (invp->v_mount == outvp->v_mount) + if (invp->v_mount == outvp->v_mount || + strcmp(invp->v_mount->mnt_vfc->vfc_name, + outvp->v_mount->mnt_vfc->vfc_name) == 0) error = VOP_COPY_FILE_RANGE(invp, inoffp, outvp, outoffp, lenp, flags, incred, outcred, fsize_td); else That looks to call VOP_COPY_FILE_RANGE in more contexts and vn_generic_copy_file_range in fewer. The backtrace I reported involves: VOP_COPY_FILE_RANGE So it appears this change is unlikely to invalidate my test result, although failure might happen sooner if more VOP_COPY_FILE_RANGE calls happen with the newer code. That in turns means that someone may come up with some other change for me to test by the time I get around to setting up another test. Let me know if so. === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 7, 2023, at 13:07, Alexander Motin wrote: > Thanks, Mark. > > On 07.09.2023 15:40, Mark Millard wrote: >> On Sep 7, 2023, at 11:48, Glen Barber wrote: >>> On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: >>>> When I next have time, should I retry based on a more recent >>>> vintage of main that includes 969071be938c ? >>> >>> Yes, please, if you can. >> As stands, I rebooted that machine into my normal >> enviroment, so the after-crash-with-dump-info >> context is preserved. I'll presume lack of a need >> to preserve that context unless I hear otherwise. >> (But I'll work on this until later today.) >> Even my normal environment predates the commit in >> question by a few commits. So I'll end up doing a >> more general round of updates overall. >> Someone can let me know if there is a preference >> for debug over non-debug for the next test run. > > It is not unknown when some bugs disappear once debugging is enabled due to > different execution timings, but generally debug may to detect the problem > closer to its origin instead of looking on random consequences. I am only > starting to look on this report (unless Pawel or somebody beat me on it), and > don't have additional requests yet, but if you can repeat the same with debug > kernel (in-base ZFS's ZFS_DEBUG setting follows kernel's INVARIANTS), it may > give us some additional information. So I did a zpool import, rewinding to the checkpoint. (This depends on the questionable zfs doing fully as desired for this. Notably the normal environment has vfs.zfs.bclone_enabled=0 , including when it was doing this activity.) My normal environment reported no problems. Note: the earlier snapshot from my first setup was still in place since it was made just before the original checkpoint used above. However, the rewind did remove the /var/crash/ material that had been added. I did the appropriate zfs mount. I installed a debug kernel and world to the import. Again, no problems reported. I did the appropriate zfs umount. I did the appropriate zpool export. I rebooted with the test media. # sysctl vfs.zfs.bclone_enabled vfs.zfs.bclone_enabled: 1 # zpool trim -w zamd64 # zpool checkpoint zamd64 # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 main-n265188-117c54a78ccd-dirty: Tue Sep 5 21:29:53 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150 (So, before the 969071be938c vintage, same sources as for my last run but a debug build.) # poudriere bulk -jmain-amd64-bulk_a -a . . . [00:03:23] Building 34214 packages using up to 32 builders [00:03:23] Hit CTRL+t at any time to see build progress and stats [00:03:23] [01] [00:00:00] Builder starting [00:04:19] [01] [00:00:56] Builder started [00:04:20] [01] [00:00:01] Building ports-mgmt/pkg | pkg-1.20.6 [00:05:33] [01] [00:01:14] Finished ports-mgmt/pkg | pkg-1.20.6: Success [00:05:53] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1 [00:05:53] [02] [00:00:00] Builder starting . . . 
[00:05:54] [32] [00:00:00] Builder starting [00:06:11] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success [00:06:12] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1 [00:08:24] [01] [00:02:12] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success [00:08:31] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22 [00:10:06] [05] [00:04:13] Builder started [00:10:06] [05] [00:00:00] Building devel/autoconf-switch | autoconf-switch-20220527 [00:10:06] [31] [00:04:12] Builder started [00:10:06] [31] [00:00:00] Building devel/libatomic_ops | libatomic_ops-7.8.0 . . . Crashed again, with 158 *.pkg files in .building/All/ after rebooting. The crash is similar to the non-debug one. No extra output from the debug build. For reference: Unread portion of the kernel message buffer: panic: Solaris(panic): zfs: accessing past end of object 422/10b1c02 (size=2560 access=2560+2560) cpuid = 15 time = 1694127988 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe02e783b5a0 vpanic() at vpanic+0x132/frame 0xfe02e783b6d0 panic() at panic+0x43/frame 0xfe02e783b730 vcmn_err() at vcmn_err+0xeb/frame 0xfe02e783b860 zfs_panic_recover() at zfs_panic_recover+0x59/frame 0xfe02e783b8c0 dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0xb8/frame 0xfe02e783b970 dmu_brt_clone() at dmu_brt_clone+0x61/frame 0xfe02e783b9f0 zfs_clone_range() at zfs_clone_range+0xaa3/frame 0xfe02e783bbc0 zfs_freebsd_copy_file_range() at zfs_freebsd_copy_file_range+0x18a/frame 0xfe02e783bc40 vn_copy_file_range() at vn_copy_file_range+0x114/frame 0xfe02e783bce0 kern_copy_file_ra
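For reference, a minimal sh sketch of the checkpoint rewind step mentioned at the top of this message, done from the separate, normal boot environment. The exact import options are my reconstruction, not copied from the actual session:

# A sketch only: import the test pool, discarding everything written
# after the checkpoint, under a temporary altroot for inspection/repair.
zpool import --rewind-to-checkpoint -R /mnt zamd64
# ... zfs mount what is needed, install the debug kernel/world, then:
zpool export zamd64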
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
[Today's main-snapshot kernel panics as well.] On Sep 7, 2023, at 16:32, Mark Millard wrote: > On Sep 7, 2023, at 13:07, Alexander Motin wrote: > >> Thanks, Mark. >> >> On 07.09.2023 15:40, Mark Millard wrote: >>> On Sep 7, 2023, at 11:48, Glen Barber wrote: >>>> On Thu, Sep 07, 2023 at 11:17:22AM -0700, Mark Millard wrote: >>>>> When I next have time, should I retry based on a more recent >>>>> vintage of main that includes 969071be938c ? >>>> >>>> Yes, please, if you can. >>> As stands, I rebooted that machine into my normal >>> enviroment, so the after-crash-with-dump-info >>> context is preserved. I'll presume lack of a need >>> to preserve that context unless I hear otherwise. >>> (But I'll work on this until later today.) >>> Even my normal environment predates the commit in >>> question by a few commits. So I'll end up doing a >>> more general round of updates overall. >>> Someone can let me know if there is a preference >>> for debug over non-debug for the next test run. >> >> It is not unknown when some bugs disappear once debugging is enabled due to >> different execution timings, but generally debug may to detect the problem >> closer to its origin instead of looking on random consequences. I am only >> starting to look on this report (unless Pawel or somebody beat me on it), >> and don't have additional requests yet, but if you can repeat the same with >> debug kernel (in-base ZFS's ZFS_DEBUG setting follows kernel's INVARIANTS), >> it may give us some additional information. > > So I did a zpool import, rewinding to the checkpoint. > (This depends on the questionable zfs doing fully as > desired for this. Notably the normal environment has > vfs.zfs.bclone_enabled=0 , including when it was > doing this activity.) My normal environment reported > no problems. > > Note: the earlier snapshot from my first setup was > still in place since it was made just before the > original checkpoint used above. > > However, the rewind did remove the /var/crash/ > material that had been added. > > I did the appropriate zfs mount. > > I installed a debug kernel and world to the import. Again, > no problems reported. > > I did the appropriate zfs umount. > > I did the appropriate zpool export. > > I rebooted with the test media. > > # sysctl vfs.zfs.bclone_enabled > vfs.zfs.bclone_enabled: 1 > > # zpool trim -w zamd64 > > # zpool checkpoint zamd64 > > # uname -apKU > FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #74 > main-n265188-117c54a78ccd-dirty: Tue Sep 5 21:29:53 PDT 2023 > root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG > amd64 amd64 150 150 > > (So, before the 969071be938c vintage, same sources as for > my last run but a debug build.) > > # poudriere bulk -jmain-amd64-bulk_a -a > . . . > [00:03:23] Building 34214 packages using up to 32 builders > [00:03:23] Hit CTRL+t at any time to see build progress and stats > [00:03:23] [01] [00:00:00] Builder starting > [00:04:19] [01] [00:00:56] Builder started > [00:04:20] [01] [00:00:01] Building ports-mgmt/pkg | pkg-1.20.6 > [00:05:33] [01] [00:01:14] Finished ports-mgmt/pkg | pkg-1.20.6: Success > [00:05:53] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1 > [00:05:53] [02] [00:00:00] Builder starting > . . . 
> [00:05:54] [32] [00:00:00] Builder starting > [00:06:11] [01] [00:00:18] Finished print/indexinfo | indexinfo-0.3.1: Success > [00:06:12] [01] [00:00:00] Building devel/gettext-runtime | > gettext-runtime-0.22_1 > [00:08:24] [01] [00:02:12] Finished devel/gettext-runtime | > gettext-runtime-0.22_1: Success > [00:08:31] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22 > [00:10:06] [05] [00:04:13] Builder started > [00:10:06] [05] [00:00:00] Building devel/autoconf-switch | > autoconf-switch-20220527 > [00:10:06] [31] [00:04:12] Builder started > [00:10:06] [31] [00:00:00] Building devel/libatomic_ops | libatomic_ops-7.8.0 > . . . > > Crashed again, with 158 *.pkg files in .building/All/ after > rebooting. > > The crash is similar to the non-debug one. No extra output > from the debug build. > > For reference: > > Unread portion of the kernel message buffer: > panic: Solaris(panic): zfs: accessing past end of object 422/10b1c02 > (size=2560 access=2560+2560) > . . . Same world with newer snapshot main kernel that should be compatible with the world: # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURREN
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 06:52, Martin Matuska wrote: > I digged a little and was able to reproduce the panic without poudriere with > a shell script. > > You may want to increase "repeats". > The script causes the panic in dmu_buf_hold_array_by_dnode() on my VirtualBox > with the cat command on 9th iteration. > > Here is the script: > > #!/bin/sh > nl=' > ' > sed_script=s/aaa/b/ > for ac_i in 1 2 3 4 5 6 7; do > sed_script="$sed_script$nl$sed_script" > done > echo "$sed_script" 2>/dev/null | sed 99q >conftest.sed > > repeats=8 > count=0 > echo -n 0123456789 >"conftest.in" > while : > do > cat "conftest.in" "conftest.in" >"conftest.tmp" > mv "conftest.tmp" "conftest.in" > cp "conftest.in" "conftest.nl" > echo '' >> "conftest.nl" > sed -f conftest.sed < "conftest.nl" >"conftest.out" 2>/dev/null || break > diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break > count=$(($count + 1)) > echo "count: $count" > # 10*(2^10) chars as input seems more than enough > test $count -gt $repeats && break > done > rm -f conftest.in conftest.tmp conftest.nl conftest.out . . . (history removed) . . . # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #0 main-n265205-03a7c36ddbc0: Thu Sep 7 03:10:34 UTC 2023 r...@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 amd64 150 150 In my test environment with yesterday's snapshot kernel in use and with vfs.zfs.bclone_enabled=1 : # ~/bclone_panic.sh count: 1 count: 2 count: 3 count: 4 count: 5 count: 6 count: 7 count: 8 then panic: no 9. === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 15:30, Martin Matuska wrote: > I can confirm that the patch fixes the panic caused by the provided script on > my test systems. > Mark, would it be possible to try poudriere on your system with a patched > kernel? . . . On 9. 9. 2023 0:09, Alexander Motin wrote: > On 08.09.2023 09:52, Martin Matuska wrote: >> . . . > > Thank you, Martin. I was able to reproduce the issue with your script and > found the cause. > > I first though the issue is triggered by the `cp`, but it appeared to be > triggered by `cat`. It also got copy_file_range() support, but later than > `cp`. That is probably why it slipped through testing. This patch fixes it > for me: https://github.com/openzfs/zfs/pull/15251 . > > Mark, could you please try the patch? If all goes well, this will end up reporting that the poudriere bulk -a is still running but has gotten past, say, 320+ port->package builds finished (so: more than double observed so far for the panic context). Later would be a report with a larger figure. A normal run I might let go for 6000+ ports and 10 hr or so. Notes as I go . . . Patch applied, built, and installed to the test media. Also, booted: # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 150 150 Note that this is with a debug kernel (-dbg- in path and -DBG in the GENERIC* name). Also, the vintage of what it is based on has: git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type The usual sort of sequencing previously reported to get to this point. Media update starts with the rewind to the checkpoint in hopes of avoiding oddities from the later failure. . . . : [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414 Failed: 0 Skipped: 39 Ignored: 335 Fetched: 0 Tobuild: 33800 Time: 00:30:41 So 414 built and still building. More later. (It may be a while.) === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 17:03, Mark Millard wrote: > On Sep 8, 2023, at 15:30, Martin Matuska wrote: > >> I can confirm that the patch fixes the panic caused by the provided script >> on my test systems. >> Mark, would it be possible to try poudriere on your system with a patched >> kernel? > > . . . > > On 9. 9. 2023 0:09, Alexander Motin wrote: >> On 08.09.2023 09:52, Martin Matuska wrote: >>> . . . >> >> Thank you, Martin. I was able to reproduce the issue with your script and >> found the cause. >> >> I first though the issue is triggered by the `cp`, but it appeared to be >> triggered by `cat`. It also got copy_file_range() support, but later than >> `cp`. That is probably why it slipped through testing. This patch fixes it >> for me: https://github.com/openzfs/zfs/pull/15251 . >> >> Mark, could you please try the patch? > > If all goes well, this will end up reporting that the > poudriere bulk -a is still running but has gotten past, > say, 320+ port->package builds finished (so: more than > double observed so far for the panic context). Later > would be a report with a larger figure. A normal run > I might let go for 6000+ ports and 10 hr or so. > > Notes as I go . . . > > Patch applied, built, and installed to the test media. > Also, booted: > > # uname -apKU > FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 > main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 > root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG > amd64 amd64 150 150 > > Note that this is with a debug kernel (-dbg- in path and -DBG in > the GENERIC* name). Also, the vintage of what it is based on has: > > git: 969071be938c - main - vfs: copy_file_range() between multiple > mountpoints of the same fs type > > The usual sort of sequencing previously reported to get to this > point. Media update starts with the rewind to the checkpoint in > hopes of avoiding oddities from the later failure. > > . . . : > > [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: > 34588 Built: 414 Failed: 0 Skipped: 39Ignored: 335 Fetched: 0 > Tobuild: 33800 Time: 00:30:41 > > > So 414 and and still building. > > More later. (It may be a while.) > [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 Tobuild: 32059 Time: 01:42:47 and still going. (FYI: The failures are expected.) After a while I might stop it and start over with a non-debug kernel installed instead. === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 18:19, Mark Millard wrote: > On Sep 8, 2023, at 17:03, Mark Millard wrote: > >> On Sep 8, 2023, at 15:30, Martin Matuska wrote: >> >>> I can confirm that the patch fixes the panic caused by the provided script >>> on my test systems. >>> Mark, would it be possible to try poudriere on your system with a patched >>> kernel? >> >> . . . >> >> On 9. 9. 2023 0:09, Alexander Motin wrote: >>> On 08.09.2023 09:52, Martin Matuska wrote: >>>> . . . >>> >>> Thank you, Martin. I was able to reproduce the issue with your script and >>> found the cause. >>> >>> I first though the issue is triggered by the `cp`, but it appeared to be >>> triggered by `cat`. It also got copy_file_range() support, but later than >>> `cp`. That is probably why it slipped through testing. This patch fixes >>> it for me: https://github.com/openzfs/zfs/pull/15251 . >>> >>> Mark, could you please try the patch? >> >> If all goes well, this will end up reporting that the >> poudriere bulk -a is still running but has gotten past, >> say, 320+ port->package builds finished (so: more than >> double observed so far for the panic context). Later >> would be a report with a larger figure. A normal run >> I might let go for 6000+ ports and 10 hr or so. >> >> Notes as I go . . . >> >> Patch applied, built, and installed to the test media. >> Also, booted: >> >> # uname -apKU >> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 >> main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 >> root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG >> amd64 amd64 150 150 >> >> Note that this is with a debug kernel (-dbg- in path and -DBG in >> the GENERIC* name). Also, the vintage of what it is based on has: >> >> git: 969071be938c - main - vfs: copy_file_range() between multiple >> mountpoints of the same fs type >> >> The usual sort of sequencing previously reported to get to this >> point. Media update starts with the rewind to the checkpoint in >> hopes of avoiding oddities from the later failure. >> >> . . . : >> >> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: >> 34588 Built: 414 Failed: 0 Skipped: 39Ignored: 335 Fetched: 0 >> Tobuild: 33800 Time: 00:30:41 >> >> >> So 414 and and still building. >> >> More later. (It may be a while.) >> > > [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: > 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 > Tobuild: 32059 Time: 01:42:47 > > and still going. (FYI: The failures are expected.) > > After a while I might stop it and start over with a non-debug > kernel installed instead. I did ^C after 2.5 hr (with 2447 built): ^C[02:30:05] Error: Signal SIGINT caught, cleaning up and exiting [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] Queued: 34588 Built: 2447 Failed: 5 Skipped: 226 Ignored: 335 Fetched: 0 Tobuild: 31575 Time: 02:29:59 [02:30:05] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_16h31m51s [02:30:05] Cleaning up [02:38:04] Unmounting file systems Exiting with status 1 I'll switch it over to a non-debug kernel and, probably, world and setup/run another test. . . . (time goes by) . . . Hmm. This did not get sent when I wrote the above. FYI, non-debug test status: [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [parallel_build:] Queued: 34588 Built: 2547 Failed: 5 Skipped: 239 Ignored: 335 Fetched: 0 Tobuild: 31462 Time: 01:59:58 I may let it run overnight. === Mark Millard marklmi at yahoo.com
Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics
On Sep 8, 2023, at 21:54, Mark Millard wrote: > On Sep 8, 2023, at 18:19, Mark Millard wrote: > >> On Sep 8, 2023, at 17:03, Mark Millard wrote: >> >>> On Sep 8, 2023, at 15:30, Martin Matuska wrote: >>> >>>> I can confirm that the patch fixes the panic caused by the provided script >>>> on my test systems. >>>> Mark, would it be possible to try poudriere on your system with a patched >>>> kernel? >>> >>> . . . >>> >>> On 9. 9. 2023 0:09, Alexander Motin wrote: >>>> On 08.09.2023 09:52, Martin Matuska wrote: >>>>> . . . >>>> >>>> Thank you, Martin. I was able to reproduce the issue with your script and >>>> found the cause. >>>> >>>> I first though the issue is triggered by the `cp`, but it appeared to be >>>> triggered by `cat`. It also got copy_file_range() support, but later than >>>> `cp`. That is probably why it slipped through testing. This patch fixes >>>> it for me: https://github.com/openzfs/zfs/pull/15251 . >>>> >>>> Mark, could you please try the patch? >>> >>> If all goes well, this will end up reporting that the >>> poudriere bulk -a is still running but has gotten past, >>> say, 320+ port->package builds finished (so: more than >>> double observed so far for the panic context). Later >>> would be a report with a larger figure. A normal run >>> I might let go for 6000+ ports and 10 hr or so. >>> >>> Notes as I go . . . >>> >>> Patch applied, built, and installed to the test media. >>> Also, booted: >>> >>> # uname -apKU >>> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 150 #75 >>> main-n265228-c9315099f69e-dirty: Thu Sep 7 13:28:47 PDT 2023 >>> root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG >>> amd64 amd64 150 150 >>> >>> Note that this is with a debug kernel (-dbg- in path and -DBG in >>> the GENERIC* name). Also, the vintage of what it is based on has: >>> >>> git: 969071be938c - main - vfs: copy_file_range() between multiple >>> mountpoints of the same fs type >>> >>> The usual sort of sequencing previously reported to get to this >>> point. Media update starts with the rewind to the checkpoint in >>> hopes of avoiding oddities from the later failure. >>> >>> . . . : >>> >>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] >>> Queued: 34588 Built: 414 Failed: 0 Skipped: 39Ignored: 335 >>> Fetched: 0 Tobuild: 33800 Time: 00:30:41 >>> >>> >>> So 414 and and still building. >>> >>> More later. (It may be a while.) >>> >> >> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: >> 34588 Built: 2013 Failed: 2 Skipped: 179 Ignored: 335 Fetched: 0 >> Tobuild: 32059 Time: 01:42:47 >> >> and still going. (FYI: The failures are expected.) >> >> After a while I might stop it and start over with a non-debug >> kernel installed instead. > > I did ^C after 2.5 hr (with 2447 built): > > ^C[02:30:05] Error: Signal SIGINT caught, cleaning up and exiting > [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] Queued: 34588 > Built: 2447 Failed: 5 Skipped: 226 Ignored: 335 Fetched: 0 > Tobuild: 31575 Time: 02:29:59 > [02:30:05] Logs: > /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_16h31m51s > [02:30:05] Cleaning up > [02:38:04] Unmounting file systems > Exiting with status 1 > > I'll switch it over to a non-debug kernel and, probably, world > and setup/run another test. > > . . . (time goes by) . . . > > Hmm. This did not get sent when I wrote the above. 
FYI, non-debug > test status: > > [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [parallel_build:] Queued: > 34588 Built: 2547 Failed: 5 Skipped: 239 Ignored: 335 Fetched: 0 > Tobuild: 31462 Time: 01:59:58 > > I may let it run overnight. I finally stopped it at 7473 built (a little over 13 hrs elapsed): ^C[13:08:30] Error: Signal SIGINT caught, cleaning up and exiting [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [sigint:] Queued: 34588 Built: 7473 Failed: 23Skipped: 798 Ignored: 335 Fetched: 0 Tobuild: 25959 Time: 13:08:26 [13:08:30] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08
Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
kyua tests that use the: /usr/tests/sys/cddl/zfs/bin/mkfile program like so (for example): mkfile 500M /testpool.1861/bigfile.0 (which should be valid) end up with mkfile instead reporting:

Standard error:
Usage: mkfile [-nv] [e|p|t|g|m|k|b] ...

which prevents the kyua test involved from working. Turns out this is from expecting char to always be signed (so a -1 vs. 255 distinction, here in an aarch64 context):

. . .
(gdb) list
179 /* Options. */
180 while ((ch = getopt(argc, argv, "nv")) != -1) {
181 switch (ch) {
182 case 'n':
183 nofill = 1;
184 break;
185 case 'v':
(gdb) print ch
$16 = 255 '\377'
(gdb) print/x -1
$17 = 0xffffffff
(gdb) print/x ch
$18 = 0xff
. . .

With the mix of unsigned and signed it ends up being a 0xffu != 0xffffffffu comparison, which is always true. So the switch is reached as if a "-" prefix was present (when there is none). Then the "option" is classified as invalid and the usage message is produced.

Apparently no one had noticed. That, in turn, suggests a lack of inspected testing on aarch64, powerpc64, powerpc64le, armv7, powerpc, and powerpcspe. That, in turn, suggests that kyua test inspection for the likes of aarch64 is not historically a part of the release process for openzfs or for operating systems that include openzfs.

=== Mark Millard marklmi at yahoo.com
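For reference, the usual fix is simply to give getopt(3)'s return value an int to land in. A minimal sketch of the idiom (not mkfile.c itself; nofill and verbose here are just stand-ins for its real option handling) that behaves the same whether plain char is signed or unsigned:

#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	int ch;		/* int, not char: getopt(3) returns an int and -1 at the end */
	int nofill = 0;
	int verbose = 0;

	while ((ch = getopt(argc, argv, "nv")) != -1) {
		switch (ch) {
		case 'n':
			nofill = 1;
			break;
		case 'v':
			verbose = 1;
			break;
		default:
			fprintf(stderr, "usage: example [-nv]\n");
			return (1);
		}
	}
	printf("nofill=%d verbose=%d\n", nofill, verbose);
	return (0);
}

With char ch on a signed-char target the truncation happens to be harmless; on an unsigned-char target the -1 from getopt() becomes 255, the != -1 test never goes false, and the switch falls into its default/usage handling, which is exactly the misbehavior described above.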
Re: Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
On Sep 10, 2023, at 05:58, Mike Karels wrote: > On 10 Sep 2023, at 2:31, Mark Millard wrote: > >> kyua tests that use the: >> >> /usr/tests/sys/cddl/zfs/bin/mkfile >> >> program like so (for example): >> >> mkfile 500M /testpool.1861/bigfile.0 >> >> (which should be valid) end up with mkfile >> instead reporting: >> >> Standard error: >> Usage: mkfile [-nv] [e|p|t|g|m|k|b] ... >> >> which prevent the kyua test involved from working. >> >> Turns out this is from expecting char to be always >> signed (so a -1 vs. 255 distinction, here in an >> aarch64 context): >> >> . . . >> (gdb) list >> 179 /* Options. */ >> 180 while ((ch = getopt(argc, argv, "nv")) != -1) { >> 181 switch (ch) { >> 182 case 'n': >> 183 nofill = 1; >> 184 break; >> 185 case 'v': >> (gdb) print ch >> $16 = 255 '\377' >> (gdb) print/x -1 >> $17 = 0x >> (gdb) print/x ch >> $18 = 0xff >> . . . >> >> With the mix of unsigned and signed it ends up >> being a 0xffu != 0xu test, which is >> always true. > > mkfile is broken. getopt returns an int, and -1 on end. > It never returns 0xff. But mkfile declares ch as char, > which truncates the return value -1. ch is a bad (misleading) > variable name, although getopt(3) uses it as well (but declared > as int). Yep: for char being signed, the code is still wrong via the char ch use. But the observed behavior is very different than for char being used but being unsigned. In this context, consequences of the unsigned char behavioral results are observable in the kyua run results but went unnoticed. I used to run into examples of the use of unsigned char for holding the getopt result back in my powerpc days as well and dealt with upstreams for a port or 2 for getting it fixed after finding such was the source of odd behavior I'd observed. If I remember right, this is the first example of running into the specific issue in my aarch64 and armv7 time frame. > Mike > >> So the switch is reached as if a "-" prefix was >> present (that is not). Then the "option" is classified >> as invalid and the usage message is produced. >> >> Apparently no one had noticed. That, in turn, suggests a >> lack of inspected testing on aarch64, powerpc64, >> powerpc64le, armv7, powerpc, and powerpcspe. That, in >> turn, suggests that kyua test inspection for the likes >> of aarch64 is not historically a part of the release >> process for openzfs or for operating systems that include >> openzfs. > === Mark Millard marklmi at yahoo.com
Re: Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
On Sep 10, 2023, at 00:31, Mark Millard wrote: > kyua tests that use the: > > /usr/tests/sys/cddl/zfs/bin/mkfile > > program like so (for example): > > mkfile 500M /testpool.1861/bigfile.0 > > (which should be valid) end up with mkfile > instead reporting: > > Standard error: > Usage: mkfile [-nv] [e|p|t|g|m|k|b] ... > > which prevent the kyua test involved from working. > > Turns out this is from expecting char to be always > signed (so a -1 vs. 255 distinction, here in an > aarch64 context): > > . . . > (gdb) list > 179 /* Options. */ > 180 while ((ch = getopt(argc, argv, "nv")) != -1) { > 181 switch (ch) { > 182 case 'n': > 183 nofill = 1; > 184 break; > 185 case 'v': > (gdb) print ch > $16 = 255 '\377' > (gdb) print/x -1 > $17 = 0x > (gdb) print/x ch > $18 = 0xff > . . . > > With the mix of unsigned and signed it ends up > being a 0xffu != 0xu test, which is > always true. > > So the switch is reached as if a "-" prefix was > present (that is not). Then the "option" is classified > as invalid and the usage message is produced. > > Apparently no one had noticed. That, in turn, suggests a > lack of inspected testing on aarch64, powerpc64, > powerpc64le, armv7, powerpc, and powerpcspe. That, in > turn, suggests that kyua test inspection for the likes > of aarch64 is not historically a part of the release > process for openzfs or for operating systems that include > openzfs. > Looks like the mkfile.c traces back to a former port sysutils/mkfile that was unfetchable as of 2019. And, looking around, it seems the kyua zfs tests may be a FreeBSD only thing, not adopted in openzfs. So various implicit assumptions when I wrote the note do not actually hold. FreeBSD would have to do additional testing via kyua, beyond what openzfs does for testing, to discover the unsigned char related mis-behavior in the mkfile that FreeBSD's kyua tests use. Only FreeBSD variants are likely to have a similar status, not general openzfs including operating systems. === Mark Millard marklmi at yahoo.com
Re: Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
On Sep 10, 2023, at 11:21, Warner Losh wrote: > On Sun, Sep 10, 2023, 11:10 AM Mark Millard wrote: >> On Sep 10, 2023, at 00:31, Mark Millard wrote: >> >> > kyua tests that use the: >> > >> > /usr/tests/sys/cddl/zfs/bin/mkfile >> > >> > program like so (for example): >> > >> > mkfile 500M /testpool.1861/bigfile.0 >> > >> > (which should be valid) end up with mkfile >> > instead reporting: >> > >> > Standard error: >> > Usage: mkfile [-nv] [e|p|t|g|m|k|b] ... >> > >> > which prevent the kyua test involved from working. >> > >> > Turns out this is from expecting char to be always >> > signed (so a -1 vs. 255 distinction, here in an >> > aarch64 context): >> > >> > . . . >> > (gdb) list >> > 179 /* Options. */ >> > 180 while ((ch = getopt(argc, argv, "nv")) != -1) { >> > 181 switch (ch) { >> > 182 case 'n': >> > 183 nofill = 1; >> > 184 break; >> > 185 case 'v': >> > (gdb) print ch >> > $16 = 255 '\377' >> > (gdb) print/x -1 >> > $17 = 0x >> > (gdb) print/x ch >> > $18 = 0xff >> > . . . >> > >> > With the mix of unsigned and signed it ends up >> > being a 0xffu != 0xu test, which is >> > always true. >> > >> > So the switch is reached as if a "-" prefix was >> > present (that is not). Then the "option" is classified >> > as invalid and the usage message is produced. >> > >> > Apparently no one had noticed. That, in turn, suggests a >> > lack of inspected testing on aarch64, powerpc64, >> > powerpc64le, armv7, powerpc, and powerpcspe. That, in >> > turn, suggests that kyua test inspection for the likes >> > of aarch64 is not historically a part of the release >> > process for openzfs or for operating systems that include >> > openzfs. >> > >> >> Looks like the mkfile.c traces back to a former port >> sysutils/mkfile that was unfetchable as of 2019. And, >> looking around, it seems the kyua zfs tests may be a >> FreeBSD only thing, not adopted in openzfs. >> >> So various implicit assumptions when I wrote the note >> do not actually hold. >> >> FreeBSD would have to do additional testing via kyua, >> beyond what openzfs does for testing, to discover the >> unsigned char related mis-behavior in the mkfile that >> FreeBSD's kyua tests use. Only FreeBSD variants are >> likely to have a similar status, not general openzfs >> including operating systems. > > I wonder how hard ot would be to look for the char = getopt() pattern with > coccinelle > Unsure. But to be sure that the implication that I was also trying to point out is not lost: kyua testing of zfs (and more?) for aarch64 (tier 1) is apparently not being done (or at least the results are not being inspected). Similarly for armv7 and all the powerpc*'s (not tier 1's, however, so not as surprising). Side note: Via other exchanges that have been going on I learned to look in the likes of: https://ci.freebsd.org/job/FreeBSD-main-amd64-testvm/*/consoleText for what to "pkg install" for kyua test runs to use for normal runs (at least the subset compatible with architecture being tested). I'd only figured out a (large) subset previously for aarch64 and armv7. I'm not aware of there being other documentation for what is appropriate for setting up such for kyua runs. === Mark Millard marklmi at yahoo.com
Re: Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
On Sep 10, 2023, at 23:57, Dag-Erling Smørgrav wrote: > Mark Millard writes: >> I'm not aware of there being other documentation for what >> is appropriate for setting up such for kyua runs. > > https://github.com/freebsd/freebsd-ci/blob/master/scripts/build/build-test_image-head.sh#L69-L84 > Thanks for the reference that does not involve looking at CI log files. Filed away for future references. Side note . . . Turns out that tcptestsuite does not build for aarch64 do to alignment problems via packing in net/packetdrill : In file included from run_packet.c:45: In file included from ./tcp_options_iterator.h:31: ./tcp_options.h:108:2: error: field within 'struct tcp_option' is less aligned than 'union tcp_option::(anonymous at ./tcp_options.h:108:2)' and is usually due to 'struct tcp_option' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access] union { ^ --- sctp_iterator.o --- cc -O2 -pipe -mcpu=cortex-a7 -Wno-deprecated -g -fstack-protector-strong -fno-strict-aliasing -mcpu=cortex-a7 -Wall -Werror -g -c sctp_iterator.c -o sctp_iterator.o --- tcp_options.o --- cc -O2 -pipe -mcpu=cortex-a7 -Wno-deprecated -g -fstack-protector-strong -fno-strict-aliasing -mcpu=cortex-a7 -Wall -Werror -g -c tcp_options.c -o tcp_options.o --- run_packet.o --- 1 error generated. *** [run_packet.o] Error code 1 make[1]: stopped in /wrkdirs/usr/ports/net/packetdrill/work/packetdrill-aebdc35/gtests/net/packetdrill --- tcp_options.o --- In file included from tcp_options.c:25: ./tcp_options.h:108:2: error: field within 'struct tcp_option' is less aligned than 'union tcp_option::(anonymous at ./tcp_options.h:108:2)' and is usually due to 'struct tcp_option' being packed, which can lead to unaligned accesses [-Werror,-Wunaligned-access] union { ^ 1 error generated. *** [tcp_options.o] Error code 1 make[1]: stopped in /wrkdirs/usr/ports/net/packetdrill/work/packetdrill-aebdc35/gtests/net/packetdrill 2 errors === Mark Millard marklmi at yahoo.com
Re: Looks like the kyua zfs tests likely are not used on aarch64 or other contexts with unsigned char
On Sep 11, 2023, at 00:03, Mark Millard wrote: > On Sep 10, 2023, at 23:57, Dag-Erling Smørgrav wrote: > >> Mark Millard writes: >>> I'm not aware of there being other documentation for what >>> is appropriate for setting up such for kyua runs. >> >> https://github.com/freebsd/freebsd-ci/blob/master/scripts/build/build-test_image-head.sh#L69-L84 >> > > Thanks for the reference that does not involve looking at > CI log files. Filed away for future references. > > > Side note . . . > > Turns out that tcptestsuite does not build for aarch64 > do to alignment problems via packing in net/packetdrill : > > In file included from run_packet.c:45: > In file included from ./tcp_options_iterator.h:31: > ./tcp_options.h:108:2: error: field within 'struct tcp_option' is less > aligned than 'union tcp_option::(anonymous at ./tcp_options.h:108:2)' and is > usually due to 'struct tcp_option' being packed, which can lead to unaligned > accesses [-Werror,-Wunaligned-access] > union { > ^ > --- sctp_iterator.o --- > cc -O2 -pipe -mcpu=cortex-a7 Looks like I messed up and reported an armv7 context. aarch64 built net/packetdrill and net/tcptestsuite just fine. Sorry for the noise. > -Wno-deprecated -g -fstack-protector-strong -fno-strict-aliasing > -mcpu=cortex-a7 -Wall -Werror -g -c sctp_iterator.c -o sctp_iterator.o > --- tcp_options.o --- > cc -O2 -pipe -mcpu=cortex-a7 -Wno-deprecated -g -fstack-protector-strong > -fno-strict-aliasing -mcpu=cortex-a7 -Wall -Werror -g -c tcp_options.c -o > tcp_options.o > --- run_packet.o --- > 1 error generated. > *** [run_packet.o] Error code 1 > > make[1]: stopped in > /wrkdirs/usr/ports/net/packetdrill/work/packetdrill-aebdc35/gtests/net/packetdrill > --- tcp_options.o --- > In file included from tcp_options.c:25: > ./tcp_options.h:108:2: error: field within 'struct tcp_option' is less > aligned than 'union tcp_option::(anonymous at ./tcp_options.h:108:2)' and is > usually due to 'struct tcp_option' being packed, which can lead to unaligned > accesses [-Werror,-Wunaligned-access] > union { > ^ > 1 error generated. > *** [tcp_options.o] Error code 1 > > make[1]: stopped in > /wrkdirs/usr/ports/net/packetdrill/work/packetdrill-aebdc35/gtests/net/packetdrill > 2 errors > === Mark Millard marklmi at yahoo.com
sys/net/if_lagg_test:status_stress can lead to use-after-free in main (both before and after stable/14 was created), at least on aarch64
See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=273081#c5 and the backtrace in the prior comment. The test context is aarch64. Kyle Evans provided a kgdb patch for devel/gdb for aarch64 that finally let me track this down to the level of detail of how to interpret the reported register values vs. the code that was using them. I will say that I've not managed to produce the crash with 14.0-BETA1. But I have produced the crash in my personal non-debug kernel builds and with the main snapshots dd'd to media, booted, and used. === Mark Millard marklmi at yahoo.com
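For anyone wanting to try to reproduce it on their own hardware: assuming the test suite is installed under /usr/tests, the specific case named in the subject can be run on its own with something like the following (a sketch; adjust to however you normally drive kyua):

# cd /usr/tests/sys/net
# kyua test if_lagg_test:status_stress
# kyua report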
FYI: RPi4B via ACPI style boot suggests "armv8crypto0: CPU lacks AES instructions" can lead to a hung up boot sequence in 14.0-BETA2
See: https://lists.freebsd.org/archives/freebsd-arm/2023-September/003071.html for details. (It is not my activity.) === Mark Millard marklmi at yahoo.com
Re: How to Boot FreeBSD Using pftf/RPi4 UEFI (I got: "panic: ram_attach: resource 5 failed to attach" from FreeBSD-14.0-BETA3)
[Mitchell H.: I think this has exposed a possibly general issue not specific to RPi*'s, despite the UEFI/ACPI booting of RPi*'s not being officially supported. See the "BOOT -V RELATED MATERIAL" section towards the end, skipping the earlier explorations.] On Sep 22, 2023, at 08:39, Mark Millard wrote: > On Sep 22, 2023, at 01:02, ykla wrote: > >> But who test FreeBSD-14.0-BETA2-arm64-aarch64-disc1.iso on UEFI on rpi4b? > > I might get to this this weekend or tonight (local time). > > But, as I do not normally deal with FreeBSD-14.0-*-arm64-aarch64-disc1.iso > for RPi4B's, could you list step by step instructions so that I'm sure to > test what you tested in reasonable detail? Please make the step-by-step > instructions be for having the serial console working. > > (My use of any FreeBSD-*.iso has been historically rare.) > > Most likely FreeBSD-14.0-BETA3-arm64-aarch64-disc1.iso will be available > by the time I get to this. So that is likely what I'd test. I'll also note that FreeBSD makes no claim to support pftf/RPi4 UEFI : official support is via the U-Boot port that is used for the aarch64 RPI specific images. I'll note that the RPi4B here is a 8 GiByte one, a modern "C0T" one that does not require the special bounce buffering that was used to avoid the wrapper logic error that limited some address ranges in "B0T" parts for specific types of activity. (But, bounce buffering should still work.) As for attempting to use pftf/RPi4 UEFI . . . (I've no clue how well this matches your procedure.) Prepare microsd card to have just pftf/RPi4 UEFI : # gpart show -p da3 => 63 62521281da3 MBR (30G) 63 40897 - free - (20M) 40960102400 da3s1 fat32lba (50M) 143360 62377984 - free - (30G) # mount -onoatime -tmsdosfs /dev/da3s1 /mnt # ls -Tloa /mnt/ total 9 drwxr-xr-x 1 root wheel - 16384 Dec 31 16:00:00 1979 . drwxr-xr-x 63 root wheel uarch70 Sep 21 10:15:27 2023 .. 
# tar -xpf RPi4_UEFI_Firmware_v1.35.zip -C /mnt/ RPI_EFI.fd: Can't set user=1001/group=123 for RPI_EFI.fd: Invalid argument bcm2711-rpi-4-b.dtb: Can't set user=1001/group=123 for bcm2711-rpi-4-b.dtb: Invalid argument bcm2711-rpi-400.dtb: Can't set user=1001/group=123 for bcm2711-rpi-400.dtb: Invalid argument bcm2711-rpi-cm4.dtb: Can't set user=1001/group=123 for bcm2711-rpi-cm4.dtb: Invalid argument config.txt: Can't set user=1001/group=123 for config.txt: Invalid argument fixup4.dat: Can't set user=1001/group=123 for fixup4.dat: Invalid argument start4.elf: Can't set user=1001/group=123 for start4.elf: Invalid argument overlays/: Can't set user=1001/group=123 for overlays: Invalid argument overlays/upstream-pi4.dtbo: Can't set user=1001/group=123 for overlays/upstream-pi4.dtbo: Invalid argument overlays/miniuart-bt.dtbo: Can't set user=1001/group=123 for overlays/miniuart-bt.dtbo: Invalid argument Readme.md: Can't set user=1001/group=123 for Readme.md: Invalid argument firmware/: Can't set user=1001/group=123 for firmware: Invalid argument firmware/Readme.txt: Can't set user=1001/group=123 for firmware/Readme.txt: Invalid argument firmware/brcm/: Can't set user=1001/group=123 for firmware/brcm: Invalid argument firmware/brcm/brcmfmac43455-sdio.txt: Can't set user=1001/group=123 for firmware/brcm/brcmfmac43455-sdio.txt: Invalid argument firmware/brcm/brcmfmac43455-sdio.clm_blob: Can't set user=1001/group=123 for firmware/brcm/brcmfmac43455-sdio.clm_blob: Invalid argument firmware/brcm/brcmfmac43455-sdio.bin: Can't set user=1001/group=123 for firmware/brcm/brcmfmac43455-sdio.bin: Invalid argument firmware/brcm/brcmfmac43455-sdio.Raspberry: Can't set user=1001/group=123 for firmware/brcm/brcmfmac43455-sdio.Raspberry: Invalid argument firmware/LICENCE.txt: Can't set user=1001/group=123 for firmware/LICENCE.txt: Invalid argument tar: Error exit delayed from previous errors. # find -s /mnt/ -print /mnt/ /mnt/RPI_EFI.fd /mnt/Readme.md /mnt/bcm2711-rpi-4-b.dtb /mnt/bcm2711-rpi-400.dtb /mnt/bcm2711-rpi-cm4.dtb /mnt/config.txt /mnt/firmware /mnt/firmware/LICENCE.txt /mnt/firmware/Readme.txt /mnt/firmware/brcm /mnt/firmware/brcm/brcmfmac43455-sdio.Raspberry /mnt/firmware/brcm/brcmfmac43455-sdio.bin /mnt/firmware/brcm/brcmfmac43455-sdio.clm_blob /mnt/firmware/brcm/brcmfmac43455-sdio.txt /mnt/fixup4.dat /mnt/overlays /mnt/overlays/miniuart-bt.dtbo /mnt/overlays/upstream-pi4.dtbo /mnt/start4.elf # umount /mnt/ Prepare separate USB3 media to hold the *.iso content: # dd if=FreeBSD-14.0-BETA3-arm64-aarch64-disc1.iso of=/dev/da0 bs=1m conv=fsync,sync status=progress 855638016 bytes (856 MB, 816 MiB) transferred 7.097s, 121 MB/s 933+0 records in 933+0 records out 978321408 bytes transferred in 7.956494 secs (122958854 bytes/sec) Note: the efi part
RE: nvd->nda switch and blocksize changes for ZFS
Frank Behrens wrote on Date: Sat, 23 Sep 2023 16:31:40 UTC :

> I created a zpool with a FreeBSD-14.0-CURRENT on February. With
> 15.0-CURRENT/14.0-STABLE from now I get the message:
>
> status: One or more devices are configured to use a non-native block size.
> Expect reduced performance.
> action: Replace affected devices with devices that support the
> configured block size, or migrate data to a properly configured pool.
>
> NAME         STATE   READ WRITE CKSUM
> zsys         ONLINE     0     0     0
>   raidz1-0   ONLINE     0     0     0
>     nda0p4   ONLINE     0     0     0  block size: 4096B configured, 16384B native
>     nda1p4   ONLINE     0     0     0  block size: 4096B configured, 16384B native
>     nda2p4   ONLINE     0     0     0  block size: 4096B configured, 16384B native
>
> I use:
> nda0:
> nda0: nvme version 1.4
> nda0: 953869MB (1953525168 512 byte sectors)
>
> I cannot imagine, that the native blocksize changed. Do I really expect
> a reduced performance?
> Is it advisable to switch back to nvd?

Looking at: https://www.techpowerup.com/ssd-specs/samsung-980-1-tb.d58 it reports (in different places on the page):

QUOTE
Page Size: 16 KB
Notes
NAND Die: A Dual-plane Die with 2 sub-planes with 8 KiB pages in order to improve performance through paralellism.
Endurance: Could be from 1.500 to 3.000 P.E.C. depending on NAND binning
END QUOTE

That "A Dual-plane Die with 2 sub-planes with 8 KiB pages", for a total of 16 KB, does suggest to me that the new messages have a chance of being correct about there being a tradeoff. (But I'm no expert in the area.)

=== Mark Millard marklmi at yahoo.com
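For anyone wanting to see what their own pool and vdevs ended up with, a sketch of commands I would expect to work on a 14.x/15.x system (the pool name zsys below is just taken from the quoted output; substitute your own):

# sysctl vfs.zfs.min_auto_ashift
# zpool get ashift zsys
# zdb -C zsys | grep ashift

The first is the minimum ashift ZFS will auto-select for newly added vdevs, the second is the pool-wide property (0 meaning auto-selection was used at vdev creation), and the zdb output shows the ashift actually recorded per vdev.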
15 & 14: ram_attach vs. its using regions_to_avail vs. "bus_alloc_resource" can lead to: panic("ram_attach: resource %d failed to attach", rid)
ram_attach is based on regions_to_avail but that is a problem for its later bus_alloc_resource use --and that can lead to: panic("ram_attach: resource %d failed to attach", rid); Unfortunately, the known example is use of EDK2 on RPi4B class systems, not what is considered the supported way. The panic happens for main [so: 15] and will happen once the cortex-a72 handling in 14.0-* is in a build fixed by: • git: 906bcc44641d - releng/14.0 - arm64: Fix errata workarounds that depend on smccc Andrew Turner The lack of the fix leads to an earlier panic as stands. sys/kern/subr_physmem.c 's regions_to_avail is based on ignoring phys_avail and using only hwregions and exregions. In other words, in part: * Initially dump_avail and phys_avail are identical. Boot time memory * allocations remove extents from phys_avail that may still be included * in dumps. This means that early, dedicated memory allocations are treated as available for general use by regions_to_avail . The distinction is visible in the boot -v output in that: real memory = 3138154496 (2992 MB) Physical memory chunk(s): 0x20 - 0x002b7f, 727711744 bytes (177664 pages) 0x002ce3a000 - 0x003385, 111304704 bytes (27174 pages) 0x00338c - 0x00338c6fff, 28672 bytes (7 pages) 0x0033a3 - 0x0036ef, 55377920 bytes (13520 pages) 0x00372e - 0x003b2f, 67239936 bytes (16416 pages) 0x004000 - 0x00bb3dcfff, 2067648512 bytes (504797 pages) avail memory = 3027378176 (2887 MB) does not list the wider: 0x004000 - 0x00bfff because of phys_avail . But the earlier dump based on hwregions and exregions shows: Physical memory chunk(s): 0x001d - 0x001e, 0 MB ( 32 pages) 0x0020 - 0x338c6fff, 822 MB ( 210631 pages) 0x3392 - 0x3b2f, 121 MB ( 31200 pages) 0x4000 - 0xbfff, 2048 MB ( 524288 pages) Excluded memory regions: 0x001d - 0x001e, 0 MB ( 32 pages) NoAlloc 0x2b80 - 0x2ce39fff,22 MB ( 5690 pages) NoAlloc 0x3386 - 0x338b, 0 MB ( 96 pages) NoAlloc 0x3392 - 0x33a2, 1 MB (272 pages) NoAlloc 0x36f0 - 0x372d, 3 MB (992 pages) NoAlloc which indicates: 0x4000 - 0xbfff is available as far as it is concerned. (Note some code works/displays in terms of: 0x4000 - 0xc000 instead.) For aarch64 , sys/arm64/arm64/nexus.c has a nexus_alloc_resource that is used as bus_alloc_resource . It ends up rejecting the RPi4B boot via using the result of the call in ram_attach: if (bus_alloc_resource(dev, SYS_RES_MEMORY, &rid, start, end, end - start, 0) == NULL) panic("ram_attach: resource %d failed to attach", rid); as shown by the just-prior start/end pair sequence messages: ram0: reserving memory region: 20-2b80 ram0: reserving memory region: 2ce3a000-3386 ram0: reserving memory region: 338c-338c7000 ram0: reserving memory region: 33a3-36f0 ram0: reserving memory region: 372e-3b30 ram0: reserving memory region: 4000-c000 panic: ram_attach: resource 5 failed to attach I do not see anything about this that looks inherently RPi* specific for possibly ending up with an analogous panic. So I expect the example is sufficient context to identify a problem is present, despite EDK2 use not being normal for RPi4B's and the like as far as FreeBSD is concerned. === Mark Millard marklmi at yahoo.com
RE: Base libc++ missing symbol
Joel Bodenmann wrote on Date: Mon, 02 Oct 2023 20:00:29 UTC : > It seems like I finally managed to hose a FreeBSD system. > The machine in question is my workstation at home. It has been running > stable/13 without any problems. Yesterday I've updated to > ef295f69abbffb3447771a30df6906ca56a5d0c0 and since then I'm getting an > undefined symbol on anything using Qt: > > ld-elf.so.1: /usr/local/lib/qt5/libQt5Widgets.so.5: Undefined symbol > "_ZTVNSt3__13pmr25monotonic_buffer_resourceE" > > Unless I'm missing something, it would seem like my base libc++ > is missing the pmr::monotonic_buffer_resource symbol. I do not have a 13.2 context, so you may want to run the analogous steps in your context for confirming/denying the below applies. # llvm-cxxfilt _ZTVNSt3__13pmr25monotonic_buffer_resourceE vtable for std::__1::pmr::monotonic_buffer_resource Using the example "Run this code" source from: https://en.cppreference.com/w/cpp/memory/monotonic_buffer_resource # c++ -std=c++17 -pedantic -O2 monotonic_buffer_resource.cpp # objdump -x a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 00204160 g O .bss.rel.ro 0038 _ZTVNSt3__13pmr25monotonic_buffer_resourceE # nm a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 00204160 B _ZTVNSt3__13pmr25monotonic_buffer_resourceE # ./a.out t1 (default std alloc): 0.491 sec; t1/t1: 1.000 t2 (default pmr alloc): 0.541 sec; t1/t2: 0.906 t3 (pmr alloc no buf): 0.188 sec; t1/t3: 2.616 t4 (pmr alloc and buf): 0.155 sec; t1/t4: 3.172 Note that the vtable is in the a.out instead of being from a library. It is global but is in the a.out .bss.rel.ro <http://bss.rel.ro/> in the example and is defined. > At first I thought I might have messed up on installworld but rolling > back to the previous boot environment and then performing the same > procedure again lead to the same outcome. If the above works similarly in your context, then I expect that the issue is on the qt5 or port side of things, not the system libraries/headers. As I understand, clang++ 16 is the first vintage with this directly supported, instead of being just in the experimental category/area for libc++. May be tracking that transition is at issue. For reference: # c++ -v FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.git llvmorg-16.0.6-0-g7cbf1a259152) Target: x86_64-unknown-freebsd15.0 Thread model: posix InstalledDir: /usr/bin # uname -apKU FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT #124 main-n265447-e5236d25f2c0-dirty: Thu Sep 21 09:06:08 PDT 2023 root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 151 151 > Any ideas or wild guesses? Anything obvious I'm missing here? > > uname -a > FreeBSD beefy02 13.2-STABLE FreeBSD 13.2-STABLE > stable/13-n256443-ef295f69abbf GENERIC amd64 > > freebsd-version -kru > 13.2-STABLE > 13.2-STABLE > 13.2-STABLE > > clang --version > FreeBSD clang version 16.0.6 > (https://github.com/llvm/llvm-project.git > llvmorg-16.0.6-0-g7cbf1a259152) Target: x86_64-unknown-freebsd13.2 > Thread model: posix > InstalledDir: /usr/bin === Mark Millard marklmi at yahoo.com
Re: Base libc++ missing symbol
On Oct 2, 2023, at 15:56, Mark Millard wrote: > Joel Bodenmann wrote on > Date: Mon, 02 Oct 2023 20:00:29 UTC : > >> It seems like I finally managed to hose a FreeBSD system. >> The machine in question is my workstation at home. It has been running >> stable/13 without any problems. Yesterday I've updated to >> ef295f69abbffb3447771a30df6906ca56a5d0c0 and since then I'm getting an >> undefined symbol on anything using Qt: >> >> ld-elf.so.1: /usr/local/lib/qt5/libQt5Widgets.so.5: Undefined symbol >> "_ZTVNSt3__13pmr25monotonic_buffer_resourceE" >> >> Unless I'm missing something, it would seem like my base libc++ >> is missing the pmr::monotonic_buffer_resource symbol. > > I do not have a 13.2 context, so you may want to run the > analogous steps in your context for confirming/denying > the below applies. > > # llvm-cxxfilt _ZTVNSt3__13pmr25monotonic_buffer_resourceE > vtable for std::__1::pmr::monotonic_buffer_resource > > Using the example "Run this code" source from: > > https://en.cppreference.com/w/cpp/memory/monotonic_buffer_resource > > # c++ -std=c++17 -pedantic -O2 monotonic_buffer_resource.cpp > > # objdump -x a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE > 00204160 g O .bss.rel.ro 0038 > _ZTVNSt3__13pmr25monotonic_buffer_resourceE > > # nm a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE > 00204160 B _ZTVNSt3__13pmr25monotonic_buffer_resourceE > > # ./a.out > t1 (default std alloc): 0.491 sec; t1/t1: 1.000 > t2 (default pmr alloc): 0.541 sec; t1/t2: 0.906 > t3 (pmr alloc no buf): 0.188 sec; t1/t3: 2.616 > t4 (pmr alloc and buf): 0.155 sec; t1/t4: 3.172 > > Note that the vtable is in the a.out instead of being from > a library. It is global but is in the a.out .bss.rel.ro <http://bss.rel.ro/> > in > the example and is defined. > >> At first I thought I might have messed up on installworld but rolling >> back to the previous boot environment and then performing the same >> procedure again lead to the same outcome. > > If the above works similarly in your context, then I expect > that the issue is on the qt5 or port side of things, not the > system libraries/headers. > > As I understand, clang++ 16 is the first vintage with this > directly supported, instead of being just in the experimental > category/area for libc++. May be tracking that transition is > at issue. > > For reference: > > # c++ -v > FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.git > llvmorg-16.0.6-0-g7cbf1a259152) > Target: x86_64-unknown-freebsd15.0 > Thread model: posix > InstalledDir: /usr/bin > > # uname -apKU > FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT #124 > main-n265447-e5236d25f2c0-dirty: Thu Sep 21 09:06:08 PDT 2023 > root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG > amd64 amd64 151 151 > >> Any ideas or wild guesses? Anything obvious I'm missing here? 
>> >> uname -a >> FreeBSD beefy02 13.2-STABLE FreeBSD 13.2-STABLE >> stable/13-n256443-ef295f69abbf GENERIC amd64 >> >> freebsd-version -kru >> 13.2-STABLE >> 13.2-STABLE >> 13.2-STABLE >> >> clang --version >> FreeBSD clang version 16.0.6 >> (https://github.com/llvm/llvm-project.git >> llvmorg-16.0.6-0-g7cbf1a259152) Target: x86_64-unknown-freebsd13.2 >> Thread model: posix >> InstalledDir: /usr/bin > Given Dimitry Andric's notes: # objdump -x /lib/libc++.so.1 | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 001006d8 g O .data.rel.ro 0038 _ZTVNSt3__13pmr25monotonic_buffer_resourceE # nm /lib/libc++.so.1 | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 001006d8 D _ZTVNSt3__13pmr25monotonic_buffer_resourceE So /lib/libc++.so.1 has a global symbol naming initialized data for this in my context. Reminder for the a.out: # objdump -x a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 00204160 g O .bss.rel.ro 0038 _ZTVNSt3__13pmr25monotonic_buffer_resourceE # nm a.out | grep _ZTVNSt3__13pmr25monotonic_buffer_resourceE 00204160 B _ZTVNSt3__13pmr25monotonic_buffer_resourceE My original thinking makes no sense for this. Sorry for the noise. The procedure of seeing if the a.out is produced without complaint might still be useful. === Mark Millard marklmi at yahoo.com
Re: Base libc++ missing symbol
wrote on Date: Sun, 08 Oct 2023 18:13:16 UTC : > > The procedure of seeing if the a.out is produced without complaint > > might still be useful. > > The program compiles & links fine, but then also fails to run: > > ld-elf.so.1: Undefined symbol "_ZTVNSt3__13pmr25monotonic_buffer_resourceE" > referenced from COPY relocation in /usr/home/jbo/junk/a.out Well, for stable/13 's recent snapshot, freshly dd'd to USB3 media, so an official build, not a personal one that might be odd in some way: # uname -apKU FreeBSD generic 13.2-STABLE FreeBSD 13.2-STABLE stable/13-n256505-2464d8c5e296 GENERIC arm64 aarch64 1302508 1302508 (So, after 2023-Oct-01's ef295f69abbf that you originally referenced: 2023-Oct-04's 2464d8c5e296.) # c++ -v FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.git llvmorg-16.0.6-0-g7cbf1a259152) Target: aarch64-unknown-freebsd13.2 Thread model: posix InstalledDir: /usr/bin # c++ -std=c++17 -pedantic -O2 monotonic_buffer_resource.cpp # ./a.out t1 (default std alloc): 1.827 sec; t1/t1: 1.000 t2 (default pmr alloc): 1.818 sec; t1/t2: 1.005 t3 (pmr alloc no buf): 0.920 sec; t1/t3: 1.986 t4 (pmr alloc and buf): 0.606 sec; t1/t4: 3.015 The example is from in an aarch64 context. It does not agree with your report. > I made no progress on this. So far I've never reproduced your problem or anything like it. (I prefer testing official builds for problem isolation. If only my personal builds fail, then it is likely my build's problem.) > I have reinstalled world twice (from different > commits) and I re-installed all packages multiple times (also from different > ports tree commits). I suggest trying the most recent stable/13 snapshot build at the time of the experiment. No packages are used/needed for the monotonic_buffer_resource.cpp test. This fits well with using a snapshot context for such a test. > Any other wild ideas on how to fix this? I've no evidence about your stable/13 build/install. But the official snapshot that I tried worked just fine. > None of my other machines have any > issues whatsoever running on the same or similar stable/13 commit and using > the same poudriere repository. That, with my results, tends to suggest you have one odd ball context that has a problematical FreeBSD build/install. Again, I've no evidence to work with relative to that build/install. > This is certainly not Qt5 related. I run into the exact same issue with > anything that uses Qt6. > Furthermore, the test program we built experiences > the same issue without any involvement of the Qt libraries. There was no problem for my testing of monotonic_buffer_resource.cpp via the recent official snapshot build of stable/13 . I've not tried to test Qt5 or Qt6, sticking with the simpler/smaller context that you also report as failing in the odd context. I suggest avoiding Qt5/Qt6 testing until you have a context with the monotonic_buffer_resource.cpp test working. === Mark Millard marklmi at yahoo.com
RE: git: d2025992ab68 - releng/14.0 - release: update releng/14.0 from BETA to RC
Glen Barber wrote on Date: Fri, 13 Oct 2023 00:00:10 UTC : > The branch releng/14.0 has been updated by gjb: > > URL: > https://cgit.FreeBSD.org/src/commit/?id=d2025992ab6852d2a9ace62006e3a3ffa067364b > > commit d2025992ab6852d2a9ace62006e3a3ffa067364b > Author: Glen Barber > AuthorDate: 2023-10-12 23:55:33 + > Commit: Glen Barber > CommitDate: 2023-10-12 23:55:33 + > > release: update releng/14.0 from BETA to RC I'll note that today's: https://github.com/openzfs/zfs/commit/2bba9fd479f5 is another openzfs data corruption fix, this time involving TRIMs vs. metaslab allocations. === Mark Millard marklmi at yahoo.com
RE: freebsd-update 12.3 to 14.0RC1 takes 12-24 hours (block cloning regression)
Kevin Bowling wrote on Date: Tue, 17 Oct 2023 16:40:37 UTC : > I have two systems with a zpool 2x2 mirror on 7.2k RPM disks. One > system also has a flash SLOG. > > The flash SLOG system took around 12 hours to complete freebsd-update > from 13.2 to 14.0-RC1. The system without the SLOG took nearly 24 > hours. This was the result of ~50k patches, and ~10k files from > freebsd-update and a very pathological 'install' command performance. > > 'ps auxww | grep install': > root 52225 0.0 0.0 12852 2504 0 D+ 20:55 0:00.00 > install -S -o 0 -g 0 -m 0644 > b6850914127c27fe192a41387f5cec04a1d927e6605ff09e8fd88dcd74fdec9d > ///usr/src/sys/netgraph/ng_vlan.h > root 68042 0.0 0.0 13580 3648 0 I+ 02:24 0:01.14 > /bin/sh /usr/sbin/freebsd-update install root 69946 > 0.0 0.0 13580 3632 0 S+ 02:24 0:15.65 /bin/sh > /usr/sbin/freebsd-update install > > 'control+t on freebsd-update': > > load: 0.16 cmd: install 97128 [tx->tx_sync_done_cv] 0.67r 0.00u 0.00s > 0% 2440k > mi_switch+0xc2 _cv_wait+0x113 txg_wait_synced_impl+0xb9 > txg_wait_synced+0xb dmu_offset_next+0x77 zfs_holey+0x137 zfs_fre > ebsd_ioctl+0x4f vn_generic_copy_file_range+0x64b > kern_copy_file_range+0x327 sys_copy_file_range+0x78 > amd64_syscall+0x10c > fast_syscall_common+0xf8 > > I spoke with mjg about this and because my pools do not have block > cloning enabled, copy_file_range turns into a massive pessimization in > 'install'. Block cloning is new. So the past is sort of like Block cloning not being enabled now. This leads me to wonder: prior to block cloning existing, what would analogous times have been like instead of 12 hrs and 24 hrs? (Not that analogous would be easy to identify in history or test now.) Depending on the results, my next question might have been: "What happened for block cloning being disabled now to make it worse than before block cloning existed?" > He suggested a workaround of 'sysctl > vfs.zfs.dmu_offset_next_sync=0' but we should probably sort this out > for 14.0-RELEASE. === Mark Millard marklmi at yahoo.com
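For anyone hitting the same pathological freebsd-update times before a real fix shows up, the mentioned workaround can be applied immediately:

# sysctl vfs.zfs.dmu_offset_next_sync=0

and, to keep it across reboots, the same assignment can be added as a line in /etc/sysctl.conf (a sketch of the usual mechanism, nothing specific to this issue):

vfs.zfs.dmu_offset_next_sync=0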
FYI: 13.2-STABLE stable/13-n256634-c4dfacd0b3c3 snapshot's send mail notices
For reasons of investigating a 13.2-STABLE related bugzilla report I'd dd'd and booted the snapshot that results in: # uname -apKU FreeBSD generic 13.2-STABLE FreeBSD 13.2-STABLE stable/13-n256634-c4dfacd0b3c3 GENERIC arm64 aarch64 1302508 1302508 I noticed the following messages on the console: Oct 27 03:01:00 generic sendmail[2521]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:02:00 generic sendmail[2521]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:02:01 generic sendmail[2628]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:03:02 generic sendmail[2628]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:03:02 generic sendmail[2633]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:04:02 generic sendmail[2633]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:04:05 generic sendmail[2787]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:04:05 generic sendmail[2832]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:04:05 generic sendmail[2833]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:04:05 generic sendmail[2847]: My unqualified host name (generic) unknown; sleeping for retry Oct 27 03:05:05 generic sendmail[2787]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:05:05 generic sendmail[2833]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:05:05 generic sendmail[2832]: unable to qualify my own domain name (generic) -- using short name Oct 27 03:05:05 generic sendmail[2847]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:01:00 generic sendmail[4605]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:02:00 generic sendmail[4605]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:02:01 generic sendmail[4713]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:03:01 generic sendmail[4713]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:03:01 generic sendmail[4718]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:04:01 generic sendmail[4718]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:04:03 generic sendmail[4867]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:04:03 generic sendmail[4913]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:04:03 generic sendmail[4912]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:04:03 generic sendmail[4927]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 03:05:03 generic sendmail[4867]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:05:03 generic sendmail[4913]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:05:03 generic sendmail[4912]: unable to qualify my own domain name (generic) -- using short name Oct 28 03:05:03 generic sendmail[4927]: unable to qualify my own domain name (generic) -- using short name Oct 28 04:15:21 generic sendmail[5154]: My unqualified host name (generic) unknown; sleeping for retry Oct 28 04:16:21 generic sendmail[5154]: unable to qualify my own domain name (generic) -- using short name Oct 29 03:01:00 generic sendmail[6807]: My unqualified host name (generic) unknown; sleeping for retry Oct 
29 03:02:00 generic sendmail[6807]: unable to qualify my own domain name (generic) -- using short name Oct 29 03:02:01 generic sendmail[6916]: My unqualified host name (generic) unknown; sleeping for retry Oct 29 03:03:01 generic sendmail[6916]: unable to qualify my own domain name (generic) -- using short name Oct 29 03:03:01 generic sendmail[6921]: My unqualified host name (generic) unknown; sleeping for retry Oct 29 03:04:01 generic sendmail[6921]: unable to qualify my own domain name (generic) -- using short name Oct 29 03:04:03 generic sendmail[7070]: My unqualified host name (generic) unknown; sleeping for retry Oct 29 03:04:03 generic sendmail[7115]: My unqualified host name (generic) unknown; sleeping for retry Oct 29 03:04:03 generic sendmail[7116]: My unqualified host name (generic) unknown; sleeping for retry Oct 29 03:04:03 generic sendmail[7130]: My unqualified host name (generic) unknown; sleeping for retry It has been up 2 days 17 hr+ at that last. I do not know if that indicates a problem or not. Very little is changed from the snapshot expansion. I was only looking into the status of the system-clang related libc++ and such and the media content is just temporary for that purpose. === Mark Millard marklmi at yahoo.com
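For what it is worth, those sendmail messages are the classic symptom of the host not having a fully qualified name, which is unsurprising for a freshly dd'd snapshot whose hostname is just "generic". If the noise matters on such a scratch system, the usual ways to quiet it are along these lines (generic.example.org is purely a placeholder):

hostname="generic.example.org"     (in /etc/rc.conf, giving the host a qualified name)

127.0.0.1 generic.example.org generic     (or mapping the short name in /etc/hosts)

or keeping sendmail from running at all via sendmail_enable="NONE" in /etc/rc.conf.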
Is 14.0 to be released based on 0 for sysctl vfs.zfs.bclone_enabled ?
It looks to me like releng/14.0 (as of 14.0-RC4) still has:

int zfs_bclone_enabled;
SYSCTL_INT(_vfs_zfs, OID_AUTO, bclone_enabled, CTLFLAG_RWTUN,
    &zfs_bclone_enabled, 0, "Enable block cloning");

leaving block cloning effectively disabled by default, no matter what the pool has enabled.

https://www.freebsd.org/releases/14.0R/relnotes/ also reports:

QUOTE
OpenZFS has been upgraded to version 2.2. New features include:
• block cloning, which allows shallow copies of blocks in file copies. This is optional, and disabled by default; it can be enabled with sysctl vfs.zfs.bclone_enabled=1.
END QUOTE

Just curiosity on my part about the default completeness of openzfs-2.2 support, not an objection either way.

=== Mark Millard marklmi at yahoo.com
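For reference, checking both halves of this on a given system is quick (zroot below is just an example pool name):

# sysctl vfs.zfs.bclone_enabled
# zpool get feature@block_cloning zroot

The first shows the tunable quoted above; the second shows whether the pool itself has the block_cloning feature enabled or active.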
Re: Is 14.0 to be released based on 0 for sysctl vfs.zfs.bclone_enabled ?
On Nov 4, 2023, at 04:38, Mike Karels wrote: > On 4 Nov 2023, at 4:01, Ronald Klop wrote: > >> On 11/4/23 02:39, Mark Millard wrote: >>> It looks to me like releng/14.0 (as of 14.0-RC4) still has: >>> >>> int zfs_bclone_enabled; >>> SYSCTL_INT(_vfs_zfs, OID_AUTO, bclone_enabled, CTLFLAG_RWTUN, >>> &zfs_bclone_enabled, 0, "Enable block cloning"); >>> >>> leaving block cloning effectively disabled by default, no >>> matter what the pool has enabled. >>> >>> https://www.freebsd.org/releases/14.0R/relnotes/ also reports: >>> >>> QUOTE >>> OpenZFS has been upgraded to version 2.2. New features include: >>> • >>> block cloning, which allows shallow copies of blocks in file copies. This >>> is optional, and disabled by default; it can be enabled with sysctl >>> vfs.zfs.bclone_enabled=1. >>> END QUOTE >>> >> >> >> I think this answers your question in the subject. > > I think so too (and I wrote that text). Thanks for the confirmation of the final intent. I believe this makes: QUOTE author Brian Behlendorf 2023-05-25 20:53:08 + committer GitHub 2023-05-25 20:53:08 + commit 91a2325c4a0fbe01d0bf212e44fa9d85017837ce (patch) tree dd01dfce6aeef357ade1775acf18aade535c6271 . . . Update compatibility.d files Add an openzfs-2.2 compatibility file for the next release. Edon-R support has been enabled for FreeBSD removing the need for different FreeBSD and Linux files. Symlinks for the -linux and -freebsd names are created for any scripts expecting that convention. Additionally, a symlink for ubunutu-22.04 was added. Signed-off-by: Brian Behlendorf Closes #14833 END QUOTE technically incorrect in that compatibility.d/openzfs-2.2-freebsd should be distinct in content from compatibility.d/openzfs-2.2 so that block cloning would not be enabled. >>> Just curiousity on my part about the default completeness of >>> openzfs-2.2 support, not an objection either way. >>> >> >> >> I haven't seen new issues with block cloning in the last few weeks mentioned >> on the mailing lists. All known issues are fixed AFAIK. >> But I can imagine that the risk+effect ratio of data corruption is seen as a >> bit too high for a 14.0 release for this particular feature. That does not >> diminish the rest of the completeness of openzfs-2.2. >> >> NB: I'm not involved in developing openzfs or the decision making in the >> release. Just repeating what I read on the lists. > > There was another block cloning fix in 14.0-RC4; see the commit log. > Maybe there will be no more issues, but it seems that corner cases were > still being found recently. >> Looks like I'll stay at openzfs-2.1 pool features until there is a release that no longer has the default status: 0 for sysctl vfs.zfs.bclone_enabled I use main [so: 15 now] but only enable openzfs-2.* pool features supported by default on some FreeBSD release, that has an accurate compatibility.d/openzfs-2.*-freebsd file. === Mark Millard marklmi at yahoo.com
Re: Is 14.0 to be released based on 0 for sysctl vfs.zfs.bclone_enabled ?
On Nov 5, 2023, at 16:27, Martin Matuška wrote: > OpenZFS 2.2.0 in FreeBSD 14 fully supports block cloning. You can work with > pools that have feature@block_cloning enabled. > The sysctl variable vfs.zfs.bclone_enabled affects the behavior of > zfs_clone_range() which is called by copy_file_range(). When it is set to 0, > zfs_clone_range() does not do block cloning. > If it is set to anything else than 0, zfs_clone_range() does block cloning > (if all conditions are met - same ZFS pool, correct data alignment, etc.). Ahh. From the naming and vague memories of the history, I did not understand that vfs.zfs.bclone_enabled has a narrower set of consequences than the name suggests and vfs.zfs.bclone_enabled=0 does not imply any lack of support for pools that have block cloning active. May be the wording at, for example https://www.freebsd.org/releases/14.0R/relnotes/ should be more explicit about the relationships involved when vfs.zfs.bclone_enabled=0 since others may read in the same bad interpretation that I did. Thanks for the note. Very helpful. > In FreeBSD-main, this tunable is enabled and I plan to enable it in stable/14 > somewhere around December 11, 2023. > > As of today I personally use block cloning on all my systems. > > mm > > On 04/11/2023 13:35, Mark Millard wrote: >> On Nov 4, 2023, at 04:38, Mike Karels wrote: >> >>> On 4 Nov 2023, at 4:01, Ronald Klop wrote: >>> >>>> On 11/4/23 02:39, Mark Millard wrote: >>>>> It looks to me like releng/14.0 (as of 14.0-RC4) still has: >>>>> >>>>> int zfs_bclone_enabled; >>>>> SYSCTL_INT(_vfs_zfs, OID_AUTO, bclone_enabled, CTLFLAG_RWTUN, >>>>> &zfs_bclone_enabled, 0, "Enable block cloning"); >>>>> >>>>> leaving block cloning effectively disabled by default, no >>>>> matter what the pool has enabled. >>>>> >>>>> https://www.freebsd.org/releases/14.0R/relnotes/ also reports: >>>>> >>>>> QUOTE >>>>> OpenZFS has been upgraded to version 2.2. New features include: >>>>> • >>>>> block cloning, which allows shallow copies of blocks in file copies. This >>>>> is optional, and disabled by default; it can be enabled with sysctl >>>>> vfs.zfs.bclone_enabled=1. >>>>> END QUOTE >>>>> >>>> >>>> I think this answers your question in the subject. >>> I think so too (and I wrote that text). >> Thanks for the confirmation of the final intent. >> >> I believe this makes: >> >> QUOTE >> author Brian Behlendorf 2023-05-25 20:53:08 + >> committer GitHub 2023-05-25 20:53:08 + >> commit 91a2325c4a0fbe01d0bf212e44fa9d85017837ce (patch) >> tree dd01dfce6aeef357ade1775acf18aade535c6271 >> . . . >> Update compatibility.d files >> >> Add an openzfs-2.2 compatibility file for the next release. Edon-R support >> has been enabled for FreeBSD removing the need for different FreeBSD and >> Linux files. Symlinks for the -linux and -freebsd names are created for any >> scripts expecting that convention. Additionally, a symlink for ubunutu-22.04 >> was added. Signed-off-by: Brian Behlendorf Closes >> #14833 >> END QUOTE >> >> technically incorrect in that compatibility.d/openzfs-2.2-freebsd >> should be distinct in content from compatibility.d/openzfs-2.2 so >> that block cloning would not be enabled. >> >> >>>>> Just curiousity on my part about the default completeness of >>>>> openzfs-2.2 support, not an objection either way. >>>>> >>>> >>>> I haven't seen new issues with block cloning in the last few weeks >>>> mentioned on the mailing lists. All known issues are fixed AFAIK. 
>>>> But I can imagine that the risk+effect ratio of data corruption is seen as >>>> a bit too high for a 14.0 release for this particular feature. That does >>>> not diminish the rest of the completeness of openzfs-2.2. >>>> >>>> NB: I'm not involved in developing openzfs or the decision making in the >>>> release. Just repeating what I read on the lists. >>> There was another block cloning fix in 14.0-RC4; see the commit log. >>> Maybe there will be no more issues, but it seems that corner cases were >>> still being found recently. >> Looks like I'll stay at openzfs-2.1 pool features until there is >> a release that no longer has the default status: >> >> 0 for sysctl vfs.zfs.bclone_enabled >> >> I use main [so: 15 now] but only enable openzfs-2.* pool features >> supported by default on some FreeBSD release, that has an accurate >> compatibility.d/openzfs-2.*-freebsd file. > === Mark Millard marklmi at yahoo.com
RELENG_14 [process] was killed: failed to reclaim memory
mike tancsa wrote on Date: Tue, 14 Nov 2023 13:44:22 UTC : > While testing some new hardware on a recent RELENG_14 image (from Nov > 10th), I noticed some of my ssh sessions would get killed off with the > errors below (twice in 24hrs) > > pid 1697 (sshd), jid 0, uid 1001, was killed: failed to reclaim memory > pid 6274 (sshd), jid 0, uid 1001, was killed: failed to reclaim memory > . . . [My notes below are not specific to releng/14.0 or to stable/14 .] What do you have for ( copied from my /boot/loader.conf ): # # Delay when persistent low free RAM leads to # Out Of Memory killing of processes: vm.pageout_oom_seq=120 The default is 12 (last I knew, anyway). The 120 figure has allowed me and others to do buildworld, buildkernel, and poudriere bulk runs on small arm boards using all cores that otherwise got "failed to reclaim memory" (to use the modern, improved [not misleading] message text). (The units for the 120 are not time units: more like a number of (re)tries to gain at least a target amount of Free RAM before failure handling starts. The comment wording is based on a consequence of the assignment.) The 120 is not a maximum, just a figure that has proved useful in various contexts. Notes: "failed to reclaim memory" can happen even with swap space enabled but no swap in use: sufficiently active pages are just not paged out to swap space. There are some other parameters of possible use for some other modern "was killed" reason texts. === Mark Millard marklmi at yahoo.com
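A small usage sketch, assuming the tunable is the usual loader-tunable-plus-runtime-sysctl style (which matches my experience with it):

# sysctl vm.pageout_oom_seq          (check the current value; the default has been 12)
# sysctl vm.pageout_oom_seq=120      (raise it on the running system)

plus the /boot/loader.conf line shown above to make it the value used from boot onward.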
Re: RELENG_14 [process] was killed: failed to reclaim memory
On Nov 15, 2023, at 08:58, mike tancsa wrote: > On 11/14/2023 8:25 PM, Mark Millard wrote: >> mike tancsa wrote on >> Date: Tue, 14 Nov 2023 13:44:22 UTC : >> >>> While testing some new hardware on a recent RELENG_14 image (from Nov >>> 10th), I noticed some of my ssh sessions would get killed off with the >>> errors below (twice in 24hrs) >>> >>> pid 1697 (sshd), jid 0, uid 1001, was killed: failed to reclaim memory >>> pid 6274 (sshd), jid 0, uid 1001, was killed: failed to reclaim memory >>> . . . >> [My notes below are not specific to releng/14.0 or to >> stable/14 .] >> >> What do you have for ( copied from my /boot/loader.conf ): > > Thanks Mark, no tuning in there other than forcing a particular driver to > attach > > # cat /boot/loader.conf > kern.geom.label.disk_ident.enable="0" > kern.geom.label.gptid.enable="0" > cryptodev_load="YES" > zfs_load="YES" > hw.mfi.mrsas_enable=1 > t5fw_cfg_load="YES" > if_cxgbe_load="YES" > # > > > >> # >> # Delay when persistent low free RAM leads to >> # Out Of Memory killing of processes: >> vm.pageout_oom_seq=120 >> >> The default is 12 (last I knew, anyway). > > Any thoughts for a machine with a lot of RAM, Am I better to limit ARC or > change the default to 120 ? > I have vm.pageout_oom_seq=120 everywhere, from the little arm's to the ThreadRipper 1950X with 128 GiBytes of RAM. I've hit the kills in all the contexts, even UFS based on the 1950X (no ARC competing for RAM). (High load average style of bulk -a test run, using USE_TMPFS=data and using even USE_TMPFS=no .) (My bulk -a testing is rare.) === Mark Millard marklmi at yahoo.com
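For completeness, since limiting ARC came up: when ARC competition is the suspected contributor, the usual knob is vfs.zfs.arc_max, settable from /boot/loader.conf and, as far as I know, also adjustable at runtime. A sketch, with the 8 GiByte figure purely an example:

vfs.zfs.arc_max="8589934592"     (in /boot/loader.conf)

# sysctl vfs.zfs.arc_max=8589934592     (on a running system)

That said, as noted above, I've needed vm.pageout_oom_seq=120 even on UFS-only systems with no ARC in the picture.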
Is releng/13.2 deliberately missing: #15395 1ca531971 Zpool can start allocating from metaslab before TRIMs have completed
zfs: merge openzfs/zfs@d99134be8 (zfs-2.1-release) into stable/13 included a metaslab vs. TRIM related merge: QUOTE OpenZFS release 2.1.14 Notable upstream pull request merges: #15395 1ca531971 Zpool can start allocating from metaslab before TRIMs have completed END QUOTE that does not seem to have been committed into releng/13.2 . Was this deliberate? By contrast, the other 2.1.14 notable upstream pull request merge committed into stable/13: QUOTE #15571 77b0c6f04 dnode_is_dirty: check dnode and its data for dirtiness END QUOTE was also committed into releng/13.2 . === Mark Millard marklmi at yahoo.com
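For anyone wanting to check a local tree themselves, a sketch of the sort of commit-message search I have in mind (the checkout path and branch ref name are illustrative):

# git -C /usr/releng13.2-src log --oneline --grep='TRIMs have completed' releng/13.2
# git -C /usr/releng13.2-src log --oneline --grep='dnode_is_dirty' releng/13.2

An empty result for the first and a hit for the second would match what I describe above.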
aarch64 and armv6 vs. armv7 support: armv6 is not supported, despite what "man arch" reports
man arch reports: QUOTE Some machines support more than one FreeBSD ABI. Typically these are 64-bit machines, where the “native” LP64 execution environment is accompanied by the “legacy” ILP32 environment, which was the historical 32-bit predecessor for 64-bit evolution. Examples are: LP64 ILP32 counterpart amd64 i386 powerpc64 powerpc aarch64 armv6/armv7 aarch64 will support execution of armv6 or armv7 binaries if the CPU implements AArch32 execution state, however older armv4 and armv5 binaries aren't supported. END QUOTE (I take "armv6 or armv7 binaries" as what was built targeting a FreeBSD architecture triple for one of those. FreeBSD keeps them distinct.) However, the armv6 part of that is wrong: The infrastructure supports only one 32-bit alternative for a given kernel, not a family of them at once . . . sys/kern/kern_mib.c : static const char * proc_machine_arch(struct proc *p) { if (p->p_sysent->sv_machine_arch != NULL) return (p->p_sysent->sv_machine_arch(p)); #ifdef COMPAT_FREEBSD32 if (SV_PROC_FLAG(p, SV_ILP32)) return (MACHINE_ARCH32); #endif return (MACHINE_ARCH); } . . . static int sysctl_kern_supported_archs(SYSCTL_HANDLER_ARGS) { const char *supported_archs; supported_archs = #ifdef COMPAT_FREEBSD32 compat_freebsd_32bit ? MACHINE_ARCH " " MACHINE_ARCH32 : #endif MACHINE_ARCH; return (SYSCTL_OUT(req, supported_archs, strlen(supported_archs) + 1)); } sys/arm64/include/param.h : #define MACHINE_ARCHES MACHINE_ARCH " " MACHINE_ARCH32 . . . #define MACHINE_ARCH32 "armv7" (There is no "armv6" alternative present.) But with something like: #define MACHINE_ARCH32 "armv7 armv6" MACHINE_ARCH32 is not interpreted as a list of alternatives, each supported. There is code that would have to be reworked to allow a list of alternatives to work. One can build a custom kernel with: #define MACHINE_ARCH32 "armv6" and then, having booted that kernel, run armv6 on aarch64 --but, then, not armv7. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256132 is about this and has my messy notes as I explored and discovered that multiple 32-bit alternatives did not work. I see that I forgot various quote (") symbols. This note was prompted by: https://lists.freebsd.org/archives/freebsd-hackers/2023-December/002728.html that mentions "the list of valid MACHINE_ARCH" that reminded me of this old issue. === Mark Millard marklmi at yahoo.com
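A quick way to see what a given kernel actually claims, without reading the source, is the sysctl that the quoted handler implements. The output shown is what I would expect from a stock aarch64 kernel with COMPAT_FREEBSD32; treat it as illustrative:

# sysctl kern.supported_archs
kern.supported_archs: aarch64 armv7
# sysctl hw.machine_arch
hw.machine_arch: aarch64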
Re: aarch64 and armv6 vs. armv7 support: armv6 is not supported, despite what "man arch" reports
On Dec 7, 2023, at 01:19, Dimitry Andric wrote: > On 7 Dec 2023, at 05:31, Mark Millard wrote: >> >> man arch reports: >> >> QUOTE >>Some machines support more than one FreeBSD ABI. Typically these are >>64-bit machines, where the “native” LP64 execution environment is >>accompanied by the “legacy” ILP32 environment, which was the historical >>32-bit predecessor for 64-bit evolution. Examples are: >> >> LP64 ILP32 counterpart >> amd64i386 >> powerpc64powerpc >> aarch64 armv6/armv7 > > So, this might be replaced with "armv6^armv7" or "armv6 xor armv7", then? Only for folks that build from source. For those folks, a footnote about updating MACHINE_ARCH32 in sys/arm64/include/param.h would be appropriate. It is not exactly obvious or commonly known. Hmm, thinking more about the old bugzilla information . . . I'll also note that my information predated lib32 on aarch64: just chroot/jail sorts of use back then, and I just tested chroot back then. I've never tested a lib32 context for armv6 on aarch64 for an adjusted MACHINE_ARCH32. === Mark Millard marklmi at yahoo.com
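For someone building from source, a minimal sketch of the sequence involved, after changing MACHINE_ARCH32 in sys/arm64/include/param.h to "armv6" (the source path, KERNCONF name, and -j figure are illustrative):

# cd /usr/src
# make -j4 buildkernel KERNCONF=GENERIC
# make installkernel KERNCONF=GENERIC
# shutdown -r now

After the reboot only armv6, not armv7, would be the supported 32-bit alternative on that kernel.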
Re: 15 & 14: ram_attach vs. its using regions_to_avail vs. "bus_alloc_resource" can lead to: panic("ram_attach: resource %d failed to attach", rid)
On Jan 12, 2024, at 09:57, Doug Rabson wrote: > On Sat, 30 Sept 2023 at 08:47, Mark Millard wrote: > ram_attach is based on regions_to_avail but that is a problem for > its later bus_alloc_resource use --and that can lead to: > > panic("ram_attach: resource %d failed to attach", rid); > > Unfortunately, the known example is use of EDK2 on RPi4B > class systems, not what is considered the supported way. > The panic happens for main [so: 15] and will happen once > the cortex-a72 handling in 14.0-* is in a build fixed by: > > • git: 906bcc44641d - releng/14.0 - arm64: Fix errata workarounds that > depend on smccc Andrew Turner > > The lack of the fix leads to an earlier panic as stands. > > > sys/kern/subr_physmem.c 's regions_to_avail is based on ignoring > phys_avail and using only hwregions and exregions. In other words, > in part: > > * Initially dump_avail and phys_avail are identical. Boot time memory > * allocations remove extents from phys_avail that may still be included > * in dumps. > > This means that early, dedicated memory allocations are treated > as available for general use by regions_to_avail . The distinction > is visible in the boot -v output in that: > > real memory = 3138154496 (2992 MB) > Physical memory chunk(s): > 0x20 - 0x002b7f, 727711744 bytes (177664 pages) > 0x002ce3a000 - 0x003385, 111304704 bytes (27174 pages) > 0x00338c - 0x00338c6fff, 28672 bytes (7 pages) > 0x0033a3 - 0x0036ef, 55377920 bytes (13520 pages) > 0x00372e - 0x003b2f, 67239936 bytes (16416 pages) > 0x004000 - 0x00bb3dcfff, 2067648512 bytes (504797 pages) > avail memory = 3027378176 (2887 MB) > > does not list the wider: > > 0x004000 - 0x00bfff > > because of phys_avail . But the earlier dump based on hwregions and > exregions shows: > > Physical memory chunk(s): > 0x001d - 0x001e, 0 MB ( 32 pages) > 0x0020 - 0x338c6fff, 822 MB ( 210631 pages) > 0x3392 - 0x3b2f, 121 MB ( 31200 pages) > 0x4000 - 0xbfff, 2048 MB ( 524288 pages) > Excluded memory regions: > 0x001d - 0x001e, 0 MB ( 32 pages) NoAlloc > 0x2b80 - 0x2ce39fff,22 MB ( 5690 pages) NoAlloc > 0x3386 - 0x338b, 0 MB ( 96 pages) NoAlloc > 0x3392 - 0x33a2, 1 MB (272 pages) NoAlloc > 0x36f0 - 0x372d, 3 MB (992 pages) NoAlloc > > which indicates: > > 0x4000 - 0xbfff > > is available as far as it is concerned. > > (Note some code works/displays in terms of: 0x4000 - 0xc000 > instead.) > > For aarch64 , sys/arm64/arm64/nexus.c has a nexus_alloc_resource > that is used as bus_alloc_resource . It ends up rejecting the > RPi4B boot via using the result of the call in ram_attach: > > if (bus_alloc_resource(dev, SYS_RES_MEMORY, &rid, start, end, > end - start, 0) == NULL) > panic("ram_attach: resource %d failed to attach", > rid); > > as shown by the just-prior start/end pair sequence messages: > > ram0: reserving memory region: 20-2b80 > ram0: reserving memory region: 2ce3a000-3386 > ram0: reserving memory region: 338c-338c7000 > ram0: reserving memory region: 33a3-36f0 > ram0: reserving memory region: 372e-3b30 > ram0: reserving memory region: 4000-c000 > panic: ram_attach: resource 5 failed to attach > > I do not see anything about this that looks inherently RPi* > specific for possibly ending up with an analogous panic. So > I expect the example is sufficient context to identify a > problem is present, despite EDK2 use not being normal for > RPi4B's and the like as far as FreeBSD is concerned. > > I'm not quite clear why phys_avail changes Do not be confused by common labeling to distinct data: Note the "phys_avail" vs. 
"hwregions" despite the label "Physical memory chunk(s):" : static void cpu_startup(void *dummy) { vm_paddr_t size; int i; printf("real memory = %ju (%ju MB)\n", ptoa((uintmax_t)realmem), ptoa((uintmax_t)realmem) / 1024 / 1024); if (bootverbose) { printf("Physical memory chunk(s):\n"); for (i = 0; phys_avail[i + 1] != 0; i += 2) { size = phys_avail[i + 1] - phys_avail[i]; printf("%#016jx - %#016jx, %ju bytes (%ju pages)\n", (uintmax_t)phys_avail[i], (uintmax_t)phys_avail[i + 1] - 1,
RE: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
David Wolfskill wrote on Date: Sun, 28 Jan 2024 11:50:44 UTC : > Context for this is in-place source-based updates using META_MODE, amd64 > arch. > . . . > But llvm is now being rebuilt. > > Why? The following two sequences are very different: make buildworld make buildworld vs. make buildworld make installworld make buildworld The installworld can update a lot of non-source files that were used to do the first build world. META_MODE notices such updates and does rebuild activity because of them. One more sequence: make buildworld make installworld update some sources make buildworld For that the installworld may be the larger change compared to the source updates as far as contributions to rebuild activity go. This sort of thing is likely what you had happen. === Mark Millard marklmi at yahoo.com
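Spelled out as commands, the distinction looks like the following (the path and -j figure are illustrative; the parenthetical notes are not part of the commands):

# cd /usr/src
# make -j8 buildworld     (build 1)
# make -j8 buildworld     (build 2: little or no rebuild activity)

versus:

# cd /usr/src
# make -j8 buildworld     (build 1)
# make installworld       (tools such as /bin/sh get newer timestamps)
# make -j8 buildworld     (build 2: META_MODE sees newer tools and rebuilds)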
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
On Jan 28, 2024, at 07:46, David Wolfskill wrote: > On Sun, Jan 28, 2024 at 07:30:53AM -0800, Mark Millard wrote: >> ... >> The following two sequences are very different: >> >> make buildworld >> make buildworld >> >> vs. >> >> make buildworld >> make installworld >> make buildworld >> >> The installworld can update a lot of non-source >> files that were used to do the first build world. >> META_MODE notices such updates and does rebuild >> activity because of them. > > First: Thank you for replying & suggesting the above. > > That said, one of the machines in question is my local "build machine" -- > and for it, in addition to in-place source updates, I also do (weekly) > updates of my "production" machines (at home). > > And for that case, the production machines mount the builder's /usr/src > and /usr/obj (via NFS) read-only. Which machine(s) are doing the llvm rebuild that you were hoping would not happen? What was the context like for the history on that machine? (The below had to be written without understanding of such things.) Here is an example META_MODE line recording a tool used during a particular file's rebuild: E 22961 /bin/sh So installing an update to /bin/sh via isntallworld would lead to the later META_MODE (re)build indicating that the file needs to be rebuilt, just because /bin/sh ends up being newer after the installworld . There are other examples of recorded paths to tools in .meta file, such as (my old context example used in an old E-mail exchange): /usr/obj/amd64_clang/amd64.amd64/usr/fbsd/mm-src/amd64.amd64/tmp/legacy/usr/sbin/awk So if /usr/obj/. . ./tmp/legacy/usr/sbin/awk is newer than the file potentially being rebuilt, make ends up with: file '/usr/obj/amd64_clang/amd64.amd64/usr/fbsd/mm-src/amd64.amd64/tmp/legacy/usr/sbin/awk' is newer than the target... (make has a mode that reports such things. I used it to find out what all contributed to some rebuild activity in order to figure out the general type of thing that was happeneing. Then I used it to find all the "is newer than" material that I expected to be unlikely to contribute to build changes.) It does not matter if: /usr/obj/amd64_clang/amd64.amd64/usr/fbsd/mm-src/amd64.amd64/tmp/legacy/usr/sbin/awk is read-only at the potential-rebuild-of-file time. Only if it is newer. Simon J. Gerraty and I had a long exchange about this in 2023-Feb, that was in turn based on a earlier 2021-Jan report of mine. There are also issues when symbolic links are involved, if I remember right. At the time (2023) I was doing experiments with making some of this "unlikely to cause build differences" material end up being ignored. Ultimately, Simon provided me a patch to share/mk/src.sys.obj.mk to help with my experiments. See "Re: FYI: Why META_MODE rebuilds so much for building again after installworld (no source changes)", starting with the 2023-Feb material at: https://lists.freebsd.org/archives/freebsd-current/2023-February/003239.html I will note that my activity did not involve NFS mounts, only completely self-hosted builds on directly connected media, the boot media. I've no evidence if such NFS involvement makes any additional differences. bectl use can be used to keep around an example "after the build but before the install" place from the most recent build. It can be used for doing the next build to avoid the later installworld consequences on time relationships for the likes of /bin/sh . (It is also a place to revert to if an install went badly.) 
> And without complaints of attempts to > scribble on read-only stuff. :-} Detailed time relationships are what matter. You may have to work out what those are. > So if "make installworld" messes with anything that META_MODE cares > about ... that would appear to be somewhat surprising. See above. > Mind, I've been wrong before, and I do intend to live long enough to be > wrong again :-) > >> One more sequence: >> >> make buildworld >> make installworld >> update some sources >> make buildworld >> >> For that the installworld may be the larger >> change compared to the source updates as far >> as contributions to rebuild activity go. >> >> This sort of thing is likely what you had >> happen. >> > > Hmm Thanks again. === Mark Millard marklmi at yahoo.com
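As a concrete illustration of hunting for such recordings, something like the following can list which .meta files name a given tool (the object tree path is illustrative for an amd64 self-hosted build, and the pattern assumes the recorded lines end with the tool's path, as in the example line earlier):

# grep -rl --include='*.meta' ' /bin/sh$' /usr/obj/usr/src/amd64.amd64 | head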
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
[Note: your email is rejecting my E-mail: 554: 5.7.1 ] On Jan 28, 2024, at 14:06, David Wolfskill wrote: > On Sun, Jan 28, 2024 at 10:20:30AM -0800, Mark Millard wrote: >> ... >>> That said, one of the machines in question is my local "build machine" -- >>> and for it, in addition to in-place source updates, I also do (weekly) >>> updates of my "production" machines (at home). >>> >>> And for that case, the production machines mount the builder's /usr/src >>> and /usr/obj (via NFS) read-only. >> >> Which machine(s) are doing the llvm rebuild that >> you were hoping would not happen? > > Each of the 3 machines that I update via in-place source updates: the > above-cited "buildl machine" and a couple of laptops. > >> What was the context like for the history on that machine? > > Each of the machines is updated daily (except when I'm away and > off-Net); each is updated to the same commit (as each has a local > private mirror for the FreeBSD git repositories, and after updating the > build machine's mirror, I use rsync to ensure that the laptops' mirrors > are in sync with that). > > Update histories for the build machine and one of the laptops is > available at https://www.catwhisker.org/~david/FreeBSD/history/ > > In each of the 3 cases this morning, the machine was running > stable/14-n266551-63a7e799b32c and updated to > stable/14-n266554-2ee407b6068a, which (as noted earlier) only changed > src/usr.sbin/bhyve/pci_nvme.c. And each machine rebuilt llvm durng > "make buildworld". When you built and then installed stable/14-n266551-63a7e799b32c if you had then simply started another build where you installed, it would have rebuilt llvm at that point --before stable/14-n266554-2ee407b6068a updated source was even present. The install of 63a7e799b32c made various tools used to do builds newer than the files used to do the build of 63a7e799b32c. That is enough for META_MODE to initiate rebuild activity so that things end up synchronized to be based on the updated installed tools. (Some tools might not be updated, others might be. The details depend on which are updated with new timestamps used by makes "newer" checks.) Try running make with the debug mode turned on that reports the "newer than" notices for what leads to rebuild activity (make -dM) after a notable installworld but before any source code updates. You might not like the full range of things checked but you will see why things are rebuilt. META_MODE tests date relationships among more files than you are considering. > >> (The below had to be written without understanding >> of such things.) >> >> Here is an example META_MODE line recording a >> tool used during a particular file's rebuild: >> >> E 22961 /bin/sh >> >> So installing an update to /bin/sh via isntallworld >> would lead to the later META_MODE (re)build >> indicating that the file needs to be rebuilt, just >> because /bin/sh ends up being newer after the >> installworld . > > Perhaps I should rephrase my query to "*Should* an update of (only) > src/usr.sbin/bhyve/pci_nvme.c cause 'make buildworld' using META_MODE to > rebuild llvm?" I seem to have empirical evidence that it does do that. Changes to src/usr.sbin/bhyve/pci_nvme.c are not a cause of the rebuild. The prior installworld of 63a7e799b32c is the cause of the rebuild. If you had tried the build before updating the source tree, it still would have rebuilt llvm. 
> >> There are other examples of recorded paths to tools >> in .meta file, such as (my old context example >> used in an old E-mail exchange): >> >> /usr/obj/amd64_clang/amd64.amd64/usr/fbsd/mm-src/amd64.amd64/tmp/legacy/usr/sbin/awk >> >> So if /usr/obj/. . ./tmp/legacy/usr/sbin/awk is newer than the file >> potentially being rebuilt, make ends up with: >> > > Right; after some discussion with Simon and/or Bryan (back on 08 July > 2017), I augmented /etc/src.conf on the laptops to include: > > .MAKE.META.IGNORE_PATHS += /usr/local/etc/libmap.d > > because I (also) had: > > PORTS_MODULES+=x11/nvidia-driver-390 > > in there, so x11/nvidia-driver-390 was being rebuilt every time the > kernel was being rebuilt, and that caused /usr/local/etc/libmap.d to > get an update. So META_MODE wasn't cutting down on the rebuilds in that > case. > > The above .MAKE.META.IGNORE_PATHS line helped address that issue. > > Perhaps something somewhat similar is wanted to prevent the situation > that catalyzed the initial message in this thread? === Mark Millard marklmi at yahoo.com
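To expand on the make -dM suggestion above, a sketch of capturing and then filtering the output (the file name and -j figure are illustrative; expect the log to be large):

# cd /usr/src
# script /tmp/buildworld-dM.txt make -dM -j8 buildworld
# grep 'is newer than the target' /tmp/buildworld-dM.txt | sort -u | less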
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
On Jan 28, 2024, at 14:34, Mark Millard wrote: > [Note: your email is rejecting my E-mail: > 554: 5.7.1 ] > > On Jan 28, 2024, at 14:06, David Wolfskill wrote: > >> On Sun, Jan 28, 2024 at 10:20:30AM -0800, Mark Millard wrote: >>> ... >>>> That said, one of the machines in question is my local "build machine" -- >>>> and for it, in addition to in-place source updates, I also do (weekly) >>>> updates of my "production" machines (at home). >>>> >>>> And for that case, the production machines mount the builder's /usr/src >>>> and /usr/obj (via NFS) read-only. >>> >>> Which machine(s) are doing the llvm rebuild that >>> you were hoping would not happen? >> >> Each of the 3 machines that I update via in-place source updates: the >> above-cited "buildl machine" and a couple of laptops. >> >>> What was the context like for the history on that machine? >> >> Each of the machines is updated daily (except when I'm away and >> off-Net); each is updated to the same commit (as each has a local >> private mirror for the FreeBSD git repositories, and after updating the >> build machine's mirror, I use rsync to ensure that the laptops' mirrors >> are in sync with that). >> >> Update histories for the build machine and one of the laptops is >> available at https://www.catwhisker.org/~david/FreeBSD/history/ >> >> In each of the 3 cases this morning, the machine was running >> stable/14-n266551-63a7e799b32c and updated to >> stable/14-n266554-2ee407b6068a, which (as noted earlier) only changed >> src/usr.sbin/bhyve/pci_nvme.c. And each machine rebuilt llvm durng >> "make buildworld". > > When you built and then installed stable/14-n266551-63a7e799b32c > if you had then simply started another build where you installed, > it would have rebuilt llvm at that point --before > stable/14-n266554-2ee407b6068a updated source was even present. > > The install of 63a7e799b32c made various tools used to > do builds newer than the files used to do the build of > 63a7e799b32c. That is enough for META_MODE to initiate > rebuild activity so that things end up synchronized to > be based on the updated installed tools. (Some tools > might not be updated, others might be. The details > depend on which are updated with new timestamps used > by makes "newer" checks.) > > Try running make with the debug mode turned on that > reports the "newer than" notices for what leads to > rebuild activity (make -dM) after a notable installworld > but before any source code updates. You might not like > the full range of things checked but you will see why > things are rebuilt. > > META_MODE tests date relationships among more files > than you are considering. > >> >>> (The below had to be written without understanding >>> of such things.) >>> >>> Here is an example META_MODE line recording a >>> tool used during a particular file's rebuild: >>> >>> E 22961 /bin/sh >>> >>> So installing an update to /bin/sh via isntallworld >>> would lead to the later META_MODE (re)build >>> indicating that the file needs to be rebuilt, just >>> because /bin/sh ends up being newer after the >>> installworld . >> >> Perhaps I should rephrase my query to "*Should* an update of (only) >> src/usr.sbin/bhyve/pci_nvme.c cause 'make buildworld' using META_MODE to >> rebuild llvm?" I seem to have empirical evidence that it does do that. > > Changes to src/usr.sbin/bhyve/pci_nvme.c are not a > cause of the rebuild. The prior installworld of > 63a7e799b32c is the cause of the rebuild. 
> > If you had tried the build before updating the > source tree, it still would have rebuilt llvm. > >> >>> There are other examples of recorded paths to tools >>> in .meta file, such as (my old context example >>> used in an old E-mail exchange): >>> >>> /usr/obj/amd64_clang/amd64.amd64/usr/fbsd/mm-src/amd64.amd64/tmp/legacy/usr/sbin/awk >>> >>> So if /usr/obj/. . ./tmp/legacy/usr/sbin/awk is newer than the file >>> potentially being rebuilt, make ends up with: >>> >> >> Right; after some discussion with Simon and/or Bryan (back on 08 July >> 2017), I augmented /etc/src.conf on the laptops to include: >> >> .MAKE.META.IGNORE_PATHS += /usr/local/etc/libmap.d >> >> because I (also) had: >> >> PORTS_MODULES+=x11/nvidia-driver-390 &g
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
On Jan 28, 2024, at 16:05, David Wolfskill wrote: > On Sun, Jan 28, 2024 at 03:00:59PM -0800, Mark Millard wrote: >> ... >> To be clear, referencing details of your context: >> >> When you had the stable/14 machines at 1c090bf880bf: >> >> A) You built (META_MODE): 63a7e799b32c >> B) You installed: 63a7e799b32c >> C) You rebooted into: 63a7e799b32c >> >> I'm claiming that next doing: >> >> D) build again (still META_MODE): 63a7e799b32c >> >> would have rebuilt llvm at that point, the >> time-relationship cause(s) being set up >> during (B). > > As it happens, I rather fumble-fingered the (intended) reboot on the 2nd > laptop (and started another rebuild instead). > > And I do these within script(1), as it's handy to have a record. > > Note that this differes from the sequence you cite above, in that I > failed to do the reboot. > > So I powered it back up and -- without updating sources (or the local > repo mirror, for that matter) -- did another rebuild. > I'm having trouble identifying the detailed sequencing being reported below. Doing on one machine: installworld buidlworld buildworld buildworld . . Will only take large times for the first one (potentially). But doing: installworld buidlworld installworld buildworld installworld buildworld . . Can have each buildworld take large times depending the the details involved. I need to understand more about what happened before each buildworld on each machine to know what sort of timestamp relationships are involved for files. installworld can significantly change various timestamp relationships. > Here is an extract of some salient lines from the typescript file: > > g1-48(14.0-S)[4] egrep ' built in |Installing .* (started|completed)|Removing > old libraries| stable/14-n' s1 > FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #38 > stable/14-n266551-63a7e799b32c: Sat Jan 27 11:40:05 UTC 2024 > r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 > 1400506 1400506 >>>> World built in 2351 seconds, ncpu: 8, make -j16 Was a prior step (ignoring reboots, say) an installworld of 63a7e799b32c, with no other buidlworlds after the installworld? (I'm wording for major steps or my description the possibilities would get rather complicated and large.) >>>> Kernel(s) CANARY built in 898 seconds, ncpu: 8, make -j16 >>>> Installing kernel CANARY completed on Sun Jan 28 12:25:27 UTC 2024 >>>> Installing everything started on Sun Jan 28 12:25:57 UTC 2024 >>>> Installing everything completed on Sun Jan 28 12:28:01 UTC 2024 > FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #38 > stable/14-n266551-63a7e799b32c: Sat Jan 27 11:40:05 UTC 2024 > r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 > 1400506 1400506 >>>> World built in 116 seconds, ncpu: 8, make -j16 Was a prior step (ignoring reboots, say) an installworld of 63a7e799b32c, with no other buidlworlds after the installworld? Is the answer different here? 
>>>> Kernel(s) CANARY built in 920 seconds, ncpu: 8, make -j16 >>>> Installing kernel CANARY completed on Sun Jan 28 12:47:55 UTC 2024 >>>> Installing everything started on Sun Jan 28 12:48:25 UTC 2024 >>>> Installing everything completed on Sun Jan 28 12:50:01 UTC 2024 > FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #40 > stable/14-n266554-2ee407b6068a: Sun Jan 28 12:39:17 UTC 2024 > r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 > 1400506 1400506 >>>> Removing old libraries > FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #40 > stable/14-n266554-2ee407b6068a: Sun Jan 28 12:39:17 UTC 2024 > r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 > 1400506 1400506 >>>> World built in 124 seconds, ncpu: 8, make -j16 Was a prior step (ignoring reboots, say) an installworld of 63a7e799b32c with no other buidlworlds after the, installworld? >>>> Kernel(s) CANARY built in 901 seconds, ncpu: 8, make -j16 >>>> Installing kernel CANARY completed on Sun Jan 28 23:34:39 UTC 2024 >>>> Installing everything started on Sun Jan 28 23:35:09 UTC 2024 >>>> Installing everything completed on Sun Jan 28 23:37:16 UTC 2024 > FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #41 > stable/14-n266554-2ee407b6068a: Sun Jan 28 23:26:10 UTC 2024 > r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 > 1400506 1400506 >>>> Removing old libraries > g1-48(14.0-S)[5]
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
ch the partial rebuild of 63a7e799b32c contributes to timestamps that would cause more rebuilds. The 116 sec indicates: not much gets new timestamps this time. ~/Downloads/build_typescript.txt:119629: >>> Kernel(s) CANARY built in 920 seconds, ncpu: 8, make -j16 ~/Downloads/build_typescript.txt:119636: >>> Installing kernel CANARY on Sun Jan 28 12:47:27 UTC 2024 ~/Downloads/build_typescript.txt:122450: >>> Installing kernel CANARY completed on Sun Jan 28 12:47:55 UTC 2024 installkernel does not change notable timestamp relationships of tools and such vs. other files. ~/Downloads/build_typescript.txt:123346: >>> Installing everything started on Sun Jan 28 12:48:25 UTC 2024 ~/Downloads/build_typescript.txt:162156: >>> Installing everything completed on Sun Jan 28 12:50:01 UTC 2024 This install's both the partial-63a7e799b32c-rebuild material and the 2ee407b6068a material. The 116 sec figure suggests that there is not man files with updated timestamps. A reboot is involved here (or just below), so 2ee407b6068a will show up. ~/Downloads/build_typescript.txt:162840: To remove old libraries run 'make delete-old-libs'. ~/Downloads/build_typescript.txt:162841: >> make delete-old OK ~/Downloads/build_typescript.txt:162895: FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #40 stable/14-n266554-2ee407b6068a: Sun Jan 28 12:39:17 UTC 2024 r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 1400506 1400506 The 2ee407b6068a kernel now shows as being in operation. ~/Downloads/build_typescript.txt:162897: >>> Removing old libraries ~/Downloads/build_typescript.txt:162932: FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #40 stable/14-n266554-2ee407b6068a: Sun Jan 28 12:39:17 UTC 2024 r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 1400506 1400506 Still 2ee407b6068a. ~/Downloads/build_typescript.txt:162938: >>> World build started on Sun Jan 28 23:17:05 UTC 2024 ~/Downloads/build_typescript.txt:180497: >>> World built in 124 seconds, ncpu: 8, make -j16 It is possible here that little or no 63a7e799b32c related timestamp changes that lead to rebuild activity were involved in the above buildworld . It depends on the details of what was rebuilt. the 116 sec and 124 sec figures both suggest: no much overall. ~/Downloads/build_typescript.txt:200023: >>> Kernel(s) CANARY built in 901 seconds, ncpu: 8, make -j16 ~/Downloads/build_typescript.txt:200030: >>> Installing kernel CANARY on Sun Jan 28 23:34:11 UTC 2024 ~/Downloads/build_typescript.txt:202844: >>> Installing kernel CANARY completed on Sun Jan 28 23:34:39 UTC 2024 installkernel does not change notable timestamp relationships of tools and such vs. other files. ~/Downloads/build_typescript.txt:203743: >>> Installing everything started on Sun Jan 28 23:35:09 UTC 2024 ~/Downloads/build_typescript.txt:242553: >>> Installing everything completed on Sun Jan 28 23:37:16 UTC 2024 2ee407b6068a will still show up after the the reboot. ~/Downloads/build_typescript.txt:243237: To remove old libraries run 'make delete-old-libs'. ~/Downloads/build_typescript.txt:243238: >> make delete-old OK ~/Downloads/build_typescript.txt:243292: FreeBSD g1-48.catwhisker.org 14.0-STABLE FreeBSD 14.0-STABLE #41 stable/14-n266554-2ee407b6068a: Sun Jan 28 23:26:10 UTC 2024 r...@g1-48.catwhisker.org:/common/S1/obj/usr/src/amd64.amd64/sys/CANARY amd64 1400506 1400506 Yep, still 2ee407b6068a. ~/Downloads/build_typescript.txt:243294: >>> Removing old libraries Overall this sequence fits what I expect. 
The above wording is more detailed than my earlier quick summaries. === Mark Millard marklmi at yahoo.com
Re: Should changes in src/usr.sbin/bhyve/ trigger an llvm rebuild?
On Jan 29, 2024, at 01:50, Alexander Leidinger wrote: > Am 2024-01-29 00:00, schrieb Mark Millard: > >> I would have to see make -dM output from (D) to >> find the specific timing relationships that lead >> to that. There is way to much to analyze the >> specifics manually, especially because dependency >> chains have to be considered. > > Not -stable, but -current Sequence going back to where a commit change was involved and installed/booted? That older commit was what? The newer one? The content of that change contributes to what range of "is newer than" stuff shows up in the first buildworld after the first installworld-then-reboot to the newer commit. A limiting case is doing a buildworld into an empty /usr/obj/ like area so that its later install has everything freshly built (new timestamps) compared to the prior context. Then doing a installworld buildworld sequence may have more "is newer than" notices. (Some cases of updates approximate such a "largely rebuilt" status, others do not.) The list is illustrative as is, just possibly not definitive. > (no change to src, buildworld after installworld to a new BE and booting this > new BE): > # grep newer buildworld_debug.log | grep -E 'amd64.amd64/tmp/(usr|legacy)/' | > cut -d : -f 3 | sort -u > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/include/roken.h' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/asn1_compile' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/awk' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/basename' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/cap_mkdb' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/cat' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/clang-tblgen' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/compile_et' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/cp' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/crunchgen' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/crunchide' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/dd' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/env' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/file2c' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/gencat' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/grep' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/gzip' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/jot' > is newer than the target... 
> file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/lex' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/lldb-tblgen' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/llvm-min-tblgen' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/llvm-tblgen' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/ln' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/m4' > is newer than the target... > file > '/space/system/usr_obj/space/system/usr_src/amd64.amd64/tmp/legacy/usr/sbin/make-roken' > is newer than the target... > fi
Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
Questions include (generic list for reference, even if some have been specified): For /boot/loader.conf (for example) : What value of sysctl vm.pageout_oom_seq is in use? This indirectly adjusts the delay before sustained low free RAM leads to killing processes. Default 12 but 120 is what I use across a wide variety of systems. More is possible. For /etc/sysctl.conf : What values of sysctl vm.swap_enabled and sysctl vm.swap_idle_enabled are in use? (They work as a pair.) Together they can avoid kernel stacks being swapped out. (Processes still can page out inactive pages, but not their kernel stacks.) Processes with their kernel stacks swapped out to storage media do not run until the kernel stacks are swapped back in. Avoiding such for kernel stacks of processes involved in interacting with the system can be important to maintaining control. This is a big hammer that is not limited to such processes. Both being 0 is what leads to kernel stacks not being swapped out. For /usr/local/etc/poudriere.conf : What values of the following are in use? NO_ZFS USE_TMPFS PARALLEL_JOBS ALLOW_MAKE_JOBS MAX_EXECUTION_TIME NOHANG_TIME MAX_EXECUTION_TIME_EXTRACT MAX_EXECUTION_TIME_INSTALL MAX_EXECUTION_TIME_PACKAGE MAX_EXECUTION_TIME_DEINSTALL (Some, of course, may still have the default value so the default value would be the answer in such cases.) Also: Other system tmpfs use outside poudriere? ZFS in use in system even if poudriere has NO_ZFS set? (Such is likely uncommon but is possible.) (Other contexts than poudriere could have some analogous questions.) For /usr/local/etc/poudriere.d/make.conf (for example) : What value of the likes of MAKE_JOBS_NUMBER is in use? Note: PARALLEL_JOBS, ALLOW_MAKE_JOBS, and the likes of MAKE_JOBS_NUMBER have as context the number of hardware threads in the context. The 3 load averages (over different time frames) vs. the hardware threads for the system is relevant information. Note: with various examples of package builds that use 25+ GiBytes of temporary file space, USE_TMPFS can be highly relevant, as is the RAM space, SWAP space, and the resultant RAM+SWAP space. But just the file I/O can be relevant, even if there is no tmpfs use. There are questions like: Spinning rust media usage? (An over-specific but suggestive reference from the more general subject area.) Serial console shows a responsiveness problem? Simple ssh session over local EtherNet? Only if there is a GUI present, even if it is not being actively used? Only GUI interactions show a responsiveness problem? Going in another direction . . . I'm no ZFS tuning expert but I had performance problems that I described on the lists and the person that had increased vfs.zfs.per_txg_dirty_frees_percent had me try setting it back to vfs.zfs.per_txg_dirty_frees_percent=5 . In my context, the change was very helpful --but, to me, it was pure magic. My point is more that you may need judgments from someone with appropriate internal ZFS knowledge if you are to explore tuning ZFS. I've no evidence that the specific setting would be helpful. There has been an effort to deal with arc_prune problems/overhead. See: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594 === Mark Millard marklmi at yahoo.com
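For reference, a minimal sketch of the /etc/sysctl.conf entries the swap-related questions are about (the values shown are the ones discussed for keeping kernel stacks from being swapped out, not a general recommendation):

vm.swap_enabled=0
vm.swap_idle_enabled=0

and the ZFS setting mentioned near the end, should someone with the appropriate ZFS knowledge suggest trying it:

vfs.zfs.per_txg_dirty_frees_percent=5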
Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
Peter 'PMc' Much wrote on Date: Thu, 29 Feb 2024 13:40:05 UTC : > On 2024-02-27, Edward Sanford Sutton, III wrote: > > More recently looked and see top showing threads+system processes > > shows I have one core getting 100% cpu for kernel{arc_prune} which has > > 21.2 hours over a 2 hour 23 minute uptime. > > Ack. > > > I started looking to see if > > https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc > > was available as a fix for 13 but it is not (and doesn't quite sound > > like it was supposed to apply to this issue). Would a kernel thread time > > at 100% cpu for only 1 core explain the system becoming unusually > > unresponsive? > > That depends. This arc_prune issue does usually go alongside with some > other kernel thread (vm-whatever) also blocking, so you have two cores > busy. How many remain? > > There is an updated patch in the PR 275594 (5 pieces), that works for > 13.3; I have it installed, and only with that I am able to build gcc12 > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120 > does not help with this). The kernel has multiple, distinct OOM messages. Which type are you seeing? : "failed to reclaim memory" "a thread waited too long to allocate a page" "swblk or swpctrie zone exhausted" "unknown OOM reason %d" Also, but only for boot verbose: "proc %d (%s) failed to alloc page on fault, starting OOM\n" vm.pageout_oom_seq is specific to delaying just: "failed to reclaim memory" === Mark Millard marklmi at yahoo.com
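A quick way to see which of those messages a system actually produced is to look back through the kernel output (illustrative; the same messages normally also end up in /var/log/messages):

# dmesg | grep 'was killed:'
# grep 'was killed:' /var/log/messages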
Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
[I grabbed locally modify text for one of those messages.] On Feb 29, 2024, at 08:02, Mark Millard wrote: > Peter 'PMc' Much wrote on > Date: Thu, 29 Feb 2024 13:40:05 UTC : > >> On 2024-02-27, Edward Sanford Sutton, III wrote: >>> More recently looked and see top showing threads+system processes >>> shows I have one core getting 100% cpu for kernel{arc_prune} which has >>> 21.2 hours over a 2 hour 23 minute uptime. >> >> Ack. >> >>> I started looking to see if >>> https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc >>> was available as a fix for 13 but it is not (and doesn't quite sound >>> like it was supposed to apply to this issue). Would a kernel thread time >>> at 100% cpu for only 1 core explain the system becoming unusually >>> unresponsive? >> >> That depends. This arc_prune issue does usually go alongside with some >> other kernel thread (vm-whatever) also blocking, so you have two cores >> busy. How many remain? >> >> There is an updated patch in the PR 275594 (5 pieces), that works for >> 13.3; I have it installed, and only with that I am able to build gcc12 >> - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120 >> does not help with this). > > The kernel has multiple, distinct OOM messages. Which type are you > seeing? : > > "failed to reclaim memory" > "a thread waited too long to allocate a page" Local text: > "swblk or swpctrie zone exhausted" Should have been: "out of swap space" > "unknown OOM reason %d" > > Also, but only for boot verbose: > > "proc %d (%s) failed to alloc page on fault, starting OOM\n" > > > > vm.pageout_oom_seq is specific to delaying just: > "failed to reclaim memory" > === Mark Millard marklmi at yahoo.com
Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
On Feb 29, 2024, at 08:21, Peter wrote: > On Thu, Feb 29, 2024 at 08:02:42AM -0800, Mark Millard wrote: > ! Peter 'PMc' Much wrote on > ! Date: Thu, 29 Feb 2024 13:40:05 UTC : > ! > ! > There is an updated patch in the PR 275594 (5 pieces), that works for > ! > 13.3; I have it installed, and only with that I am able to build gcc12 > ! > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120 > ! > does not help with this). > ! > ! The kernel has multiple, distinct OOM messages. Which type are you > ! seeing? : > ! > ! "a thread waited too long to allocate a page" > > That one. That explains why vm.pageout_oom_seq=5120 did not make a notable difference in the time frame. If you cause a verbose boot the code: if (bootverbose) printf( "proc %d (%s) failed to alloc page on fault, starting OOM\n", curproc->p_pid, curproc->p_comm); likely will report what process had failed to get a page in a timely manner. There also is control over the criteria for this but it is more complicated. In /boot/loader.conf (I'm using defaults): # # For plenty of swap/paging space (will not # run out), avoid pageout delays leading to # Out Of Memory killing of processes: #vm.pfault_oom_attempts=-1 # # For possibly insufficient swap/paging space # (might run out), increase the pageout delay # that leads to Out Of Memory killing of # processes (showing defaults at the time): #vm.pfault_oom_attempts= 3 #vm.pfault_oom_wait= 10 # (The multiplication is the total but there # are other potential tradeoffs in the factors # multiplied, even for nearly the same total.) If you can be sure of not running out of swap/paging space, you might try vm.pfault_oom_attempts=-1 . If you do run out of swap/paging space, it would deadlock, as I understand. So, if you can tolerate that, the -1 might be an option even if you do run out of swap/paging space. I do not have specific suggestions for alternatives to 3 and 10. It would be exploratory for me if I had to try such. For reference: # sysctl -Td vm.pfault_oom_attempts vm.pfault_oom_wait vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler === Mark Millard marklmi at yahoo.com
Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task
On Feb 29, 2024, at 09:40, Mark Millard wrote: > On Feb 29, 2024, at 08:21, Peter wrote: > >> On Thu, Feb 29, 2024 at 08:02:42AM -0800, Mark Millard wrote: >> ! Peter 'PMc' Much wrote on >> ! Date: Thu, 29 Feb 2024 13:40:05 UTC : >> ! >> ! > There is an updated patch in the PR 275594 (5 pieces), that works for >> ! > 13.3; I have it installed, and only with that I am able to build gcc12 >> ! > - otherwise the system would just OOM-crash (vm.pageout_oom_seq=5120 >> ! > does not help with this). >> ! >> ! The kernel has multiple, distinct OOM messages. Which type are you >> ! seeing? : >> ! >> ! "a thread waited too long to allocate a page" >> >> That one. > > That explains why vm.pageout_oom_seq=5120 did not make a > notable difference in the time frame. > > If you cause a verbose boot the code: > > if (bootverbose) > printf( > "proc %d (%s) failed to alloc page on fault, starting OOM\n", > curproc->p_pid, curproc->p_comm); > > likely will report what process had failed to get a > page in a timely manor. > > There also is control over the criteria for this but is > is more complicated. In /boot/loader.conf (I'm using > defaults): > > # > # For plunty of swap/paging space (will not > # run out), avoid pageout delays leading to > # Out Of Memory killing of processes: > #vm.pfault_oom_attempts=-1 > # > # For possibly insufficient swap/paging space > # (might run out), increase the pageout delay > # that leads to Out Of Memory killing of > # processes (showing defaults at the time): > #vm.pfault_oom_attempts= 3 > #vm.pfault_oom_wait= 10 > # (The multiplication is the total but there > # are other potential tradoffs in the factors > # multiplied, even for nearly the same total.) > > If you can be sure of not running out of swap/paging > space, you might try vm.pfault_oom_attempts=-1 . > If you do run out of swap/paging space, it would > deadlock, as I understand. So, if you can tolerate > that the -1 might be an option even if you do run > out of swap/paging space. > > I do not have specific suggestions for alternatives > to 3 and 10. It would be exploratory for me if I had > to try such. > > For reference: > > # sysctl -Td vm.pfault_oom_attempts vm.pfault_oom_wait > vm.pfault_oom_attempts: Number of page allocation attempts in page fault > handler before it triggers OOM handling > vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying > the page fault handler I'll note that vm.pageout_oom_seq , vm.pfault_oom_attempts , and vm.pfault_oom_wait are all live writable, not just boot-time tunables. In other words, all show a line of output in: # sysctl -Wd vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler Not just in: # sysctl -Td vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler (To see values, to not use the "d".) === Mark Millard marklmi at yahoo.com
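So, for example, experimenting on a live system can be as simple as the following (the values are just the ones from my earlier notes, not recommendations for any particular workload):

# sysctl vm.pageout_oom_seq=120
# sysctl vm.pfault_oom_attempts=-1
# sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait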
RE: FreeBSD 14-0 file swapping broken.
Artem Hevorhian wrote on Date: Sun, 09 Jun 2024 15:30:21 UTC : > I would like to report that, likely, in FreeBSD version 14.0-stable, file > swapping is broken. To confirm, here is what I tried to do and what I > achieved. In order to reproduce the problem, please follow the following > steps. > > I was following this tutorial > https://www.cyberciti.biz/faq/create-a-freebsd-swap-file/ > > I created a large swap file (8192 MiB) and saved it to /root/swap.8G.bin. > > After that, I ran > > sudo chmod 0600 /root/swap.8G.bin > > After that, I updated fstab by adding the following line to the end. > > md42 none swap sw,file=/root/swap.8G.bin 0 0 > > On running > > sudo swapon -aq > > I got the swap file working initially, and I saw it after running swapinfo. > But on reboot, it disappeared. Going in a different direction from how to enable use of swap files, consider the following. It is not FreeBSD version specific for any supported version (RELEASE or STABLE) or for main [future: 15.*] and has a long history going back into now long unsupported versions. QUOTE ( of https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048#c7 ) On 2017-Feb-13, at 7:20 PM, Konstantin Belousov wrote on the freebsd-arm list: . . swapfile write requires the write request to come through the filesystem write path, which might require the filesystem to allocate more memory and read some data. E.g. it is known that any ZFS write request allocates memory, and that write request on large UFS file might require allocating and reading an indirect block buffer to find the block number of the written block, if the indirect block was not yet read. As result, swapfile swapping is more prone to the trivial and unavoidable deadlocks where the pagedaemon thread, which produces free memory, needs more free memory to make a progress. Swap write on the raw partition over simple partitioning scheme directly over HBA are usually safe, while e.g. zfs over geli over umass is the worst construction. END QUOTE Summary consequence: I recommend only using swap partitions, not swap files. Yes, I have suffered deadlocks from attempted swap file use, with just UFS over umass (USB SSD) being what held the swap file in question. === Mark Millard marklmi at yahoo.com
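For the swap-partition alternative, a minimal sketch (the disk device, label, and size are illustrative and assume a GPT partitioning scheme with free space available):

# gpart add -t freebsd-swap -l swap0 -s 8g ada0
# swapon /dev/gpt/swap0

with the matching /etc/fstab line being:

/dev/gpt/swap0  none  swap  sw  0  0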
RE: New FreeBSD snapshots available: stable/14 (20240606 e77813f7e4a3) [ bad stable/14 info for 21 Jun 2024, empty snapshots/ISO-IMAGES/14.1/ ]
Looking at: https://lists.freebsd.org/archives/freebsd-snapshots/2024-June/000419.html ( Date: Fri, 21 Jun 2024 00:42:00 UTC ) and at: https://lists.freebsd.org/archives/freebsd-snapshots/2024-June/000414.html ( Date: Fri, 07 Jun 2024 00:37:56 UTC ) they both indicate: 20240606 e77813f7e4a3 Also: http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/14.1/ is empty. This prevents me from suggesting a test of whether a bug report is reproducible from an official stable/14 snapshot instead of just from someone's personal build of stable/14 (for a RPi3B failure context). === Mark Millard marklmi at yahoo.com