Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 3, 2023, at 19:54, Mark Millard wrote: > ThreadRipper 1950X (32 hardware threads) doing bulk -J128 > with USE_TMPFS=no , no ALLOW_MAKE_JOBS , no > ALLOW_MAKE_JOBS_PACKAGES , USB3 NVMe SSD storage/ZFS-boot-media, > debug system build in use : > > [00:03:44] Building 34214 packages using up to 128 builders > [00:03:44] Hit CTRL+t at any time to see build progress and stats > [00:03:44] [01] [00:00:00] Builder starting > [00:04:37] [01] [00:00:53] Builder started > [00:04:37] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6 > [00:05:53] [01] [00:01:16] Finished ports-mgmt/pkg | pkg-1.20.6: Success > [00:06:15] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1 > [00:06:15] [02] [00:00:00] Builder starting > . . . > [00:06:18] [128] [00:00:00] Builder starting > [00:07:42] [01] [00:01:27] Finished print/indexinfo | indexinfo-0.3.1: Success > [00:07:45] [01] [00:00:00] Building devel/gettext-runtime | > gettext-runtime-0.22_1 > [00:18:45] [01] [00:11:00] Finished devel/gettext-runtime | > gettext-runtime-0.22_1: Success > [00:19:06] [01] [00:00:00] Building devel/gmake | gmake-4.3_2 > [00:24:13] [01] [00:05:07] Finished devel/gmake | gmake-4.3_2: Success > [00:24:39] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22 > [00:31:08] [125] [00:24:50] Builder started > [00:31:08] [125] [00:00:00] Building print/t1utils | t1utils-1.32 > [00:31:15] [33] [00:25:00] Builder started > [00:31:15] [81] [00:24:59] Builder started > [00:31:15] [33] [00:00:00] Building databases/xapian-core | > xapian-core-1.4.23,1 > [00:31:15] [13] [00:25:00] Builder started > [00:31:15] [81] [00:00:00] Building devel/bmake | bmake-20230723 > [00:31:15] [13] [00:00:00] Building devel/evdev-proto | evdev-proto-5.8 > [00:31:16] [41] [00:25:00] Builder started > [00:31:16] [41] [00:00:00] Building devel/pcre | pcre-8.45_3 > . . . > > (Looks like lang/go120 ignores the lack of ALLOW_MAKE_JOBS . > There may be others that still have signficant parallel > activity.) > > [main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [parallel_build:] Queued: > 34588 Built: 727 Failed: 1 Skipped: 40Ignored: 335 Fetched: 0 > Tobuild: 33485 Time: 01:36:51 > > (So about 1 hr after the last "Builder starting" it had > built 727.) > > The vast majority of the time: lots of cpdup's with tx->tx > showing most of the time for STATE but showing having some > CPU time. > > ^T commonly showed various Builders in starting PHASE for > 3min..6min. > > Around 66% mean Idle time (guess from watching top). > > After ^C "gstat -spod" reports it is almost always writing > 2200 to 2500 writes per second or so for *hours* (still > going on). > > ztop reports 1500 to 3200 d/s or so almost always for > Dataset zamd64/poudriere/data/.m instead (also still going > on). Normally no other Dataset is shown. > > With all the disk I/O activity, this is definitely "live" > in some sense. But I've no clue if it is just repeating > itself over and over vs. if it making some sort of progress. > > For reference for the ^C and after: > > ^C[01:39:00] [20] [00:00:03] Building sysutils/linux-c7-dosfstools | > linux-c7-dosfstools-3.0.20 > [01:39:00] [93] [00:07:12] Finished science/dimod | dimod-0.12.11: Success > [01:39:00] Error: Signal SIGINT caught, cleaning up and exiting > [01:39:02] [63] [00:06:34] Finished archivers/unarj | unarj-2.65_2: Success > [01:39:03] [128] [00:07:47] Finished sysutils/shuf | shuf-3.0: Success > [01:39:04] [113] [00:07:06] Finished devel/bsddialog | bsddialog-0.4.1: > Success > [main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [sigint:] Queued: 34588 > Built: 752 Failed: 1 Skipped: 40Ignored: 335 Fetched: 0 > Tobuild: 33460 Time: 01:38:56 > [01:39:06] Logs: > /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_13h48m45s > [01:39:14] [12] [00:09:07] Finished archivers/rzip | rzip-2.1_1: Success > [01:39:14] Cleaning up > exit: cannot open ./var/run/49_nohang.pid: No such file or directory > exit: cannot open ./var/run/87_nohang.pid: No such file or directory > > After that ^t produced the likes of: > > load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% > 13004k > #0 0x80b6f103 at mi_switch+0x173 > #1 0x80bc0f24 at sleepq_switch+0x104 > #2 0x80aec4c5 at _cv_wait+0x165 > #3 0x82aba365 at txg_wait_open+0xf5 > #4 0x82a11b81 at dmu_free_long_range+0x151 > #5 0x829a87d2 at zfs_rmnode+0x72 > #6 0x829b658d at zfs_freebsd_reclaim+0x3d > #7 0x8113a495 at VOP_RECLAIM_APV+0x35 > #8 0x80c5a7d9 at vgonel+0x3a9 > #9 0x80c5af7f at vrecycle+0x3f > #10 0x829b643e at zfs_freebsd_inactive+0x4e > #11 0x80c598cf at vinactivef+0xbf > #12 0x80c590da at vput_final+0x2aa > #13 0x80c68886 at kern_funlinkat+0x2f6 > #14 0x80c68588 at sys_unlink+0x28 > #15 0x8106323f at amd64_syscall+0x14f > #16 0x8103512b at fast_syscall_comm
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 3, 2023, at 23:35, Mark Millard wrote: > On Sep 3, 2023, at 22:06, Alexander Motin wrote: > >> >> On 03.09.2023 22:54, Mark Millard wrote: >>> After that ^t produced the likes of: >>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s >>> 1% 13004k >> >> So the full state is not "tx->tx", but is actually a >> "tx->tx_quiesce_done_cv", which means the thread is waiting for new >> transaction to be opened, which means some previous to be quiesced and then >> synced. >> >>> #0 0x80b6f103 at mi_switch+0x173 >>> #1 0x80bc0f24 at sleepq_switch+0x104 >>> #2 0x80aec4c5 at _cv_wait+0x165 >>> #3 0x82aba365 at txg_wait_open+0xf5 >>> #4 0x82a11b81 at dmu_free_long_range+0x151 >> >> Here it seems like transaction commit is waited due to large amount of >> delete operations, which ZFS tries to spread between separate TXGs. > > That fit the context: cleaning out /usr/local/poudriere/data/.m/ > >> You should probably see some large and growing number in sysctl >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . > > After the reboot I started a -J64 example. It has avoided the > early "witness exhausted". Again I ^C'd after about an hours > after the 2nd builder had started. So: again cleaning out > /usr/local/poudriere/data/.m/ Only seconds between: > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 > > I have found a measure of progress: zfs list's USED > for /usr/local/poudriere/data/.m is decreasing. So > ztop's d/s was a good classification: deletes. > >>> #5 0x829a87d2 at zfs_rmnode+0x72 >>> #6 0x829b658d at zfs_freebsd_reclaim+0x3d >>> #7 0x8113a495 at VOP_RECLAIM_APV+0x35 >>> #8 0x80c5a7d9 at vgonel+0x3a9 >>> #9 0x80c5af7f at vrecycle+0x3f >>> #10 0x829b643e at zfs_freebsd_inactive+0x4e >>> #11 0x80c598cf at vinactivef+0xbf >>> #12 0x80c590da at vput_final+0x2aa >>> #13 0x80c68886 at kern_funlinkat+0x2f6 >>> #14 0x80c68588 at sys_unlink+0x28 >>> #15 0x8106323f at amd64_syscall+0x14f >>> #16 0x8103512b at fast_syscall_common+0xf8 >> >> What we don't see here is what quiesce and sync threads of the pool are >> actually doing. Sync thread has plenty of different jobs, including async >> write, async destroy, scrub and others, that all may delay each other. >> >> Before you rebooted the system, depending how alive it is, could you save a >> number of outputs of `procstat -akk`, or at least specifically `procstat >> -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe >> what they are doing. > > # procstat -akk > ~/mmjnk00.txt > # procstat -akk > ~/mmjnk01.txt > # procstat -akk > ~/mmjnk02.txt > # procstat -akk > ~/mmjnk03.txt > # procstat -akk > ~/mmjnk04.txt > # procstat -akk > ~/mmjnk05.txt > # grep txg_thread_enter ~/mmjnk0[0-5].txt > /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 02:00, Mark Millard wrote: > On Sep 3, 2023, at 23:35, Mark Millard wrote: > >> On Sep 3, 2023, at 22:06, Alexander Motin wrote: >> >>> >>> On 03.09.2023 22:54, Mark Millard wrote: After that ^t produced the likes of: load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k >>> >>> So the full state is not "tx->tx", but is actually a >>> "tx->tx_quiesce_done_cv", which means the thread is waiting for new >>> transaction to be opened, which means some previous to be quiesced and then >>> synced. >>> #0 0x80b6f103 at mi_switch+0x173 #1 0x80bc0f24 at sleepq_switch+0x104 #2 0x80aec4c5 at _cv_wait+0x165 #3 0x82aba365 at txg_wait_open+0xf5 #4 0x82a11b81 at dmu_free_long_range+0x151 >>> >>> Here it seems like transaction commit is waited due to large amount of >>> delete operations, which ZFS tries to spread between separate TXGs. >> >> That fit the context: cleaning out /usr/local/poudriere/data/.m/ >> >>> You should probably see some large and growing number in sysctl >>> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . >> >> After the reboot I started a -J64 example. It has avoided the >> early "witness exhausted". Again I ^C'd after about an hours >> after the 2nd builder had started. So: again cleaning out >> /usr/local/poudriere/data/.m/ Only seconds between: >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 >> >> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 >> >> I have found a measure of progress: zfs list's USED >> for /usr/local/poudriere/data/.m is decreasing. So >> ztop's d/s was a good classification: deletes. >> #5 0x829a87d2 at zfs_rmnode+0x72 #6 0x829b658d at zfs_freebsd_reclaim+0x3d #7 0x8113a495 at VOP_RECLAIM_APV+0x35 #8 0x80c5a7d9 at vgonel+0x3a9 #9 0x80c5af7f at vrecycle+0x3f #10 0x829b643e at zfs_freebsd_inactive+0x4e #11 0x80c598cf at vinactivef+0xbf #12 0x80c590da at vput_final+0x2aa #13 0x80c68886 at kern_funlinkat+0x2f6 #14 0x80c68588 at sys_unlink+0x28 #15 0x8106323f at amd64_syscall+0x14f #16 0x8103512b at fast_syscall_common+0xf8 >>> >>> What we don't see here is what quiesce and sync threads of the pool are >>> actually doing. Sync thread has plenty of different jobs, including async >>> write, async destroy, scrub and others, that all may delay each other. >>> >>> Before you rebooted the system, depending how alive it is, could you save a >>> number of outputs of `procstat -akk`, or at least specifically `procstat >>> -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe >>> what they are doing. >> >> # procstat -akk > ~/mmjnk00.txt >> # procstat -akk > ~/mmjnk01.txt >> # procstat -akk > ~/mmjnk02.txt >> # procstat -akk > ~/mmjnk03.txt >> # procstat -akk > ~/mmjnk04.txt >> # procstat -akk > ~/mmjnk05.txt >> # grep txg_thread_enter ~/mmjnk0[0-5].txt >> /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb >> txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter >>mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b >> _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 >> txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe >> /usr/home/root/mmjnk03.txt:
Re: Speed improvements in ZFS
On 9/4/23, Alexander Leidinger wrote: > Am 2023-08-28 22:33, schrieb Alexander Leidinger: >> Am 2023-08-22 18:59, schrieb Mateusz Guzik: >>> On 8/22/23, Alexander Leidinger wrote: Am 2023-08-21 10:53, schrieb Konstantin Belousov: > On Mon, Aug 21, 2023 at 08:19:28AM +0200, Alexander Leidinger wrote: >> Am 2023-08-20 23:17, schrieb Konstantin Belousov: >> > On Sun, Aug 20, 2023 at 11:07:08PM +0200, Mateusz Guzik wrote: >> > > On 8/20/23, Alexander Leidinger wrote: >> > > > Am 2023-08-20 22:02, schrieb Mateusz Guzik: >> > > >> On 8/20/23, Alexander Leidinger >> > > >> wrote: >> > > >>> Am 2023-08-20 19:10, schrieb Mateusz Guzik: >> > > On 8/18/23, Alexander Leidinger >> > > wrote: >> > > >>> >> > > > I have a 51MB text file, compressed to about 1MB. Are you >> > > > interested >> > > > to >> > > > get it? >> > > > >> > > >> > > Your problem is not the vnode limit, but nullfs. >> > > >> > > https://people.freebsd.org/~mjg/netchild-periodic-find.svg >> > > >>> >> > > >>> 122 nullfs mounts on this system. And every jail I setup has >> > > >>> several >> > > >>> null mounts. One basesystem mounted into every jail, and then >> > > >>> shared >> > > >>> ports (packages/distfiles/ccache) across all of them. >> > > >>> >> > > First, some of the contention is notorious VI_LOCK in order >> > > to >> > > do >> > > anything. >> > > >> > > But more importantly the mind-boggling off-cpu time comes >> > > from >> > > exclusive locking which should not be there to begin with -- >> > > as >> > > in >> > > that xlock in stat should be a slock. >> > > >> > > Maybe I'm going to look into it later. >> > > >>> >> > > >>> That would be fantastic. >> > > >>> >> > > >> >> > > >> I did a quick test, things are shared locked as expected. >> > > >> >> > > >> However, I found the following: >> > > >> if ((xmp->nullm_flags & NULLM_CACHE) != 0) { >> > > >> mp->mnt_kern_flag |= >> > > >> lowerrootvp->v_mount->mnt_kern_flag & >> > > >> (MNTK_SHARED_WRITES | MNTK_LOOKUP_SHARED | >> > > >> MNTK_EXTENDED_SHARED); >> > > >> } >> > > >> >> > > >> are you using the "nocache" option? it has a side effect of >> > > >> xlocking >> > > > >> > > > I use noatime, noexec, nosuid, nfsv4acls. I do NOT use nocache. >> > > > >> > > >> > > If you don't have "nocache" on null mounts, then I don't see how >> > > this >> > > could happen. >> > >> > There is also MNTK_NULL_NOCACHE on lower fs, which is currently set >> > for >> > fuse and nfs at least. >> >> 11 of those 122 nullfs mounts are ZFS datasets which are also NFS >> exported. >> 6 of those nullfs mounts are also exported via Samba. The NFS >> exports >> shouldn't be needed anymore, I will remove them. > By nfs I meant nfs client, not nfs exports. No NFS client mounts anywhere on this system. So where is this exclusive lock coming from then... This is a ZFS system. 2 pools: one for the root, one for anything I need space for. Both pools reside on the same disks. The root pool is a 3-way mirror, the "space-pool" is a 5-disk raidz2. All jails are on the space-pool. The jails are all basejail-style jails. >>> >>> While I don't see why xlocking happens, you should be able to dtrace >>> or printf your way into finding out. >> >> dtrace looks to me like a faster approach to get to the root than >> printf... my first naive try is to detect exclusive locks. I'm not 100% >> sure I got it right, but at least dtrace doesn't complain about it: >> ---snip--- >> #pragma D option dynvarsize=32m >> >> fbt:nullfs:null_lock:entry >> /args[0]->a_flags & 0x08 != 0/ >> { >> stack(); >> } >> ---snip--- >> >> In which direction should I look with dtrace if this works in tonights >> run of periodic? I don't have enough knowledge about VFS to come up >> with some immediate ideas. > > After your sysctl fix for maxvnodes I increased the amount of vnodes 10 > times compared to the initial report. This has increased the speed of > the operation, the find runs in all those jails finished today after ~5h > (@~8am) instead of in the afternoon as before. Could this suggest that > in parallel some null_reclaim() is running which does the exclusive > locks and slows down the entire operation? > That may be a slowdown to some extent, but the primary problem is exclusive vnode locking for stat lookup, which should not be happening. -- Mateusz Guzik
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On 04.09.2023 05:56, Mark Millard wrote: On Sep 4, 2023, at 02:00, Mark Millard wrote: On Sep 3, 2023, at 23:35, Mark Millard wrote: On Sep 3, 2023, at 22:06, Alexander Motin wrote: On 03.09.2023 22:54, Mark Millard wrote: After that ^t produced the likes of: load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", which means the thread is waiting for new transaction to be opened, which means some previous to be quiesced and then synced. #0 0x80b6f103 at mi_switch+0x173 #1 0x80bc0f24 at sleepq_switch+0x104 #2 0x80aec4c5 at _cv_wait+0x165 #3 0x82aba365 at txg_wait_open+0xf5 #4 0x82a11b81 at dmu_free_long_range+0x151 Here it seems like transaction commit is waited due to large amount of delete operations, which ZFS tries to spread between separate TXGs. That fit the context: cleaning out /usr/local/poudriere/data/.m/ You should probably see some large and growing number in sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . After the reboot I started a -J64 example. It has avoided the early "witness exhausted". Again I ^C'd after about an hours after the 2nd builder had started. So: again cleaning out /usr/local/poudriere/data/.m/ Only seconds between: # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 As expected, deletes trigger and wait for TXG commits. I have found a measure of progress: zfs list's USED for /usr/local/poudriere/data/.m is decreasing. So ztop's d/s was a good classification: deletes. #5 0x829a87d2 at zfs_rmnode+0x72 #6 0x829b658d at zfs_freebsd_reclaim+0x3d #7 0x8113a495 at VOP_RECLAIM_APV+0x35 #8 0x80c5a7d9 at vgonel+0x3a9 #9 0x80c5af7f at vrecycle+0x3f #10 0x829b643e at zfs_freebsd_inactive+0x4e #11 0x80c598cf at vinactivef+0xbf #12 0x80c590da at vput_final+0x2aa #13 0x80c68886 at kern_funlinkat+0x2f6 #14 0x80c68588 at sys_unlink+0x28 #15 0x8106323f at amd64_syscall+0x14f #16 0x8103512b at fast_syscall_common+0xf8 What we don't see here is what quiesce and sync threads of the pool are actually doing. Sync thread has plenty of different jobs, including async write, async destroy, scrub and others, that all may delay each other. Before you rebooted the system, depending how alive it is, could you save a number of outputs of `procstat -akk`, or at least specifically `procstat -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe what they are doing. # grep txg_thread_enter ~/mmjnk0[0-5].txt /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk03.txt:6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0
Re: Possible issue with linux xattr support?
Hello Dmitry, * Dmitry Chagin [20230830 14:04]: > Thanks, I see, I agree with your change, taken into account. Thanks a lot for committing the fix/workaround as well as the new feature! I just upgraded my test builders to 15-CURRENT to double-check. For some reason (I still have to try to get more information about it, will do soon), something else is broken now. While install from GNU coreutils works fine, some build systems use 'cp -p' to install files instead, which now fails with: | cp: preserving permissions for [...]: Operation not permitted I'm pretty sure this worked fine with the patch you linked earlier, and also with my own change taking the namespace into account ... Did you change anything else in your commits compared to the patch? I'll post a followup as soon as I could get some truss output to know exactly which syscall causes the issue now! Cheers, Felix -- Felix Palmen {private} fe...@palmen-it.de -- ports committer -- {web} http://palmen-it.de {pgp public key} http://palmen-it.de/pub.txt {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231 signature.asc Description: PGP signature
Re: Possible issue with linux xattr support?
* Felix Palmen [20230904 15:39]: > For some reason (I still have to try to get more information about it, > will do soon), something else is broken now. While install from GNU > coreutils works fine, some build systems use 'cp -p' to install files > instead, which now fails with: > > | cp: preserving permissions for [...]: Operation not permitted The issue seems to be with listing attributes: | root@15-default:/wrkdirs/usr/ports/x11/linuxsrc-libxcb/work/libxcb-1.15/doc # truss /compat/linux/bin/cp -pR ./tutorial '/wrkdirs/usr/ports/x11/linuxsrc-libxcb/work/stage/compat/linux/usr/share/doc/libxcb/' 2>&1 | grep xattr | linux_flistxattr(0x4,0x0,0x0)ERR#-1 'Operation not permitted' | linux_flistxattr(0x4,0x0,0x0)ERR#-1 'Operation not permitted' | linux_llistxattr(0xcdd9,0x0,0x0) ERR#-1 'Operation not permitted' The following q&d patch makes it work again: #v+ diff --git a/sys/compat/linux/linux_xattr.c b/sys/compat/linux/linux_xattr.c index 74b47f1cbaec..0b5af084b1b1 100644 --- a/sys/compat/linux/linux_xattr.c +++ b/sys/compat/linux/linux_xattr.c @@ -198,7 +198,7 @@ listxattr(struct thread *td, struct listxattr_args *args) if (error == 0) td->td_retval[0] = cnt; free(data, M_LINUX); - return (error_to_xattrerror(attrnamespace, error)); + return (error_to_xattrerror(EXTATTR_NAMESPACE_SYSTEM, error)); } int #v- I think this makes sense, because listxattr iterates over all namespaces, so there's no sane way to know whether the EPERM was caused by trying to access "system" or something else. Cheers, Felix -- Felix Palmen {private} fe...@palmen-it.de -- ports committer -- {web} http://palmen-it.de {pgp public key} http://palmen-it.de/pub.txt {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231 signature.asc Description: PGP signature
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 06:09, Alexander Motin wrote: > On 04.09.2023 05:56, Mark Millard wrote: >> On Sep 4, 2023, at 02:00, Mark Millard wrote: >>> On Sep 3, 2023, at 23:35, Mark Millard wrote: On Sep 3, 2023, at 22:06, Alexander Motin wrote: > On 03.09.2023 22:54, Mark Millard wrote: >> After that ^t produced the likes of: >> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u >> 121.32s 1% 13004k > > So the full state is not "tx->tx", but is actually a > "tx->tx_quiesce_done_cv", which means the thread is waiting for new > transaction to be opened, which means some previous to be quiesced and > then synced. > >> #0 0x80b6f103 at mi_switch+0x173 >> #1 0x80bc0f24 at sleepq_switch+0x104 >> #2 0x80aec4c5 at _cv_wait+0x165 >> #3 0x82aba365 at txg_wait_open+0xf5 >> #4 0x82a11b81 at dmu_free_long_range+0x151 > > Here it seems like transaction commit is waited due to large amount of > delete operations, which ZFS tries to spread between separate TXGs. That fit the context: cleaning out /usr/local/poudriere/data/.m/ > You should probably see some large and growing number in sysctl > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . After the reboot I started a -J64 example. It has avoided the early "witness exhausted". Again I ^C'd after about an hours after the 2nd builder had started. So: again cleaning out /usr/local/poudriere/data/.m/ Only seconds between: # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 > > As expected, deletes trigger and wait for TXG commits. > I have found a measure of progress: zfs list's USED for /usr/local/poudriere/data/.m is decreasing. So ztop's d/s was a good classification: deletes. >> #5 0x829a87d2 at zfs_rmnode+0x72 >> #6 0x829b658d at zfs_freebsd_reclaim+0x3d >> #7 0x8113a495 at VOP_RECLAIM_APV+0x35 >> #8 0x80c5a7d9 at vgonel+0x3a9 >> #9 0x80c5af7f at vrecycle+0x3f >> #10 0x829b643e at zfs_freebsd_inactive+0x4e >> #11 0x80c598cf at vinactivef+0xbf >> #12 0x80c590da at vput_final+0x2aa >> #13 0x80c68886 at kern_funlinkat+0x2f6 >> #14 0x80c68588 at sys_unlink+0x28 >> #15 0x8106323f at amd64_syscall+0x14f >> #16 0x8103512b at fast_syscall_common+0xf8 > > What we don't see here is what quiesce and sync threads of the pool are > actually doing. Sync thread has plenty of different jobs, including > async write, async destroy, scrub and others, that all may delay each > other. > > Before you rebooted the system, depending how alive it is, could you save > a number of outputs of `procstat -akk`, or at least specifically > `procstat -akk | grep txg_thread_enter` if the full is hard? Or somehow > else observe what they are doing. # grep txg_thread_enter ~/mmjnk0[0-5].txt /usr/home/root/mmjnk00.txt:6 100881 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk00.txt:6 100882 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100881 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk01.txt:6 100882 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100881 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe /usr/home/root/mmjnk02.txt:6 100882 zfskern txg_thread_entermi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio
14.0-CURRENT boots fine but keyboard does not work
I have a 14.0-CURRENT compiled from sources of head from August 4, which boots fine from a produced USB key, but the keyboard does not work on an Acer C720 (amd64), on other laptops the keyboard is fine. The keyboard works during the boot menu (for example to enable verbose boot messages) but not on the login: prompt of the booted system. I've enabled SSH access into the C720 (if someone need more information) and I'm attaching /var/log/messages of the booted system. Any ideas? matthias -- Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub Sep 4 13:44:55 C720-1400094 syslogd: kernel boot file is /boot/kernel/kernel Sep 4 13:44:55 C720-1400094 kernel: ---<>--- Sep 4 13:44:55 C720-1400094 kernel: APIC: Using the MADT enumerator. Sep 4 13:44:55 C720-1400094 kernel: Copyright (c) 1992-2023 The FreeBSD Project. Sep 4 13:44:55 C720-1400094 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 Sep 4 13:44:55 C720-1400094 kernel:The Regents of the University of California. All rights reserved. Sep 4 13:44:55 C720-1400094 kernel: FreeBSD is a registered trademark of The FreeBSD Foundation. Sep 4 13:44:55 C720-1400094 kernel: FreeBSD 14.0-CURRENT amd64 1400094 #0 main-n264568-1d7ffb373c9d: Fri Aug 18 12:55:08 CEST 2023 Sep 4 13:44:55 C720-1400094 kernel: guru@jet:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 Sep 4 13:44:55 C720-1400094 kernel: FreeBSD clang version 16.0.6 (https://github.com/llvm/llvm-project.git llvmorg-16.0.6-0-g7cbf1a259152) Sep 4 13:44:55 C720-1400094 kernel: WARNING: WITNESS option enabled, expect reduced performance. Sep 4 13:44:55 C720-1400094 kernel: PPIM 0: PA=0xa, VA=0x8281, size=0x1, mode=0 Sep 4 13:44:55 C720-1400094 kernel: pmap: large map 8 PML4 slots (4096 GB) Sep 4 13:44:55 C720-1400094 kernel: VT(vga): resolution 640x480 Sep 4 13:44:55 C720-1400094 kernel: Preloaded elf kernel "/boot/kernel/kernel" at 0x81fee540. Sep 4 13:44:55 C720-1400094 kernel: Preloaded boot_entropy_cache "/boot/entropy" at 0x81ff77c8. Sep 4 13:44:55 C720-1400094 kernel: Preloaded TSLOG data "TSLOG" at 0x81ff7820. Sep 4 13:44:55 C720-1400094 kernel: Early TSC frequency 1396833720Hz calibrated from 8254 PIT Sep 4 13:44:55 C720-1400094 kernel: CPU: Intel(R) Celeron(R) 2955U @ 1.40GHz (1396.83-MHz K8-class CPU) Sep 4 13:44:55 C720-1400094 kernel: Origin="GenuineIntel" Id=0x40651 Family=0x6 Model=0x45 Stepping=1 Sep 4 13:44:55 C720-1400094 kernel: Features=0xbfebfbff Sep 4 13:44:55 C720-1400094 kernel: Features2=0x4ddaebbf Sep 4 13:44:55 C720-1400094 kernel: AMD Features=0x2c100800 Sep 4 13:44:55 C720-1400094 kernel: AMD Features2=0x21 Sep 4 13:44:55 C720-1400094 kernel: Structured Extended Features=0x2603 Sep 4 13:44:55 C720-1400094 kernel: XSAVE Features=0x1 Sep 4 13:44:55 C720-1400094 kernel: VT-x: (disabled in BIOS) Basic Features=0xda0400 Sep 4 13:44:55 C720-1400094 kernel: Pin-Based Controls=0x7f Sep 4 13:44:55 C720-1400094 kernel: Primary Processor Controls=0xfff9fffe Sep 4 13:44:55 C720-1400094 kernel: Secondary Processor Controls=0x1cff Sep 4 13:44:55 C720-1400094 kernel: Exit Controls=0xda0400 Sep 4 13:44:55 C720-1400094 kernel: Entry Controls=0xda0400 Sep 4 13:44:55 C720-1400094 kernel: EPT Features=0x6334141 Sep 4 13:44:55 C720-1400094 kernel: VPID Features=0xf01 Sep 4 13:44:55 C720-1400094 kernel: TSC: P-state invariant, performance statistics Sep 4 13:44:55 C720-1400094 kernel: Data TLB: 2 MByte or 4 MByte pages, 4-way set associative, 32 entries and a separate array with 1 GByte pages, 4-way set associative, 4 entries Sep 4 13:44:55 C720-1400094 kernel: Data TLB: 4 KB pages, 4-way set associative, 64 entries Sep 4 13:44:55 C720-1400094 kernel: Instruction TLB: 2M/4M pages, fully associative, 8 entries Sep 4 13:44:55 C720-1400094 kernel: Instruction TLB: 4KByte pages, 8-way set associative, 128 entries Sep 4 13:44:55 C720-1400094 kernel: 64-Byte prefetching Sep 4 13:44:55 C720-1400094 kernel: Shared 2nd-Level TLB: 4 KByte/2MByte pages, 8-way associative, 1024 entries Sep 4 13:44:55 C720-1400094 kernel: L2 cache: 256 kbytes, 8-way associative, 64 bytes/line Sep 4 13:44:55 C720-1400094 kernel: real memory = 4301258752 (4102 MB) Sep 4 13:44:55 C720-1400094 kernel: Physical memory chunk(s): Sep 4 13:44:55 C720-1400094 kernel: 0x0001 - 0x0002, 131072 bytes (32 pages) Sep 4 13:44:55 C720-1400094 kernel: 0x0004 - 0x0009dfff, 385024 bytes (94 pages) Sep 4 13:44:55 C720-1400094 kernel: 0x0010 - 0x001f, 1048576 bytes (256 pages) Sep 4 13:44:55 C720-1400094 kernel: 0x02402000 - 0x751aefff, 1926942720 bytes (470445 pages) Sep 4 13:44:55 C720-1400094 kernel: 0x7bc0 - 0x7bcf3fff, 9
Re: 14.0-CURRENT boots fine but keyboard does not work
On Mon, 4 Sep 2023 18:43:11 +0200 Matthias Apitz wrote: > I have a 14.0-CURRENT compiled from sources of head from August 4, > which boots fine from a produced USB key, but the keyboard does not > work on an Acer C720 (amd64), on other laptops the keyboard is fine. > > The keyboard works during the boot menu (for example to enable verbose > boot messages) but not on the login: prompt of the booted system. > > I've enabled SSH access into the C720 (if someone need more > information) and I'm attaching /var/log/messages of the booted system. Hi Matthias, The C720 required special patches for the keyboard to work, which I originally added here: https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b I assume that something in that area changed recently. Without digging into it, this looks like a possible cause: https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 atkbd: Disable periodic polling by default. It is one of the few remaining Giant-locked callouts. It would be good to remove it, not mentioning that polling itself is not good. If this cause keyboard/mouse freezes on some hardware, please set loader tunable hw.atkbd.hz=1 as workaround and report the issue. So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in /boot/loader.conf, then reboot and see if it helps. Best Michael -- Michael Gmelin
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On 04.09.2023 11:45, Mark Millard wrote: On Sep 4, 2023, at 06:09, Alexander Motin wrote: per_txg_dirty_frees_percent is directly related to the delete delays we see here. You are forcing ZFS to commit transactions each 5% of dirty ARC limit, which is 5% of 10% or memory size. I haven't looked on that code recently, but I guess setting it too low can make ZFS commit transactions too often, increasing write inflation for the underlying storage. I would propose you to restore the default and try again. While this machine is different, the original problem was worse than the issue here: the load average was less than 1 for the most part the parallel bulk build when 30 was used. The fraction of time waiting was much longer than with 5. If I understand right, both too high and too low for a type of context can lead to increased elapsed time and getting it set to a near optimal is a non-obvious exploration. IIRC this limit was modified several times since originally implemented. May be it could benefit from another look, if default 30% is not good. It would be good if generic ZFS issues like this were reported to OpenZFS upstream to be visible to a wider public. Unfortunately I have several other project I must work on, so if it is not a regression I can't promise I'll take it right now, so anybody else is welcome. An overall point for the goal of my activity is: what makes a good test context for checking if ZFS is again safe to use? May be other tradeoffs make, say, 4 hardware threads more reasonable than 32. Thank you for your testing. The best test is one that nobody else run. It also correlates with the topic of "safe to use", which also depends on what it is used for. :) -- Alexander Motin
Re: 14.0-CURRENT boots fine but keyboard does not work
Added Alexander Motin to To: as the origin of the CI; Neither hw.atkbd.hz=1 nor hw.atkbd.hz=10 makes the keyboard working on my beloved Acer C720. Should I file a new PR? Thanks matthias El día lunes, septiembre 04, 2023 a las 06:55:52p. m. +0200, Michael Gmelin escribió: > > > On Mon, 4 Sep 2023 18:43:11 +0200 > Matthias Apitz wrote: > > > I have a 14.0-CURRENT compiled from sources of head from August 4, > > which boots fine from a produced USB key, but the keyboard does not > > work on an Acer C720 (amd64), on other laptops the keyboard is fine. > > > > The keyboard works during the boot menu (for example to enable verbose > > boot messages) but not on the login: prompt of the booted system. > > > > I've enabled SSH access into the C720 (if someone need more > > information) and I'm attaching /var/log/messages of the booted system. > > Hi Matthias, > > The C720 required special patches for the keyboard to work, which I > originally added here: > https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b > > I assume that something in that area changed recently. > > Without digging into it, this looks like a possible cause: > > > https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 > > atkbd: Disable periodic polling by default. > It is one of the few remaining Giant-locked callouts. It would be > good to remove it, not mentioning that polling itself is not good. > > If this cause keyboard/mouse freezes on some hardware, please set > loader tunable hw.atkbd.hz=1 as workaround and report the issue. > > So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in > /boot/loader.conf, then reboot and see if it helps. > > Best > Michael > > -- > Michael Gmelin > -- Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: 14.0-CURRENT boots fine but keyboard does not work
> On 4. Sep 2023, at 19:23, Matthias Apitz wrote: > > > Added Alexander Motin to To: as the origin of the CI; > > Neither hw.atkbd.hz=1 nor hw.atkbd.hz=10 makes the keyboard working on > my beloved Acer C720. Should I file a new PR? > Filing a PR makes sense, could you please Cc me on it? Do you know which version of FreeBSD was the last that worked for you? Cheers > > >> El día lunes, septiembre 04, 2023 a las 06:55:52p. m. +0200, Michael Gmelin >> escribió: >> >> >> >> On Mon, 4 Sep 2023 18:43:11 +0200 >> Matthias Apitz wrote: >> >>> I have a 14.0-CURRENT compiled from sources of head from August 4, >>> which boots fine from a produced USB key, but the keyboard does not >>> work on an Acer C720 (amd64), on other laptops the keyboard is fine. >>> >>> The keyboard works during the boot menu (for example to enable verbose >>> boot messages) but not on the login: prompt of the booted system. >>> >>> I've enabled SSH access into the C720 (if someone need more >>> information) and I'm attaching /var/log/messages of the booted system. >> >> Hi Matthias, >> >> The C720 required special patches for the keyboard to work, which I >> originally added here: >> https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b >> >> I assume that something in that area changed recently. >> >> Without digging into it, this looks like a possible cause: >> >> >> https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 >> >> atkbd: Disable periodic polling by default. >> It is one of the few remaining Giant-locked callouts. It would be >> good to remove it, not mentioning that polling itself is not good. >> >> If this cause keyboard/mouse freezes on some hardware, please set >> loader tunable hw.atkbd.hz=1 as workaround and report the issue. >> >> So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in >> /boot/loader.conf, then reboot and see if it helps. >> >> Best >> Michael >> >> -- >> Michael Gmelin >> > > -- > Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 > Public GnuPG key: http://www.unixarea.de/key.pub
Re: 14.0-CURRENT boots fine but keyboard does not work
El día lunes, septiembre 04, 2023 a las 07:29:41p. m. +0200, Michael Gmelin escribió: > > > > On 4. Sep 2023, at 19:23, Matthias Apitz wrote: > > > > > > Added Alexander Motin to To: as the origin of the CI; > > > > Neither hw.atkbd.hz=1 nor hw.atkbd.hz=10 makes the keyboard working on > > my beloved Acer C720. Should I file a new PR? > > > > Filing a PR makes sense, could you please Cc me on it? > > Do you know which version of FreeBSD was the last that worked for you? I'm actually using (and typing this on it) r368166. Will file a PR tomorrow. Thanks matthias > >> El día lunes, septiembre 04, 2023 a las 06:55:52p. m. +0200, Michael > >> Gmelin escribió: > >> > >> > >> > >> On Mon, 4 Sep 2023 18:43:11 +0200 > >> Matthias Apitz wrote: > >> > >>> I have a 14.0-CURRENT compiled from sources of head from August 4, > >>> which boots fine from a produced USB key, but the keyboard does not > >>> work on an Acer C720 (amd64), on other laptops the keyboard is fine. > >>> > >>> The keyboard works during the boot menu (for example to enable verbose > >>> boot messages) but not on the login: prompt of the booted system. > >>> > >>> I've enabled SSH access into the C720 (if someone need more > >>> information) and I'm attaching /var/log/messages of the booted system. > >> > >> Hi Matthias, > >> > >> The C720 required special patches for the keyboard to work, which I > >> originally added here: > >> https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b > >> > >> I assume that something in that area changed recently. > >> > >> Without digging into it, this looks like a possible cause: > >> > >> > >> https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 > >> > >> atkbd: Disable periodic polling by default. > >> It is one of the few remaining Giant-locked callouts. It would be > >> good to remove it, not mentioning that polling itself is not good. > >> > >> If this cause keyboard/mouse freezes on some hardware, please set > >> loader tunable hw.atkbd.hz=1 as workaround and report the issue. > >> > >> So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in > >> /boot/loader.conf, then reboot and see if it helps. > >> > >> Best > >> Michael > >> > >> -- > >> Michael Gmelin > >> > > > > -- > > Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 > > Public GnuPG key: http://www.unixarea.de/key.pub > > -- Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub
Re: 14.0-CURRENT boots fine but keyboard does not work
> On 4. Sep 2023, at 19:34, Matthias Apitz wrote: > > El día lunes, septiembre 04, 2023 a las 07:29:41p. m. +0200, Michael Gmelin > escribió: > >> >> On 4. Sep 2023, at 19:23, Matthias Apitz wrote: >>> >>> >>> Added Alexander Motin to To: as the origin of the CI; >>> >>> Neither hw.atkbd.hz=1 nor hw.atkbd.hz=10 makes the keyboard working on >>> my beloved Acer C720. Should I file a new PR? >>> >> >> Filing a PR makes sense, could you please Cc me on it? >> >> Do you know which version of FreeBSD was the last that worked for you? > > I'm actually using (and typing this on it) r368166. Will file a PR > tomorrow. Thanks > This could also be related: https://cgit.freebsd.org/src/commit/?id=319d2bf407b3762da6f1c67ffe8dce2fee587aaf You could try to undo that patch and build a new kernel. Best Michael >matthias > El día lunes, septiembre 04, 2023 a las 06:55:52p. m. +0200, Michael Gmelin escribió: On Mon, 4 Sep 2023 18:43:11 +0200 Matthias Apitz wrote: > I have a 14.0-CURRENT compiled from sources of head from August 4, > which boots fine from a produced USB key, but the keyboard does not > work on an Acer C720 (amd64), on other laptops the keyboard is fine. > > The keyboard works during the boot menu (for example to enable verbose > boot messages) but not on the login: prompt of the booted system. > > I've enabled SSH access into the C720 (if someone need more > information) and I'm attaching /var/log/messages of the booted system. Hi Matthias, The C720 required special patches for the keyboard to work, which I originally added here: https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b I assume that something in that area changed recently. Without digging into it, this looks like a possible cause: https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 atkbd: Disable periodic polling by default. It is one of the few remaining Giant-locked callouts. It would be good to remove it, not mentioning that polling itself is not good. If this cause keyboard/mouse freezes on some hardware, please set loader tunable hw.atkbd.hz=1 as workaround and report the issue. So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in /boot/loader.conf, then reboot and see if it helps. Best Michael -- Michael Gmelin >>> >>> -- >>> Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 >>> Public GnuPG key: http://www.unixarea.de/key.pub >> >> > > -- > Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 > Public GnuPG key: http://www.unixarea.de/key.pub
Re: 14.0-CURRENT boots fine but keyboard does not work
On Mon, Sep 4, 2023, 11:40 AM Michael Gmelin wrote: > > > On 4. Sep 2023, at 19:34, Matthias Apitz wrote: > > El día lunes, septiembre 04, 2023 a las 07:29:41p. m. +0200, Michael > Gmelin escribió: > > > > On 4. Sep 2023, at 19:23, Matthias Apitz wrote: > > > > > Added Alexander Motin to To: as the origin of the CI; > > > Neither hw.atkbd.hz=1 nor hw.atkbd.hz=10 makes the keyboard working on > > my beloved Acer C720. Should I file a new PR? > > > > Filing a PR makes sense, could you please Cc me on it? > > > Do you know which version of FreeBSD was the last that worked for you? > > > I'm actually using (and typing this on it) r368166. Will file a PR > tomorrow. Thanks > > > This could also be related: > > > https://cgit.freebsd.org/src/commit/?id=319d2bf407b3762da6f1c67ffe8dce2fee587aaf > > You could try to undo that patch and build a new kernel. > It shouldn't make a difference... but I may have been given bad advice if it did... if it's this one, I'll help sort it out. Warner Best > Michael > > > >matthias > > El día lunes, septiembre 04, 2023 a las 06:55:52p. m. +0200, Michael > Gmelin escribió: > > > > > On Mon, 4 Sep 2023 18:43:11 +0200 > > Matthias Apitz wrote: > > > I have a 14.0-CURRENT compiled from sources of head from August 4, > > which boots fine from a produced USB key, but the keyboard does not > > work on an Acer C720 (amd64), on other laptops the keyboard is fine. > > > The keyboard works during the boot menu (for example to enable verbose > > boot messages) but not on the login: prompt of the booted system. > > > I've enabled SSH access into the C720 (if someone need more > > information) and I'm attaching /var/log/messages of the booted system. > > > Hi Matthias, > > > The C720 required special patches for the keyboard to work, which I > > originally added here: > > > https://cgit.freebsd.org/src/commit/?id=6c176113bbdd598231ec47d161d4c3714997169b > > > I assume that something in that area changed recently. > > > Without digging into it, this looks like a possible cause: > > > > https://cgit.freebsd.org/src/commit/sys/dev/atkbdc/atkbd.c?id=ce881170088c4c98c036fe561f8ee8413c2e2585 > > > atkbd: Disable periodic polling by default. > > It is one of the few remaining Giant-locked callouts. It would be > > good to remove it, not mentioning that polling itself is not good. > > > If this cause keyboard/mouse freezes on some hardware, please set > > loader tunable hw.atkbd.hz=1 as workaround and report the issue. > > > So you could try to set hw.atkbd.hz=1 (or hw.atkbd.hz=10) in > > /boot/loader.conf, then reboot and see if it helps. > > > Best > > Michael > > > -- > > Michael Gmelin > > > > -- > > Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ > +49-176-38902045 > > Public GnuPG key: http://www.unixarea.de/key.pub > > > > > -- > Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ > +49-176-38902045 > Public GnuPG key: http://www.unixarea.de/key.pub > >
Re: Possible issue with linux xattr support?
On Mon, Sep 04, 2023 at 04:03:54PM +0200, Felix Palmen wrote: > * Felix Palmen [20230904 15:39]: > > For some reason (I still have to try to get more information about it, > > will do soon), something else is broken now. While install from GNU > > coreutils works fine, some build systems use 'cp -p' to install files > > instead, which now fails with: > > > > | cp: preserving permissions for [...]: Operation not permitted > > The issue seems to be with listing attributes: > > | root@15-default:/wrkdirs/usr/ports/x11/linuxsrc-libxcb/work/libxcb-1.15/doc > # truss /compat/linux/bin/cp -pR ./tutorial > '/wrkdirs/usr/ports/x11/linuxsrc-libxcb/work/stage/compat/linux/usr/share/doc/libxcb/' > 2>&1 | grep xattr > | linux_flistxattr(0x4,0x0,0x0)ERR#-1 'Operation not > permitted' > | linux_flistxattr(0x4,0x0,0x0)ERR#-1 'Operation not > permitted' > | linux_llistxattr(0xcdd9,0x0,0x0) ERR#-1 'Operation not > permitted' > > The following q&d patch makes it work again: > > #v+ > diff --git a/sys/compat/linux/linux_xattr.c b/sys/compat/linux/linux_xattr.c > index 74b47f1cbaec..0b5af084b1b1 100644 > --- a/sys/compat/linux/linux_xattr.c > +++ b/sys/compat/linux/linux_xattr.c > @@ -198,7 +198,7 @@ listxattr(struct thread *td, struct listxattr_args *args) > if (error == 0) > td->td_retval[0] = cnt; > free(data, M_LINUX); > - return (error_to_xattrerror(attrnamespace, error)); > + return (error_to_xattrerror(EXTATTR_NAMESPACE_SYSTEM, error)); > } > > int > #v- > > I think this makes sense, because listxattr iterates over all > namespaces, so there's no sane way to know whether the EPERM was caused > by trying to access "system" or something else. Thanks for the report, please, try this: https://people.freebsd.org/~dchagin/lxattr.patch Don't be surprised, there 2 fixes > > Cheers, Felix > > -- > Felix Palmen {private} fe...@palmen-it.de > -- ports committer -- {web} http://palmen-it.de > {pgp public key} http://palmen-it.de/pub.txt > {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231
Re: Possible issue with linux xattr support?
Hello Dmitry, * Dmitry Chagin [20230904 21:29]: > Thanks for the report, > please, try this: https://people.freebsd.org/~dchagin/lxattr.patch > Don't be surprised, there 2 fixes Thanks a lot for the quick patch! I have to admit I don't fully understand it immediately, but I assume it's an attempt to only list attributes not in system.* when that namespace is restricted? Well, it *does* solve the issue with 'cp -p' from GNU coreutils! I guess I'll now do a full rebuild of my linuxulator userspace branch on a kernel with that patch, just to be sure it won't break anything else, this will take a few hours, I will report back. So far looking really good :) Cheers, Felix -- Felix Palmen {private} fe...@palmen-it.de -- ports committer -- {web} http://palmen-it.de {pgp public key} http://palmen-it.de/pub.txt {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231 signature.asc Description: PGP signature
Re: Possible issue with linux xattr support?
On Mon, Sep 04, 2023 at 08:43:02PM +0200, Felix Palmen wrote: > Hello Dmitry, > > * Dmitry Chagin [20230904 21:29]: > > Thanks for the report, > > please, try this: https://people.freebsd.org/~dchagin/lxattr.patch > > Don't be surprised, there 2 fixes > > Thanks a lot for the quick patch! I have to admit I don't fully > understand it immediately, but I assume it's an attempt to only list > attributes not in system.* when that namespace is restricted? no, + if (error == EPERM) + break; mean that listxattr will fail if user is unauthorized to access system namespace, ie, listxattr will return ENOTSUP > > Well, it *does* solve the issue with 'cp -p' from GNU coreutils! > > I guess I'll now do a full rebuild of my linuxulator userspace branch on > a kernel with that patch, just to be sure it won't break anything else, > this will take a few hours, I will report back. So far looking really > good :) > nice, thanks > Cheers, Felix > > -- > Felix Palmen {private} fe...@palmen-it.de > -- ports committer -- {web} http://palmen-it.de > {pgp public key} http://palmen-it.de/pub.txt > {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231
Re: Possible issue with linux xattr support?
* Dmitry Chagin [20230904 21:49]: > On Mon, Sep 04, 2023 at 08:43:02PM +0200, Felix Palmen wrote: > > Thanks a lot for the quick patch! I have to admit I don't fully > > understand it immediately, but I assume it's an attempt to only list > > attributes not in system.* when that namespace is restricted? > > no, > + if (error == EPERM) > + break; > mean that listxattr will fail if user is unauthorized to access system > namespace, ie, listxattr will return ENOTSUP Ah, now I get it, so breaking out of the loop there will leave attrnamespace at EXTATTR_NAMESPACE_SYSTEM if that was the cause, therefore error_to_xattrerror() works as expected. Nice! Well, given that I really don't expect any further problems, but give my poor builder a few hours to *really* verify :) Cheers, Felix -- Felix Palmen {private} fe...@palmen-it.de -- ports committer -- {web} http://palmen-it.de {pgp public key} http://palmen-it.de/pub.txt {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231 signature.asc Description: PGP signature
FYI: ^T use during poudriere bulk vs. /bin/sh operation: I got a "Unsafe ckmalloc() call" Abort trap that left a mess
During a (zfs based) poudriere bulk -a run a ^T got a: Unsafe ckmalloc() call Abort trap (core dumped) from /bin/sh . AT that point it had built something like 350 ports into packages in .building/ . After the last finish was reported for the already building ports, I killed 2 stuck poudriere related processes (the rest were gone already) and then I used: # poudriere jail -jmain-amd64-bulk_a -k [00:06:03] Unmounting file systems umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/tests failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/share failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/lib32 failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/tests failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/share failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/lib32 failed: Device busy umount: unmount of /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue failed: Device busy chflags: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue/rcorder: Read-only file system chflags: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue/mdconfig: Read-only file system chflags: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue/mdmfs: Read-only file system . . . chflags: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/tests/lib: Read-only file system chflags: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/tests: Read-only file system rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/rescue: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/lib32: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/share: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr/tests: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/usr: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/dev: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/var/db/ports: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/var/db: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/var/empty: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/var: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/rescue: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/sbin/init: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/sbin: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/libexec/ld-elf.so.1: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/libexec/ld-elf32.so.1: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/libexec: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/packages: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/proc: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/distfiles: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/compat/linux/proc: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/compat/linux: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/compat: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/lib/libthr.so.3: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/lib/libc.so.7: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/lib/libcrypt.so.5: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/lib: Directory not empty rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/usr/src: Device busy rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/usr/bin/crontab: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/usr/bin/passwd: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/usr/bin/chpass: Operation not permitted rm: /usr/local/poudriere/data/.m/main-amd64-bulk_a-default/ref/../40/usr/bin/su: Operation
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 10:05, Alexander Motin wrote: > On 04.09.2023 11:45, Mark Millard wrote: >> On Sep 4, 2023, at 06:09, Alexander Motin wrote: >>> per_txg_dirty_frees_percent is directly related to the delete delays we see >>> here. You are forcing ZFS to commit transactions each 5% of dirty ARC >>> limit, which is 5% of 10% or memory size. I haven't looked on that code >>> recently, but I guess setting it too low can make ZFS commit transactions >>> too often, increasing write inflation for the underlying storage. I would >>> propose you to restore the default and try again. >> While this machine is different, the original problem was worse than >> the issue here: the load average was less than 1 for the most part >> the parallel bulk build when 30 was used. The fraction of time waiting >> was much longer than with 5. If I understand right, both too high and >> too low for a type of context can lead to increased elapsed time and >> getting it set to a near optimal is a non-obvious exploration. > > IIRC this limit was modified several times since originally implemented. May > be it could benefit from another look, if default 30% is not good. It would > be good if generic ZFS issues like this were reported to OpenZFS upstream to > be visible to a wider public. Unfortunately I have several other project I > must work on, so if it is not a regression I can't promise I'll take it right > now, so anybody else is welcome. As I understand, there are contexts were 5 is inappropriate and 30 works fairly well: no good single answer as to what value range will avoid problems. >> An overall point for the goal of my activity is: what makes a >> good test context for checking if ZFS is again safe to use? >> May be other tradeoffs make, say, 4 hardware threads more >> reasonable than 32. > > Thank you for your testing. The best test is one that nobody else run. It > also correlates with the topic of "safe to use", which also depends on what > it is used for. :) Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a non-debug FreeBSD build is suitable for the bulk -a -J128 test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in use) on the 32 hardware thread system. (The swap partition in use is from the normal environment's PCIe Optane media.) The %idle and the load averages and %user stayed reasonable in a preliminary test. One thing it does introduce is trim management (both available and potentially useful). (Optane media does not support or need it.) No per_txg_dirty_frees_percent adjustment involved (still 5). I've learned to not use ^T for fear of /bin/sh aborting and messing up poudriere's context. So I now monitor with: # poudriere status -b in a separate ssh session. I'll note that I doubt I'd try for a complete bulk -a . I'd probably stop it if I notice that the number of active builders drops off for a notable time (normal waiting for prerequisites appearing to be why). === Mark Millard marklmi at yahoo.com
Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned
On Sep 4, 2023, at 18:39, Mark Millard wrote: > On Sep 4, 2023, at 10:05, Alexander Motin wrote: > >> On 04.09.2023 11:45, Mark Millard wrote: >>> On Sep 4, 2023, at 06:09, Alexander Motin wrote: per_txg_dirty_frees_percent is directly related to the delete delays we see here. You are forcing ZFS to commit transactions each 5% of dirty ARC limit, which is 5% of 10% or memory size. I haven't looked on that code recently, but I guess setting it too low can make ZFS commit transactions too often, increasing write inflation for the underlying storage. I would propose you to restore the default and try again. >>> While this machine is different, the original problem was worse than >>> the issue here: the load average was less than 1 for the most part >>> the parallel bulk build when 30 was used. The fraction of time waiting >>> was much longer than with 5. If I understand right, both too high and >>> too low for a type of context can lead to increased elapsed time and >>> getting it set to a near optimal is a non-obvious exploration. >> >> IIRC this limit was modified several times since originally implemented. >> May be it could benefit from another look, if default 30% is not good. It >> would be good if generic ZFS issues like this were reported to OpenZFS >> upstream to be visible to a wider public. Unfortunately I have several >> other project I must work on, so if it is not a regression I can't promise >> I'll take it right now, so anybody else is welcome. > > As I understand, there are contexts were 5 is inappropriate > and 30 works fairly well: no good single answer as to what > value range will avoid problems. > >>> An overall point for the goal of my activity is: what makes a >>> good test context for checking if ZFS is again safe to use? >>> May be other tradeoffs make, say, 4 hardware threads more >>> reasonable than 32. >> >> Thank you for your testing. The best test is one that nobody else run. It >> also correlates with the topic of "safe to use", which also depends on what >> it is used for. :) > > Looks like use of a M.2 Samsung SSD 960 PRO 1TB with a > non-debug FreeBSD build is suitable for the bulk -a -J128 > test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in > use) on the 32 hardware thread system. (The swap partition > in use is from the normal environment's PCIe Optane media.) > The %idle and the load averages and %user stayed reasonable > in a preliminary test. One thing it does introduce is trim > management (both available and potentially useful). (Optane > media does not support or need it.) No > per_txg_dirty_frees_percent adjustment involved (still 5). > > I've learned to not use ^T for fear of /bin/sh aborting > and messing up poudriere's context. So I now monitor with: > > # poudriere status -b > > in a separate ssh session. > > I'll note that I doubt I'd try for a complete bulk -a . > I'd probably stop it if I notice that the number of > active builders drops off for a notable time (normal > waiting for prerequisites appearing to be why). > > So much for that idea. It has reached a state of staying under 3500 w/s and up to 4.5ms/w (normally above 2ms/w). %busy wondering in the range 85% to 101%. Lots of top showing tx->tx. There is some read and other activity as well. Of course the kBps figures are larger than the earlier USB3 context (larger kB figures). It reached about 1350 port->package builds over the first hour after the 2nd "Buildee started". autotrim is off. Doing a "zpool trim -w zamd64" leads to . . . larger w/s figures during the process. So more exploring to do at some point. Possibly: autotrim per_txg_dirty_frees_percent For now, I'm just running "zpool trim -w zamd64" once and a while so the process continues better. Still no evidence of deadlocks. No evidence of builds failing for corruptions. . . . At around the end of 2nd hour: 2920 or so built. Still no evidence of deadlocks. No evidence of builds failing for corruptions. . . . I've turned on autotrim without stopping the bulk build. . . . At around the end of 3rd hour: 4080 or so built. Still no evidence of deadlocks. No evidence of builds failing for corruptions. Looks like the % idle has been high for a significant time. I think I'll stop this specific test and clean things out. Looks like lang/guile* are examples of not respecting the lack of ALLOW_MAKE_JOBS use. Hmm. The ^C ended up with: ^C[03:41:07] Error: Signal SIGINT caught, cleaning up and exiting [main-amd64-bulk_a-default] [2023-09-04_17h49m28s] [sigint:] Queued: 34588 Built: 4331 Failed: 12Skipped: 303 Ignored: 335 Fetched: 0 Tobuild: 29607 Time: 03:41:05 [03:41:07] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-04_17h49m28s [03:41:19] [95] [00:06:59] Finished databases/php80-pdo_sqlite | php80-pdo_sqlite-8.0.30: Failed: build-depends [03:41:20] [60] [00:10:53] Finished www/php82-session | php82-session-8.2.9: S
Re: Possible issue with linux xattr support?
* Dmitry Chagin [20230904 21:49]: > On Mon, Sep 04, 2023 at 08:43:02PM +0200, Felix Palmen wrote: > > I guess I'll now do a full rebuild of my linuxulator userspace branch on > > a kernel with that patch, just to be sure it won't break anything else, > > this will take a few hours, I will report back. So far looking really > > good :) This completed while I was sleeping. No issues! With ~120 "Linux ports" building fine, using all sorts of tools (like python and some tools from coreutils), I think it's fair to assume it's all fixed now. Cheers, Felix -- Felix Palmen {private} fe...@palmen-it.de -- ports committer -- {web} http://palmen-it.de {pgp public key} http://palmen-it.de/pub.txt {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231 signature.asc Description: PGP signature
Re: 14.0-CURRENT boots fine but keyboard does not work
El día lunes, septiembre 04, 2023 a las 12:10:56p. m. -0600, Warner Losh escribió: > On Mon, Sep 4, 2023, 11:40 AM Michael Gmelin wrote: > > > This could also be related: > > > > > > https://cgit.freebsd.org/src/commit/?id=319d2bf407b3762da6f1c67ffe8dce2fee587aaf > > > > You could try to undo that patch and build a new kernel. > > > > It shouldn't make a difference... but I may have been given bad advice if > it did... if it's this one, I'll help sort it out. > > Warner I've reverted the above change, compiled the kernel, copied it to the USB key into /boot/kernel/kernel (only this file), removed /boot/loader.conf and the keyboard is fine. Thanks, Michael. Warner, do you need a new PR for this? matthias -- Matthias Apitz, ✉ g...@unixarea.de, http://www.unixarea.de/ +49-176-38902045 Public GnuPG key: http://www.unixarea.de/key.pub