On Sep 3, 2023, at 22:06, Alexander Motin <m...@freebsd.org> wrote:

> On 03.09.2023 22:54, Mark Millard wrote:
>> After that ^t produced the likes of:
>>
>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
>
> So the full state is not "tx->tx", but is actually
> "tx->tx_quiesce_done_cv", which means the thread is waiting for a new
> transaction to be opened, which means some previous one has to be
> quiesced and then synced.
>
>> #0 0xffffffff80b6f103 at mi_switch+0x173
>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104
>> #2 0xffffffff80aec4c5 at _cv_wait+0x165
>> #3 0xffffffff82aba365 at txg_wait_open+0xf5
>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151
>
> Here it seems like the transaction commit is being waited on due to a
> large amount of delete operations, which ZFS tries to spread between
> separate TXGs.
That fit the context: cleaning out /usr/local/poudriere/data/.m/

> You should probably see some large and growing number in sysctl
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .

After the reboot I started a -J64 example. It has avoided the early
"witness exhausted". Again I ^C'd about an hour after the 2nd builder
had started. So: again cleaning out /usr/local/poudriere/data/.m/

Only seconds between each of:

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027

I have found a measure of progress: zfs list's USED for
/usr/local/poudriere/data/.m is decreasing. So ztop's d/s was a good
classification: deletes.
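(For reference: something like the following untested sketch could
automate that kind of sampling and report how much the counter grows per
interval. The 1 second interval is just a placeholder I picked.)

#!/bin/sh
# Untested sketch: report per-interval growth of the frees-delay counter.
name=kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
prev=$(sysctl -n "$name")
while sleep 1; do
    cur=$(sysctl -n "$name")
    echo "$(date '+%T') ${name}: ${cur} (+$((cur - prev)))"
    prev=$cur
done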
>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72
>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9
>> #9 0xffffffff80c5af7f at vrecycle+0x3f
>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
>> #11 0xffffffff80c598cf at vinactivef+0xbf
>> #12 0xffffffff80c590da at vput_final+0x2aa
>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6
>> #14 0xffffffff80c68588 at sys_unlink+0x28
>> #15 0xffffffff8106323f at amd64_syscall+0x14f
>> #16 0xffffffff8103512b at fast_syscall_common+0xf8
>
> What we don't see here is what the quiesce and sync threads of the pool
> are actually doing. The sync thread has plenty of different jobs,
> including async write, async destroy, scrub and others, that all may
> delay each other.
>
> Before you rebooted the system, depending on how alive it is, could you
> save a number of outputs of `procstat -akk`, or at least specifically
> `procstat -akk | grep txg_thread_enter` if the full output is hard to
> get? Or somehow else observe what they are doing.

# procstat -akk > ~/mmjnk00.txt
# procstat -akk > ~/mmjnk01.txt
# procstat -akk > ~/mmjnk02.txt
# procstat -akk > ~/mmjnk03.txt
# procstat -akk > ~/mmjnk04.txt
# procstat -akk > ~/mmjnk05.txt
# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk00.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100881 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt: 6 100882 zfskern txg_thread_enter mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe

(Hopefully that will be a sufficiently useful start.)
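(If more samples would help, a loop like the following untested sketch
could keep collecting them while the cleanup is stuck; the file names
and the 10 second interval are placeholders of mine.)

#!/bin/sh
# Untested sketch: snapshot "procstat -akk" periodically and also keep a
# running log of just the txg thread lines.
i=0
while :; do
    snap="$HOME/procstat.$(printf %02d "$i").txt"
    procstat -akk > "$snap"
    { date; grep txg_thread_enter "$snap"; } >> "$HOME/procstat.txg.txt"
    i=$((i + 1))
    sleep 10
done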
> `zpool status`, `zpool get all` and `sysctl -a` would also not harm.

It is a very simple zpool configuration: one partition. I only use it
for bectl BE reasons, not for the general range of reasons for using
zfs. I created the media with my normal content, then checkpointed
before doing the git fetch that started setting up the experiment.

# zpool status
  pool: zamd64
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not
        support the features. See zpool-features(7) for details.
checkpoint: created Sun Sep 3 11:46:54 2023, consumes 2.17M
config:

        NAME            STATE     READ WRITE CKSUM
        zamd64          ONLINE       0     0     0
          gpt/amd64zfs  ONLINE       0     0     0

errors: No known data errors

There was also a snapshot in place before I did the checkpoint
operation. I deliberately did not use my usual
compatibility=openzfs-2.1-freebsd setting but instead used the defaults
when creating the pool:

# zpool get all
NAME    PROPERTY                       VALUE                   SOURCE
zamd64  size                           872G                    -
zamd64  capacity                       21%                     -
zamd64  altroot                        -                       default
zamd64  health                         ONLINE                  -
zamd64  guid                           4817074778276814820     -
zamd64  version                        -                       default
zamd64  bootfs                         zamd64/ROOT/main-amd64  local
zamd64  delegation                     on                      default
zamd64  autoreplace                    off                     default
zamd64  cachefile                      -                       default
zamd64  failmode                       wait                    default
zamd64  listsnapshots                  off                     default
zamd64  autoexpand                     off                     default
zamd64  dedupratio                     1.00x                   -
zamd64  free                           688G                    -
zamd64  allocated                      184G                    -
zamd64  readonly                       off                     -
zamd64  ashift                         0                       default
zamd64  comment                        -                       default
zamd64  expandsize                     -                       -
zamd64  freeing                        0                       -
zamd64  fragmentation                  17%                     -
zamd64  leaked                         0                       -
zamd64  multihost                      off                     default
zamd64  checkpoint                     2.17M                   -
zamd64  load_guid                      17719601284614388220    -
zamd64  autotrim                       off                     default
zamd64  compatibility                  off                     default
zamd64  bcloneused                     0                       -
zamd64  bclonesaved                    0                       -
zamd64  bcloneratio                    1.00x                   -
zamd64  feature@async_destroy          enabled                 local
zamd64  feature@empty_bpobj            active                  local
zamd64  feature@lz4_compress           active                  local
zamd64  feature@multi_vdev_crash_dump  enabled                 local
zamd64  feature@spacemap_histogram     active                  local
zamd64  feature@enabled_txg            active                  local
zamd64  feature@hole_birth             active                  local
zamd64  feature@extensible_dataset     active                  local
zamd64  feature@embedded_data          active                  local
zamd64  feature@bookmarks              enabled                 local
zamd64  feature@filesystem_limits      enabled                 local
zamd64  feature@large_blocks           enabled                 local
zamd64  feature@large_dnode            enabled                 local
zamd64  feature@sha512                 enabled                 local
zamd64  feature@skein                  enabled                 local
zamd64  feature@edonr                  enabled                 local
zamd64  feature@userobj_accounting     active                  local
zamd64  feature@encryption             enabled                 local
zamd64  feature@project_quota          active                  local
zamd64  feature@device_removal         enabled                 local
zamd64  feature@obsolete_counts        enabled                 local
zamd64  feature@zpool_checkpoint       active                  local
zamd64  feature@spacemap_v2            active                  local
zamd64  feature@allocation_classes     enabled                 local
zamd64  feature@resilver_defer         enabled                 local
zamd64  feature@bookmark_v2            enabled                 local
zamd64  feature@redaction_bookmarks    enabled                 local
zamd64  feature@redacted_datasets      enabled                 local
zamd64  feature@bookmark_written       enabled                 local
zamd64  feature@log_spacemap           active                  local
zamd64  feature@livelist               enabled                 local
zamd64  feature@device_rebuild         enabled                 local
zamd64  feature@zstd_compress          enabled                 local
zamd64  feature@draid                  enabled                 local
zamd64  feature@zilsaxattr             active                  local
zamd64  feature@head_errlog            active                  local
zamd64  feature@blake3                 enabled                 local
zamd64  feature@block_cloning          enabled                 local
zamd64  feature@vdev_zaps_v2           active                  local
zamd64  feature@redaction_list_spill   disabled                local

/etc/sysctl.conf does have:

vfs.zfs.min_auto_ashift=12
vfs.zfs.per_txg_dirty_frees_percent=5

The vfs.zfs.per_txg_dirty_frees_percent setting is from prior help from
Mateusz Guzik, where after testing the change I reported:

Result summary: Seems to have avoided the sustained periods of low load
average activity. Much better for the context.

But that was for a different machine (aarch64, 8 cores), though it too
was for poudriere bulk use. It turned out that the default of 30 was
causing something much like what is seen here; I could have presented
some of that information via the small load average figures as well.
(Note: 5 is the old default; 30 is the newer one. Other contexts have
other problems with 5: there is no single right setting and no automated
configuration.)
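(For anyone wanting to experiment with that tunable: as far as I recall
it can also be read and changed on a running system, roughly as below;
the 5 just mirrors my /etc/sysctl.conf entry, which is what makes the
value persist across boots.)

# sysctl vfs.zfs.per_txg_dirty_frees_percent
# sysctl vfs.zfs.per_txg_dirty_frees_percent=5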
Other than those 2 items, zfs is untuned (defaults).

sysctl -a is a lot more output (864930 bytes), so I'll skip it for now.

> PS: I may be wrong, but the USB in "USB3 NVMe SSD storage" makes me
> shiver. Make sure there are no storage problems, like some huge delays,
> timeouts, etc., that can be seen, for example, as busy percents
> regularly spiking far above 100% in your `gstat -spod`.

The "gstat -spod" output showed (and shows) around 0.8 ms/w to 3 ms/w,
mostly at the lower end of the range, under 98 %busy, with no spikes
above 100%.

It is a previously unused Samsung PSSD T7 Touch. I was not prepared to
replace the content of a PCIe slot's media or an M.2 connection's media
for this temporary purpose: I have no spare supply for those, so no
simple swapping.

===
Mark Millard
marklmi at yahoo.com