On Sep 3, 2023, at 23:35, Mark Millard <mark...@yahoo.com> wrote: > On Sep 3, 2023, at 22:06, Alexander Motin <m...@freebsd.org> wrote: > >> >> On 03.09.2023 22:54, Mark Millard wrote: >>> After that ^t produced the likes of: >>> load: 6.39 cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s >>> 1% 13004k >> >> So the full state is not "tx->tx", but is actually a >> "tx->tx_quiesce_done_cv", which means the thread is waiting for new >> transaction to be opened, which means some previous to be quiesced and then >> synced. >> >>> #0 0xffffffff80b6f103 at mi_switch+0x173 >>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104 >>> #2 0xffffffff80aec4c5 at _cv_wait+0x165 >>> #3 0xffffffff82aba365 at txg_wait_open+0xf5 >>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151 >> >> Here it seems like transaction commit is waited due to large amount of >> delete operations, which ZFS tries to spread between separate TXGs. > > That fit the context: cleaning out /usr/local/poudriere/data/.m/ > >> You should probably see some large and growing number in sysctl >> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay . > > After the reboot I started a -J64 example. It has avoided the > early "witness exhausted". Again I ^C'd after about an hours > after the 2nd builder had started. So: again cleaning out > /usr/local/poudriere/data/.m/ Only seconds between: > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323 > > # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay > kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027 > > I have found a measure of progress: zfs list's USED > for /usr/local/poudriere/data/.m is decreasing. So > ztop's d/s was a good classification: deletes. > >>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72 >>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d >>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35 >>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9 >>> #9 0xffffffff80c5af7f at vrecycle+0x3f >>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e >>> #11 0xffffffff80c598cf at vinactivef+0xbf >>> #12 0xffffffff80c590da at vput_final+0x2aa >>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6 >>> #14 0xffffffff80c68588 at sys_unlink+0x28 >>> #15 0xffffffff8106323f at amd64_syscall+0x14f >>> #16 0xffffffff8103512b at fast_syscall_common+0xf8 >> >> What we don't see here is what quiesce and sync threads of the pool are >> actually doing. Sync thread has plenty of different jobs, including async >> write, async destroy, scrub and others, that all may delay each other. >> >> Before you rebooted the system, depending how alive it is, could you save a >> number of outputs of `procstat -akk`, or at least specifically `procstat >> -akk | grep txg_thread_enter` if the full is hard? Or somehow else observe >> what they are doing. > > # procstat -akk > ~/mmjnk00.txt > # procstat -akk > ~/mmjnk01.txt > # procstat -akk > ~/mmjnk02.txt > # procstat -akk > ~/mmjnk03.txt > # procstat -akk > ~/mmjnk04.txt > # procstat -akk > ~/mmjnk05.txt > # grep txg_thread_enter ~/mmjnk0[0-5].txt > /usr/home/root/mmjnk00.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk00.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk01.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk02.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk03.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk03.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk04.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk04.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk05.txt: 6 100881 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb > txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe > /usr/home/root/mmjnk05.txt: 6 100882 zfskern txg_thread_enter > mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b > _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 > txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe > > (Hopefully that will be a sufficiently useful start.) > >> `zpool status`, `zpool get all` and `sysctl -a` would also not harm. > > It is a very simple zpool configuration: one partition. > I only use it for bectl BE reasons, not the general > range of reasons for using zfs. I created the media with > my normal content, then checkpointed before doing the > git fetch to start to set up the experiment. > > # zpool status > pool: zamd64 > state: ONLINE > status: Some supported and requested features are not enabled on the pool. > The pool can still be used, but some features are unavailable. > action: Enable all features using 'zpool upgrade'. Once this is done, > the pool may no longer be accessible by software that does not support > the features. See zpool-features(7) for details. > checkpoint: created Sun Sep 3 11:46:54 2023, consumes 2.17M > config: > > NAME STATE READ WRITE CKSUM > zamd64 ONLINE 0 0 0 > gpt/amd64zfs ONLINE 0 0 0 > > errors: No known data errors > > There was also a snapshot in place before I did the > checkpoint operation. > > I deliberately did not use my typical openzfs-2.1-freebsd > for compatibility but used defaults when creating the pool: > > # zpool get all > NAME PROPERTY VALUE SOURCE > zamd64 size 872G - > zamd64 capacity 21% - > zamd64 altroot - default > zamd64 health ONLINE - > zamd64 guid 4817074778276814820 - > zamd64 version - default > zamd64 bootfs zamd64/ROOT/main-amd64 local > zamd64 delegation on default > zamd64 autoreplace off default > zamd64 cachefile - default > zamd64 failmode wait default > zamd64 listsnapshots off default > zamd64 autoexpand off default > zamd64 dedupratio 1.00x - > zamd64 free 688G - > zamd64 allocated 184G - > zamd64 readonly off - > zamd64 ashift 0 default > zamd64 comment - default > zamd64 expandsize - - > zamd64 freeing 0 - > zamd64 fragmentation 17% - > zamd64 leaked 0 - > zamd64 multihost off default > zamd64 checkpoint 2.17M - > zamd64 load_guid 17719601284614388220 - > zamd64 autotrim off default > zamd64 compatibility off default > zamd64 bcloneused 0 - > zamd64 bclonesaved 0 - > zamd64 bcloneratio 1.00x - > zamd64 feature@async_destroy enabled local > zamd64 feature@empty_bpobj active local > zamd64 feature@lz4_compress active local > zamd64 feature@multi_vdev_crash_dump enabled local > zamd64 feature@spacemap_histogram active local > zamd64 feature@enabled_txg active local > zamd64 feature@hole_birth active local > zamd64 feature@extensible_dataset active local > zamd64 feature@embedded_data active local > zamd64 feature@bookmarks enabled local > zamd64 feature@filesystem_limits enabled local > zamd64 feature@large_blocks enabled local > zamd64 feature@large_dnode enabled local > zamd64 feature@sha512 enabled local > zamd64 feature@skein enabled local > zamd64 feature@edonr enabled local > zamd64 feature@userobj_accounting active local > zamd64 feature@encryption enabled local > zamd64 feature@project_quota active local > zamd64 feature@device_removal enabled local > zamd64 feature@obsolete_counts enabled local > zamd64 feature@zpool_checkpoint active local > zamd64 feature@spacemap_v2 active local > zamd64 feature@allocation_classes enabled local > zamd64 feature@resilver_defer enabled local > zamd64 feature@bookmark_v2 enabled local > zamd64 feature@redaction_bookmarks enabled local > zamd64 feature@redacted_datasets enabled local > zamd64 feature@bookmark_written enabled local > zamd64 feature@log_spacemap active local > zamd64 feature@livelist enabled local > zamd64 feature@device_rebuild enabled local > zamd64 feature@zstd_compress enabled local > zamd64 feature@draid enabled local > zamd64 feature@zilsaxattr active local > zamd64 feature@head_errlog active local > zamd64 feature@blake3 enabled local > zamd64 feature@block_cloning enabled local > zamd64 feature@vdev_zaps_v2 active local > zamd64 feature@redaction_list_spill disabled local > > /etc/sysctl.conf does have: > > vfs.zfs.min_auto_ashift=12 > vfs.zfs.per_txg_dirty_frees_percent=5 > > The vfs.zfs.per_txg_dirty_frees_percent is from prior > Mateusz Guzik help, where after testing the change I > reported: > > Result summary: Seems to have avoided the sustained periods > of low load average activity. Much better for the context. > > But it was for a different machine (aarch64, 8 cores). But > it was for poudriere bulk use. > > Turns out the default of 30 was causing sort of like > what is seen here: I could have presented some of the > information via the small load average figures here. > > (Note: 5 is the old default, 30 is newer. Other contexts > have other problems with 5: no single right setting and > no automated configuration.) > > Other than those 2 items, zfs is untuned (defaults). > > sysctl -a is a lot more output (864930 Bytes) so I'll skip > it for now. > >> PS: I may be wrong, but USB in "USB3 NVMe SSD storage" makes me shiver. Make >> sure there is no storage problems, like some huge delays, timeouts, etc, >> that can be seen, for example, as busy percents regularly spiking far above >> 100% in your `gstat -spod`. >> > > The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w, > mostly at the lower end of the range. < 98%busy, no spikes to > 100%. > It is a previously unused Samsung PSSD T7 Touch.
A little more context here: that is for the "kB" figures seen during the cleanup/delete activity. During port builds into packages larger "kB" figures are seen and the ms/w figures will tend to be larger as well. The larger sizes can also lead to reaching somewhat above 100 %busy some of the time. I'll also note that I've ended up doing a lot more write activity exploring than I'd expected. > I was not prepared to replace the content of a PCIe slot's media > or M.2 connection's media for the temporary purpose. No spare > supply for those so no simple swapping for those. === Mark Millard marklmi at yahoo.com