On Fri, Jan 28, 2022, 4:33 AM Steven J. West <stevenjonw...@gmail.com>
wrote:

> Dear all,
>
> TL;DR/summary:
>
>    - Tuning vm.watermark_boost_factor to 0 (disable) on Debian
>    significantly improves performance on memory-intensive tasks that utilise
>    SWAP space, by stopping preemptive kswapd freeing of memory, and
>    subsequent page thrashing.
>    - I suggest that Debian should tune vm.watermark_boost_factor=0 by
>    default to prevent this problem
>
I'm not a Debian maintainer, but this has got to be the best problem
report I ever saw :-)

But for years I have adopted the philosophy at home which is demanded in
every data center I've worked in: If your Linux system is swapping, you
have configured it wrong. In the server farms there is no swapping. You
make sure you have enough RAM to prevent swapping. EOS.


> I have recently installed Debian 11 on a HP Z8 G4 Workstation (Z3Z16AV) -
> 32GB RAM, installed with ~120GB SWAP on a 2TB solid state drive (specs at
> end of this message).
>
> I have been running some compute-intensive image processing tasks (both
> CPU- and memory-intensive), which have on occasion had to dip into SWAP
> space, depending on image sizes (the processing I am running is image
> registration using elastix/transformix).
>
> I had benchmarked the code on my Ubuntu laptop (similar spec) without any
> problems, but when running on Debian, whenever SWAP was needed, the system
> processing significantly slowed down/essentially froze.
>
> After much debugging, I have traced this to the vm.watermark_boost_factor
> kernel parameter:
>
> Comparing the Ubuntu and Debian kernel parameters using sudo sysctl -a
> showed two key differences in virtual memory (vm) management parameters.
>
>    - Ubuntu:
>       - vm.swappiness=60
>       - vm.watermark_boost_factor=0
>    - Debian:
>       - vm.swappiness=10
>       - vm.watermark_boost_factor=150
>
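
Inline note: the comparison described above can be scripted instead of read by eye. A minimal Python sketch, assuming the output of sysctl -a from each machine has been captured as text. parse_sysctl and diff_sysctl are hypothetical helper names, and the sample strings contain only the four values reported above.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that carry no "key = value" pair
            params[key.strip()] = value.strip()
    return params


def diff_sysctl(a, b, prefix="vm."):
    """Return {key: (value_a, value_b)} for keys under `prefix` that differ."""
    keys = sorted(k for k in set(a) | set(b) if k.startswith(prefix))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Sample input: only the values quoted in the message, not a full dump.
ubuntu = parse_sysctl("vm.swappiness = 60\nvm.watermark_boost_factor = 0\n")
debian = parse_sysctl("vm.swappiness = 10\nvm.watermark_boost_factor = 150\n")

for key, (u, d) in diff_sysctl(ubuntu, debian).items():
    print(f"{key}: ubuntu={u} debian={d}")
```

On real dumps the same two functions would be fed the saved output of each machine; restricting to the vm. prefix keeps the diff focused on memory management.
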
>
> I identified what these two parameters control:
>
>
>    - vm.swappiness : a parameter used to calculate the swap tendency
>    (https://access.redhat.com/solutions/103833)
>    - vm.watermark_boost_factor : controls the level of reclaim when
>    memory is being fragmented. A boost factor of 0 will disable the feature.
>    (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.4_release_notes/kernel_parameters_changes)
>
>
> I changed swappiness and then watermark_boost_factor sequentially, to see
> whether tuning these parameters to match my Ubuntu system prevented the
> system from freezing under my memory-intensive task.
>
>
>    - sudo sysctl vm.swappiness=60 on my Debian system did not prevent the
>    freezing behaviour.
>    - sudo sysctl vm.watermark_boost_factor=0 (disabling it) on my Debian
>    system prevented the freezing behaviour.
>
>
> I then set these permanently by adding the following to /etc/sysctl.conf:
>
> vm.swappiness=60
> vm.watermark_boost_factor=0
>
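
Inline note: after a reboot it is worth confirming that the running kernel actually picked up the persisted values. A minimal Python sketch, assuming the usual mapping of a sysctl name to a file under /proc/sys (dots become path separators). sysctl_path, read_sysctl and check are hypothetical helper names, not part of any tool.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as 'vm.swappiness' to its file under `root`."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def check(expected, root="/proc/sys"):
    """Return {name: (expected, actual)} for every setting that does not match."""
    mismatches = {}
    for name, want in expected.items():
        try:
            got = read_sysctl(name, root)
        except OSError:
            got = None  # file missing or unreadable on this system
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


if __name__ == "__main__":
    # The two settings from the message above.
    wanted = {"vm.swappiness": "60", "vm.watermark_boost_factor": "0"}
    for name, (want, got) in check(wanted).items():
        print(f"{name}: expected {want}, running value is {got}")
```

The root parameter exists only so the helpers can be pointed at a scratch directory when trying them out; on a live system the default is what you want.
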
>
> Further searching revealed this Ubuntu bug report:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1861359
>
> Title: "swap storms kills interactive use"
>
> With this key entry:
>
> Sultan Alsawaf (kerneltoast) wrote on 2020-03-27: #56
>
> This problem is caused by an upstream memory management feature called
> watermark boosting. Normally, when a memory allocation fails and falls back
> to the page allocator, the page allocator will wake up kswapd to free up
> pages in order to make the memory allocation succeed. kswapd tries to free
> memory until it reaches a minimum amount of memory for each memory zone
> called the high watermark.
>
> What watermark boosting does is try to preemptively fire up kswapd to free
> memory when there hasn't been an allocation failure. It does this by
> increasing kswapd's high watermark goal and then firing up kswapd. The
> reason why this causes freezes is because, with the increased high
> watermark goal, kswapd will steal memory from processes that need it in
> order to make forward progress. These processes will, in turn, try to
> allocate memory again, which will cause kswapd to steal necessary pages
> from those processes again, in a positive feedback loop known as page
> thrashing. When page thrashing occurs, your system is essentially
> livelocked until the necessary forward progress can be made to stop
> processes from trying to continuously allocate memory and trigger kswapd to
> steal it back.
>
> This problem already occurs with kswapd *without* watermark boosting, but
> it's usually only encountered on machines with a small amount of memory
> and/or a slow CPU. Watermark boosting just makes the existing problem worse
> enough to notice on higher spec'd machines.
>
> To fix the issue in this bug, watermark boosting can be disabled with the
> following:
> # echo 0 > /proc/sys/vm/watermark_boost_factor
>
> There's really no harm in doing so, because watermark boosting is an
> inherently broken feature...
>
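
Inline note: to get a feel for how far the boost can raise kswapd's goal, here is a rough Python sketch. It assumes the unit described in the kernel's sysctl documentation, where watermark_boost_factor is expressed in fractions of 10,000, so the upstream default of 15,000 allows up to 150% of the high watermark to be reclaimed. This is a simplified model of the upper bound only, not the kernel's actual code path; max_boost_pages is a hypothetical name and the 10,000-page high watermark is a made-up figure for illustration.

```python
def max_boost_pages(high_watermark_pages, boost_factor):
    """Upper bound on the watermark boost, in pages.

    Simplified model: high watermark * boost_factor / 10000, using integer
    arithmetic. The kernel applies the boost in steps and with extra
    conditions that are not modelled here.
    """
    return high_watermark_pages * boost_factor // 10000


# Illustrative zone with a high watermark of 10,000 pages (assumed value).
high = 10_000
for factor in (0, 15_000):
    print(f"boost_factor={factor}: at most {max_boost_pages(high, factor)} extra pages")
```

With a factor of 0 the bound is 0 pages, which matches the statement above that 0 disables the feature.
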
>
> So essentially, disabling watermark_boost_factor ensures effective
> swapping and reduces page thrashing.
>
> *I therefore suggest that Debian should
> tune vm.watermark_boost_factor=0 by default.*
>
> Cheers,
>
> Steve.
>
>
> Below are some more detailed specs of my Debian machine for reference:
>
>
>   $ uname -a
> Linux panseer 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
>
>
>   $ lscpu
> Architecture:                    x86_64
> CPU op-mode(s):                  32-bit, 64-bit
> Byte Order:                      Little Endian
> Address sizes:                   46 bits physical, 48 bits virtual
> CPU(s):                          20
> On-line CPU(s) list:             0-19
> Thread(s) per core:              2
> Core(s) per socket:              10
> Socket(s):                       1
> NUMA node(s):                    1
> Vendor ID:                       GenuineIntel
> CPU family:                      6
> Model:                           85
> Model name:                      Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
> Stepping:                        7
> CPU MHz:                         2511.149
> CPU max MHz:                     3200.0000
> CPU min MHz:                     1000.0000
> BogoMIPS:                        4800.00
> Virtualization:                  VT-x
> L1d cache:                       320 KiB
> L1i cache:                       320 KiB
> L2 cache:                        10 MiB
> L3 cache:                        13.8 MiB
> NUMA node0 CPU(s):               0-19
> Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
> Vulnerability L1tf:              Not affected
> Vulnerability Mds:               Not affected
> Vulnerability Meltdown:          Not affected
> Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
> Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
> Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
> Vulnerability Srbds:             Not affected
> Vulnerability Tsx async abort:   Mitigation; TSX disabled
> Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep
>                                  mtrr pge mca cmov pat pse36 clflush dts acpi
>                                  mmx fxsr sse sse2 ss ht tm pbe syscall nx
>                                  pdpe1gb rdtscp lm constant_tsc art
>                                  arch_perfmon pebs bts rep_good nopl xtopology
>                                  nonstop_tsc cpuid aperfmperf pni pclmulqdq
>                                  dtes64 monitor ds_cpl vmx smx est tm2 ssse3
>                                  sdbg fma cx16 xtpr pdcm pcid dca sse4_1
>                                  sse4_2 x2apic movbe popcnt
>                                  tsc_deadline_timer aes xsave avx f16c rdrand
>                                  lahf_lm abm 3dnowprefetch cpuid_fault epb
>                                  cat_l3 cdp_l3 invpcid_single intel_ppin ssbd
>                                  mba ibrs ibpb stibp ibrs_enhanced tpr_shadow
>                                  vnmi flexpriority ept vpid ept_ad fsgsbase
>                                  tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
>                                  cqm mpx rdt_a avx512f avx512dq rdseed adx
>                                  smap clflushopt clwb intel_pt avx512cd
>                                  avx512bw avx512vl xsaveopt xsavec xgetbv1
>                                  xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>                                  cqm_mbm_local dtherm ida arat pln pts hwp
>                                  hwp_act_window hwp_epp hwp_pkg_req pku ospke
>                                  avx512_vnni md_clear flush_l1d
>                                  arch_capabilities
>
>
>   $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:            31Gi       3.6Gi        24Gi       160Mi       3.2Gi        26Gi
> Swap:          119Gi       242Mi       118Gi
>
>
>
>
> Steven J. West
>                  BSc DPhil FRMS
> _________________________________
> International Brain Lab Histology Research Fellow
> Sainsbury Wellcome Centre for Neural Circuits and Behaviour
> University College London
> 25 Howland St, Fitzrovia, London W1T 4JG
> +44 (0) 203 108 8197
> steven.w...@internationalbrainlab.org
> https://www.internationalbrainlab.com/
> https://www.sainsburywellcome.org/
>
>
>
