mikemccand commented on issue #15662:
URL: https://github.com/apache/lucene/issues/15662#issuecomment-3946125768

   `beast3` benchmarking lives!!!
   
   Yesterday's end-to-end benchmark run finally succeeded again, on downgraded 
packages, downgraded JDK (25.0.1), recent Lucene sources ([only 77 Lucene 
changes](https://github.com/apache/lucene/compare/2f9aa8ae26d6c1087884c734e1b3d137bd8c6601...338a79181f0347ce7ba39e0210341c38afbfdbe9)
 since previous successful benchy run, heh).
   
   The results are not yet trustworthy -- I have `FCLK` mis-configured on the 
current `beast3` boot -- I'll fix that, re-update the box to the latest Arch 
Linux, and get benchy running again each night, and pray that in those 77 Lucene 
changes, or the Arch Linux package changes, there is not another regression.
   
   The smoking gun was the [CPU 
governor](https://wiki.archlinux.org/title/CPU_frequency_scaling) mixed with 
too-old BIOS!
   
   Somehow the governor switched somewhere in that Jan 22 - 29 window, but then 
the driver (that actually interacts w/ the CPU cores to read/write 
targets/limits) `amd_pstate` was unable to interact with the too-old BIOS -- 
all errors trying to query each CPU's capabilities -- so it fell back to 
godawful slow safe defaults.
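   
   To make the governor/driver split concrete, here is a minimal sketch (not 
benchy's actual code) that counts which scaling driver and governor each CPU 
reports. It assumes the standard Linux cpufreq sysfs layout under 
`/sys/devices/system/cpu`; the function name `cpufreq_summary` is hypothetical, 
and the exact strings reported will vary by kernel and driver.

```python
from collections import Counter
from pathlib import Path


def cpufreq_summary(sysfs_cpu="/sys/devices/system/cpu"):
    """Count the scaling driver and governor reported by each CPU.

    Assumes the usual cpufreq sysfs layout (cpuN/cpufreq/scaling_driver and
    cpuN/cpufreq/scaling_governor); files that cannot be read are counted
    under "<unreadable>" rather than raising.
    """
    drivers, governors = Counter(), Counter()
    for policy in sorted(Path(sysfs_cpu).glob("cpu[0-9]*/cpufreq")):
        for name, counter in (("scaling_driver", drivers),
                              ("scaling_governor", governors)):
            try:
                counter[(policy / name).read_text().strip()] += 1
            except OSError:
                counter["<unreadable>"] += 1
    return {"drivers": dict(drivers), "governors": dict(governors)}
```

   Calling `cpufreq_summary()` on a healthy box should show one driver and one 
governor across all cores; a mix, or an unexpected driver, is a hint that 
something changed underneath you.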
   
   Annoyingly that CPU governor change stuck even with attempted whole system 
downgrades.  Claude was great fun in iterating theories, testing them, teaching 
me all sorts of wild Linux tooling to inspect every last detail about your 
hardware ([`turbostat`](https://archlinux.org/packages/?name=turbostat), 
[`cpupower`](https://archlinux.org/packages/?name=cpupower), `/sys/devices/*`, 
[`mcelog`](https://mcelog.org/), 
[`decode-dimms`](https://man.archlinux.org/man/decode-dimms.1.en), 
[`numactl`](https://man.archlinux.org/man/numactl.8.en), 
[`dmidecode`](https://man.archlinux.org/man/dmidecode.8.en), 
[`htop`](https://man.archlinux.org/man/htop.1.en), 
[`btop`](https://github.com/aristocratos/btop), 
[`s-tui`](https://github.com/amanusk/s-tui) (<-- phew this was able to get all 
128 cores maxed out!!  oh the amps of DC going into the CPU... sheesh.  nothing 
seemed to melt.),  [`nvme`](https://man.archlinux.org/man/nvme.1), ...).
   
   Claude does a pretty good job understanding photos -- so I would boot to 
BIOS, take pictures for Claude, Claude would tell me which setting to fix / 
dive into next / where.  I took pictures of my hardware and it told me which 
components they were, e.g. the pump for the water cooler, the open case/frame.  
See the blow-by-blow with Claude: 
[here](https://claude.ai/share/dae1030c-0ecb-491b-8166-f391334ffec9), 
[here](https://claude.ai/share/dba21376-c29e-4526-a597-7a4ba9d1e5d3), 
[here](https://claude.ai/share/0f96e528-bf2e-4fdd-a096-c3575dcd94ca), 
[here](https://claude.ai/share/01461d59-34c8-4b42-96ec-6ffdea27b6d2), 
[here](https://claude.ai/share/4eb4b79f-e54a-480c-9942-4e338290c915) (sheesh 
there are more, I'll stop).
   
   I made these changes:
     * Upgraded to a modern BIOS; `amd_pstate_epp` is able to talk to the CPU 
cores now
     * Governor is now wired to performance, boost is enabled/active
     * I turned on all fans to max (there was a handy switch on the 
motherboard).  Thermal throttling was never happening (not logged anyways), but 
some temps were hot, so ... also added an external house fan for good measure
     * Discovered, insanely, that I failed to pull the plastic off the 
thermal-paste inside the motherboard's cover for the NVMe drives, sheesh.  It 
didn't cause problems (no thermal throttling) but made the NVMe SSDs run hot 
(though they are not holding the index -- that's the Intel Optane PCIe card)
     * Also discovered I had not plugged in additional power for PCIe -- it's 
likely that doesn't matter -- the two extra power motherboard plugs for CPU 
power are plugged in.  Still, Claude thinks it's possible my power supply is 
under-spec'd ... I'll swap in an upgrade and see if it moves the needle ... 
unlikely
   
   I plan to add additional logging to benchy's nightly artifacts to monitor 
CPU freqs / turbo too, and add more health metrics for statuscake to help me 
watch.
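   
   A rough sketch of what such a frequency/boost log line could look like, 
assuming the standard cpufreq sysfs files (`scaling_cur_freq`, in kHz) and a 
global `cpufreq/boost` file, which only exists with some drivers and kernels. 
The names `freq_snapshot` and `read_khz` are hypothetical, and none of this is 
benchy's real code.

```python
import json
import statistics
import time
from pathlib import Path


def read_khz(path):
    """Return the integer kHz value in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def freq_snapshot(sysfs_cpu="/sys/devices/system/cpu"):
    """Summarize current per-CPU frequencies (MHz) and the boost flag."""
    root = Path(sysfs_cpu)
    freqs = []
    for path in root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        khz = read_khz(path)
        if khz is not None:
            freqs.append(khz)

    # Assumption: a global boost toggle lives here; it is absent on some setups.
    boost = None
    try:
        boost = (root / "cpufreq" / "boost").read_text().strip()
    except OSError:
        pass

    snap = {"time": time.time(), "cpus": len(freqs), "boost": boost}
    if freqs:
        snap["min_mhz"] = min(freqs) / 1000
        snap["median_mhz"] = statistics.median(freqs) / 1000
        snap["max_mhz"] = max(freqs) / 1000
    return snap


def snapshot_line(sysfs_cpu="/sys/devices/system/cpu"):
    """Render one snapshot as a single JSON line, suitable for appending to a log."""
    return json.dumps(freq_snapshot(sysfs_cpu), sort_keys=True)
```

   Appending `snapshot_line()` to a log before and after each nightly run would 
make a stuck-at-slow-defaults box show up as a low `max_mhz`.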


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
