Hi,

On 2023-11-15 10:09:06 -0500, Tom Lane wrote:
> "Anton A. Melnikov" <a.melni...@postgrespro.ru> writes:
> > I can't understand why i get the opposite results on my pc and on the
> > server. It is clear that the absolute TPS values will be different for
> > various configurations. This is normal. But differences? Is it unlikely
> > that some kind of reference configuration is needed to accurately
> > measure the difference in performance. Probably something wrong with my
> > pc, but now i can not figure out what's wrong.
>
> > Would be very grateful for any advice or comments to clarify this problem.
>
> Benchmarking is hard :-(.
Indeed.

> IME it's absolutely typical to see variations of a couple of percent even
> when "nothing has changed", for example after modifying some code that's
> nowhere near any hot code path for the test case. I usually attribute this
> to cache effects, such as a couple of bits of hot code now sharing or not
> sharing a cache line.

FWIW, I think we're overusing that explanation in our community. Of course
you can encounter things like this, but the replacement policies of cpu
caches have gotten a lot better and the caches have gotten bigger too.

IME this kind of thing is typically dwarfed by much bigger variations from
things like

- cpu scheduling - whether the relevant pgbench thread is colocated on the
  same core as the relevant backend can make a huge difference, particularly
  when CPU power saving modes are not disabled (see the command sketch below
  this list). Just looking at tps from a fully cached readonly pgbench, with
  a single client:

    Power savings enabled, same core:       37493
    Power savings enabled, different core:  28539
    Power savings disabled, same core:      38167
    Power savings disabled, different core: 37365

- can transparent huge pages be used for the executable mapping, or not

  On newer kernels Linux (and some filesystems) can use huge pages for the
  executable. To what degree that succeeds is a large factor in performance.
  Single threaded read-only pgbench:

    postgres mapped without huge pages: 37155 TPS
    with 2MB of postgres as huge pages: 37695 TPS
    with 6MB of postgres as huge pages: 42733 TPS

  The really annoying thing about this is that it's entirely unpredictable
  whether huge pages are used or not. Building the same way, sometimes 0,
  sometimes 2MB, sometimes 6MB are mapped huge, even though the on-disk
  contents are precisely the same. And it can even change without
  rebuilding, if the binary is evicted from the page cache.

  This alone makes benchmarking extremely annoying. It basically can't be
  controlled and has huge effects.

- How long ago the server was started

  If e.g. one run of your benchmark happens on the first connection to a
  database, but after a restart it doesn't (e.g. because autovacuum starts
  up beforehand), you can get a fairly different memory layout and cache
  situation, due to [not] using the relcache init file - if it isn't used,
  you end up with a populated catcache, otherwise you don't.

  Another mean one is whether you start your benchmark within a relatively
  short time of the server starting. Readonly pgbench with a single client,
  started immediately after the server:

    progress: 12.0 s, 37784.4 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 13.0 s, 37779.6 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 14.0 s, 37668.2 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 15.0 s, 32133.0 tps, lat 0.031 ms stddev 0.113, 0 failed
    progress: 16.0 s, 37564.9 tps, lat 0.027 ms stddev 0.012, 0 failed
    progress: 17.0 s, 37731.7 tps, lat 0.026 ms stddev 0.001, 0 failed

  There's a dip at 15s - odd. Turns out that's due to bgwriter writing a WAL
  record, which triggers walwriter to write it out and then initialize the
  whole of WAL buffers with 0s; that happens once. In this case I've
  exaggerated the effect a bit by using a 1GB wal_buffers, but it's visible
  otherwise too. Whether your benchmark period includes that dip or not adds
  a fair bit of noise.

  You can even see the effects of autovacuum workers launching - even if
  there's nothing to do! Not as a huge dip, but enough to add some "run to
  run" variation.

- How much other dirty data is there in the kernel pagecache

  If you e.g. just built a new binary, even with just minor changes, the
  kernel will need to flush those pages eventually, which may contend for IO
  and increase page faults. Rebuilding an optimized build generates
  something like 1GB of dirty data. Particularly with ccache, that'll
  typically not yet be flushed by the time you run a benchmark. That's not
  nothing, even with a decent NVMe SSD.

- many more, unfortunately
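To make the first few of these a bit more concrete, here's a rough sketch of
how one could check / control some of them on Linux. Treat it as an
illustration rather than a recipe - the core numbers, the pgbench invocation
and the reliance on smaps_rollup / FilePmdMapped (which needs a reasonably
recent kernel) are just assumptions for the example:

  # illustrative only - adjust cores, paths and database name
  # reduce the impact of cpu frequency scaling (needs the cpupower tool)
  sudo cpupower frequency-set -g performance

  # pin the server and pgbench to specific cores yourself, instead of
  # letting the scheduler decide differently on every run
  # (use the same core for both to get the colocated numbers)
  taskset -c 2 postgres -D "$PGDATA" &
  taskset -c 3 pgbench -n -S -c 1 -j 1 -P 1 -T 30 postgres

  # see how much of the binary ended up mapped as file-backed huge pages,
  # here for the postmaster
  grep FilePmdMapped /proc/$(pgrep -o -x postgres)/smaps_rollup

  # flush dirty data left over from a rebuild before measuring
  sync; grep Dirty /proc/meminfo

That obviously doesn't get rid of the huge page lottery itself, but it at
least makes it visible whether two runs were comparable in that regard.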
Greetings,

Andres Freund