Hi,

On 2023-11-15 10:09:06 -0500, Tom Lane wrote:
> "Anton A. Melnikov" <a.melni...@postgrespro.ru> writes:
> > I can't understand why i get the opposite results on my pc and on the
> > server. It is clear that the absolute TPS values will be different for
> > various configurations. This is normal. But differences? Is it unlikely
> > that some kind of reference configuration is needed to accurately
> > measure the difference in performance. Probably something wrong with my
> > pc, but now i can not figure out what's wrong.
>
> > Would be very grateful for any advice or comments to clarify this problem.
>
> Benchmarking is hard :-(.
Indeed.

> IME it's absolutely typical to see variations of a couple of percent even
> when "nothing has changed", for example after modifying some code that's
> nowhere near any hot code path for the test case. I usually attribute this
> to cache effects, such as a couple of bits of hot code now sharing or not
> sharing a cache line.

FWIW, I think we're overusing that explanation in our community. Of course
you can encounter things like this, but the replacement policies of cpu
caches have gotten a lot better and the caches have gotten bigger too.

IME this kind of thing is typically dwarfed by much bigger variations from
things like

- cpu scheduling - whether the relevant pgbench thread is colocated on the
  same core as the relevant backend can make a huge difference, particularly
  when CPU power saving modes are not disabled (see the command sketch below
  this list). Just looking at tps from a fully cached readonly pgbench, with
  a single client:

    Power savings enabled, same core:       37493
    Power savings enabled, different core:  28539
    Power savings disabled, same core:      38167
    Power savings disabled, different core: 37365

- can transparent huge pages be used for the executable mapping, or not

  On newer kernels Linux (and some filesystems) can use huge pages for the
  executable. To what degree that succeeds is a large factor in performance.
  Single threaded read-only pgbench:

    postgres mapped without huge pages: 37155 TPS
    with 2MB of postgres as huge pages: 37695 TPS
    with 6MB of postgres as huge pages: 42733 TPS

  The really annoying thing about this is that it's entirely unpredictable
  whether huge pages are used or not. Building the same way, sometimes 0,
  sometimes 2MB, sometimes 6MB are mapped huge, even though the on-disk
  contents are precisely the same. And it can even change without
  rebuilding, if the binary is evicted from the page cache.

  This alone makes benchmarking extremely annoying. It basically can't be
  controlled and has huge effects.

- How long ago the server was started

  If e.g. one run of your benchmark happens on the first connection to a
  database, but after a restart it doesn't (e.g. because autovacuum starts
  up beforehand), you can get a fairly different memory layout and cache
  situation, due to [not] using the relcache init file - if it isn't used,
  you end up with a populated catcache, otherwise you don't.

  Another mean one is whether you start your benchmark within a relatively
  short time of the server starting. Readonly pgbench with a single client,
  started immediately after the server:

    progress: 12.0 s, 37784.4 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 13.0 s, 37779.6 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 14.0 s, 37668.2 tps, lat 0.026 ms stddev 0.001, 0 failed
    progress: 15.0 s, 32133.0 tps, lat 0.031 ms stddev 0.113, 0 failed
    progress: 16.0 s, 37564.9 tps, lat 0.027 ms stddev 0.012, 0 failed
    progress: 17.0 s, 37731.7 tps, lat 0.026 ms stddev 0.001, 0 failed

  There's a dip at 15s - odd. Turns out that's due to bgwriter writing a WAL
  record, which triggers walwriter to write it out and then initialize the
  whole of WAL buffers with 0s; that happens once. In this case I've
  exaggerated the effect a bit by using a 1GB wal_buffers, but it's visible
  otherwise too. Whether your benchmark period includes that dip or not adds
  a fair bit of noise.

  You can even see the effects of autovacuum workers launching - even if
  there's nothing to do! Not as a huge dip, but enough to add some "run to
  run" variation.

- How much other dirty data is there in the kernel pagecache

  If you e.g. just built a new binary, even with just minor changes, the
  kernel will need to flush those pages eventually, which may contend for IO
  and increase page faults. Rebuilding an optimized build generates
  something like 1GB of dirty data. Particularly with ccache, that'll
  typically not yet be flushed by the time you run a benchmark. That's not
  nothing, even with a decent NVMe SSD.

- many more, unfortunately
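To make the first few of these a bit more concrete, here's a rough sketch of
how one could check / control some of them on Linux. Treat it as an
illustration rather than a recipe - the core numbers, the pgbench invocation
and the reliance on smaps_rollup / FilePmdMapped (which needs a reasonably
recent kernel) are just assumptions for the example:

  # illustrative only - adjust cores, paths and database name
  # reduce the impact of cpu frequency scaling (needs the cpupower tool)
  sudo cpupower frequency-set -g performance

  # pin the server and pgbench to specific cores yourself, instead of
  # letting the scheduler decide differently on every run
  # (use the same core for both to get the colocated numbers)
  taskset -c 2 postgres -D "$PGDATA" &
  taskset -c 3 pgbench -n -S -c 1 -j 1 -P 1 -T 30 postgres

  # see how much of the binary ended up mapped as file-backed huge pages,
  # here for the postmaster
  grep FilePmdMapped /proc/$(pgrep -o -x postgres)/smaps_rollup

  # flush dirty data left over from a rebuild before measuring
  sync; grep Dirty /proc/meminfo

That obviously doesn't get rid of the huge page lottery itself, but it at
least makes it visible whether two runs were comparable in that regard.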
Greetings,

Andres Freund