Hi,

On Tue, Jul 22, 2025 at 10:07:30AM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> > Anyway, let's forget about eBPF, I ran another experiment by counting the
> > cycles with:
> >
> > static inline uint64_t rdtsc(void) {
> >     uint32_t lo, hi;
> >     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
> >     return ((uint64_t)hi << 32) | lo;
> > }
> >
> > and then calling this function before and after waitEventIncrementCounter()
> > and also at wait_start() and wait_end() (without the increment counters
> > patches).
>
> I think you're still going to get massively increased baseline numbers that
> way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs
> costs somewhere between 60-90 cycles. Which means that any increase due to
> counters & timers will look a lot cheaper if compared to that increased
> baseline, than if you compared to the actual current cost of a wait event.
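For the computation below I assume a fixed cost "O" for the pair of rdtsc
calls. A minimal standalone sketch (not part of the patch set; assumes
x86_64 and GCC/Clang) to calibrate that cost on a given machine could look
like:

    /* calibrate the cost of two back-to-back rdtsc calls, i.e. the
     * measurement overhead O added to every timed section */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        const int iters = 1000000;
        uint64_t total = 0;

        for (int i = 0; i < iters; i++)
        {
            uint64_t t0 = rdtsc();
            uint64_t t1 = rdtsc();  /* empty measured section */

            total += t1 - t0;
        }

        printf("avg back-to-back rdtsc cost: %.1f cycles\n",
               (double) total / iters);
        return 0;
    }

Running something like this should land in the 60-90 cycles range you
mention; the computation below assumes 80.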
Let's name the overhead O. What is done is computing:

    100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

Assuming the rdtscs cost is say 80 cycles, that gives:

    100 * (actual p50_counter_timing + 80) / (actual p50_wait_time + 80)

which means (as the measured p50_counter_timing is 146, i.e. 66 + 80):

    100 * (66 + 80) / (actual p50_wait_time + 80)

So, if we look at, say, the lock class result (0.03%), it means:

    actual p50_wait_time + 80 = 100 * 146 / 0.03 = 486667 cycles

which means that the 66-cycle counter increment overhead becomes negligible.

So, when:

    100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

results in very small numbers, I think that we can still rely on the ratio
to draw conclusions. The current computation could provide false negatives
but not false positives: since adding O inflates the numerator relatively
more than the (much larger) denominator, the measured ratio overestimates
the true one, so when the measured ratio is very small we can reliably
conclude that the counters overhead is negligible (a small standalone
sketch at the end of this mail makes this concrete).

> I think that's largely because there are relatively few such wait events,

The exact numbers are:

# cat make_check-world_cycles_percentiles.csv | grep ",lock" | awk -F, '{print $3,$4}'
advisory 332
applytransaction 44
extend 156
frozenid 2
object 2
relation 409
spectoken 1
transactionid 693
tuple 17
virtualxid 257

# cat make_check-world_cycles_percentiles.csv | grep ",timeout" | awk -F, '{print $3,$4}'
BASE_BACKUP_THROTTLE 5
CHECKPOINT_WRITE_DELAY 1262
PG_SLEEP 55
RECOVERY_APPLY_DELAY 4
RECOVERY_RETRIEVE_RETRY_INTERVAL 62
SPIN_DELAY 46
VACUUM_TRUNCATE 200

# cat make_check-world_cycles_percentiles.csv | grep ",lwlock" | awk -F, '{print $3,$4}'
AioWorkerSubmissionQueue 1498
BufferContent 44825
BufferMapping 808
CheckpointerComm 431
ControlFile 6
DynamicSharedMemoryControl 67
LockFastPath 1096
LockManager 1565
OidGen 1264
ParallelHashJoin 33
ParallelQueryDSA 17
PerSessionDSA 2
PgStatsDSA 136
PgStatsData 41
PgStatsHash 28
ProcArray 263
RelCacheInit 156
RelationMapping 3
ReplicationSlotAllocation 5
ReplicationSlotControl 1
SInvalRead 577
SInvalWrite 166
SerializableXactHash 2
SharedTidBitmap 12
TwoPhaseState 8
WALInsert 4935
WALWrite 5875
XactBuffer 1
XactSLRU 98
XidGen 56
unknown_event_95 1662

# cat make_check-world_cycles_percentiles.csv | grep ",client" | awk -F, '{print $3,$4}'
CLIENT_READ 330753
CLIENT_WRITE 6382
LIBPQWALRECEIVER_CONNECT 2137
LIBPQWALRECEIVER_RECEIVE 3875
WAL_SENDER_WAIT_FOR_WAL 7
WAL_SENDER_WRITE_DATA 2740
unknown_event_8 48

# cat make_check-world_cycles_percentiles.csv | grep ",bufferpin" | awk -F, '{print $3,$4}'
BUFFER_PIN 89

So, while some of these events are indeed rare, others (CLIENT_READ,
BufferContent, WALWrite, WALInsert, ...) occur in decent numbers.

> because there is very very little contention in the regression tests and we
> just don't do a whole lot intensive things in the tests. I suspect that at
> least some of the high events here will actually be due to tests that
> explicitly test the contention behaviour, and thus will have very high wait
> times.

Yeah, probably; I would need to check in more detail.

> E.g. if you measure client timings, the overhead here will be fairly low,
> because we're not going to be CPU bound by the back/forth between client and
> server, and thus many of the waits will be longer. If you instead measure a
> single client readonly pgbench, it'll look different. Similarly, if you have
> lwlock contention in a real world workload, most of the waits will be
> incredibly short, but in our tests that will not necessarily be the case.

Agreed.

> For counters + timings I can see that it'd be useful. But I don't believe
> it's close to as cheap as you say it is.

Thanks for the confirmation.
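As mentioned above, here is a small standalone sketch making the
"false negatives but not false positives" argument concrete (the values are
the lock class numbers from this mail; O = 80 cycles is the same assumption
as above):

    #include <stdio.h>

    int main(void)
    {
        double O = 80.0;               /* assumed cost of the two rdtsc calls */
        double counter = 66.0;         /* p50 counter-increment cost (cycles) */
        double measured_ratio = 0.03;  /* measured ratio for the lock class (%) */

        /* measured_ratio = 100 * (counter + O) / (wait + O), solved for wait */
        double wait = 100.0 * (counter + O) / measured_ratio - O;

        /* the overhead-free ratio we actually care about */
        double true_ratio = 100.0 * counter / wait;

        printf("actual p50_wait_time: %.0f cycles\n", wait);   /* ~486587 */
        printf("overhead-free ratio : %.4f%%\n", true_ratio);  /* ~0.0136% */

        /*
         * Because wait >> counter, the measured ratio always overestimates
         * the true one: a very small measured ratio cannot hide a large
         * true overhead.
         */
        return 0;
    }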
> You don't need precise duration to see what wait events are a problem. If
> you see that some event is sampled a lot you know it's because there either
> are a *lot* of those wait events or the wait events are entered into for a
> long time.

That's right, but tracking the duration would help to determine which case
it is: a *lot* of those wait events, or wait events that are entered into
for a long time.

That said, I also understand your concerns (and the ones shared by Robert in
this thread [1]), even if I do not necessarily agree with all of them, and I
also understand that you both think that there is room for instrumentation
improvements.

I'll try to think of other approaches (like the ones mentioned by Robert),
as I think that fully makes sense and they are probably even better (meaning
they provide more context). In the meantime, I'll hold off on the wait
events stats approach (as both of you are skeptical about it).

Regards,

[1]: https://www.postgresql.org/message-id/CA%2BTgmobptuUWo7X5zcQrWKh22qeAn4eL%2B%3Dwtb8_ajCOR%2B7_tcw%40mail.gmail.com

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com