Hi,

On Tue, Jul 22, 2025 at 10:07:30AM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-07-22 12:24:46 +0000, Bertrand Drouvot wrote:
> > Anyway, let's forget about eBPF, I ran another experiment by counting the 
> > cycles
> > with:
> >
> > static inline uint64_t rdtsc(void) {
> >     uint32_t lo, hi;
> >     __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
> >     return ((uint64_t)hi << 32) | lo;
> > }
> >
> > and then calling this function before and after waitEventIncrementCounter()
> > and also at wait_start() and wait_end() (without the increment counters
> > patches).
> 
> I think you're still going to get massively increased baseline numbers that
> way - the normal cost of a wait event is under 10 cycles. Doing two rdtscs
> costs somewhere between 60-90 cycles. Which means that any increase due to
> counters & timers will look a lot cheaper if compared to that increased
> baseline, than if you compared to the actual current cost of a wait event.

Let's call the overhead O. What is computed is:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

Assuming the two rdtscs cost, say, 80 cycles, that gives:

100 * (actual p50_counter_timing + 80) / (actual p50_wait_time + 80)

which means (as the measured p50_counter_timing is 146, i.e. an actual value of 66):

100 * (66 + 80) / (actual p50_wait_time + 80)

So, if we look at, say, the lock class result (0.03%), it means:

actual p50_wait_time + 80 = 100 * 146 / 0.03 = 486667 cycles

So the 66-cycle counter increment overhead becomes negligible compared to the
actual wait time.

So, when:

100 * (actual p50_counter_timing + O) / (actual p50_wait_time + O)

results in very small numbers, I think we can still rely on the ratio to draw
conclusions. The current computation could produce false negatives but not
false positives: adding O to both the numerator and the denominator can only
inflate the ratio (since the counter timing is smaller than the wait time), so
when the ratio is very small we can reliably conclude that the counters
overhead is negligible.
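
FWIW, here is a minimal, self-contained C sketch of that back-of-the-envelope
computation (the values are the ones above, and the 80-cycle rdtsc cost is
only an assumption):

#include <stdio.h>

int
main(void)
{
    double O = 80.0;          /* assumed cost of the two rdtscs */
    double measured = 146.0;  /* measured p50_counter_timing, in cycles */
    double ratio = 0.03;      /* reported ratio for the lock class, in % */

    /* ratio = 100 * measured / (actual_wait + O) */
    double actual_wait = 100.0 * measured / ratio - O;
    double actual_counter = measured - O;

    printf("actual p50_counter_timing: %.0f cycles\n", actual_counter);
    printf("actual p50_wait_time:      %.0f cycles\n", actual_wait);
    printf("actual overhead:           %.4f %%\n",
           100.0 * actual_counter / actual_wait);

    return 0;
}

which prints 66 cycles, ~486587 cycles and ~0.014% respectively.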

> I think that's largely because there is relatively few such wait events,

The exact numbers are:

# cat make_check-world_cycles_percentiles.csv | grep ",lock" | awk -F, '{print $3,$4}'
advisory 332
applytransaction 44
extend 156
frozenid 2
object 2
relation 409
spectoken 1
transactionid 693
tuple 17
virtualxid 257

# cat make_check-world_cycles_percentiles.csv | grep ",timeout" | awk -F, '{print $3,$4}'
BASE_BACKUP_THROTTLE 5
CHECKPOINT_WRITE_DELAY 1262
PG_SLEEP 55
RECOVERY_APPLY_DELAY 4
RECOVERY_RETRIEVE_RETRY_INTERVAL 62
SPIN_DELAY 46
VACUUM_TRUNCATE 200

# cat make_check-world_cycles_percentiles.csv | grep ",lwlock" | awk -F, '{print $3,$4}'
AioWorkerSubmissionQueue 1498
BufferContent 44825
BufferMapping 808
CheckpointerComm 431
ControlFile 6
DynamicSharedMemoryControl 67
LockFastPath 1096
LockManager 1565
OidGen 1264
ParallelHashJoin 33
ParallelQueryDSA 17
PerSessionDSA 2
PgStatsDSA 136
PgStatsData 41
PgStatsHash 28
ProcArray 263
RelCacheInit 156
RelationMapping 3
ReplicationSlotAllocation 5
ReplicationSlotControl 1
SInvalRead 577
SInvalWrite 166
SerializableXactHash 2
SharedTidBitmap 12
TwoPhaseState 8
WALInsert 4935
WALWrite 5875
XactBuffer 1
XactSLRU 98
XidGen 56
unknown_event_95 1662

# cat make_check-world_cycles_percentiles.csv | grep ",client" | awk -F, '{print $3,$4}'
CLIENT_READ 330753
CLIENT_WRITE 6382
LIBPQWALRECEIVER_CONNECT 2137
LIBPQWALRECEIVER_RECEIVE 3875
WAL_SENDER_WAIT_FOR_WAL 7
WAL_SENDER_WRITE_DATA 2740
unknown_event_8 48

# cat make_check-world_cycles_percentiles.csv | grep ",bufferpin" | awk -F, '{print $3,$4}'
BUFFER_PIN 89

Some of those have only a few occurrences, while others have more decent numbers.

> because there is very very little contention in the regression tests and we
> just don't do a whole lot intensive things in the tests. I suspect that at
> least some of the high events here will actually be due to tests that
> explicitly test the contention behaviour, and thus will have very high wait
> times.

Yeah, probably; I would need to check in more detail.

> E.g. if you measure client timings, the overhead here will be fairly low,
> because we're not going to be CPU bound by the back/forth between client and
> server, and thus many of the waits will be longer. If you instead measure a
> single client readonly pgbench, it'll look different.  Similar, if you have
> lwlock contention in a real world workload, most of the waits will be
> incredibly short, but in our tests that will not necessarily be the case.

Agreed.

> For counters + timings I can see that it'd be useful. But i don't believe it's
> close to as cheap as you say it is.

Thanks for the confirmation.

> You don't need precise duration to see what wait events are a problem. If you
> see that some event is samples a lot you know it's because there either are a
> *lot* of those wait events or the wait events are entered into for a long
> time.

That's right, but tracking the duration would help determine which case it is:
a *lot* of those wait events, or wait events that are entered into for a long
time.
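
As a (made-up) illustration: 10,000 waits of 10,000 cycles each and 100 waits
of 1,000,000 cycles each amount to the same total time spent waiting, so
sampling alone sees them as roughly equivalent, while a per-event counter plus
a cumulative timer separates them immediately:

#include <stdio.h>

int
main(void)
{
    /* hypothetical numbers: same total wait time, very different profiles */
    unsigned long count_a = 10000, each_a = 10000;    /* many short waits */
    unsigned long count_b = 100,   each_b = 1000000;  /* few long waits   */

    printf("A: count=%lu total=%lu avg=%lu cycles\n",
           count_a, count_a * each_a, (count_a * each_a) / count_a);
    printf("B: count=%lu total=%lu avg=%lu cycles\n",
           count_b, count_b * each_b, (count_b * each_b) / count_b);

    return 0;
}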

That said, I also understand your concerns (and the ones shared by Robert in
this thread [1]), even if I do not necessarily agree with all of them, and I
also understand that you both think that there is room for instrumentation
improvements.

I'll try to think of other approaches (like the ones mentioned by Robert), as
I think that fully makes sense and they are probably even better (meaning they
provide more context).

In the meantime, I'll put the wait events stats approach on hold (as both of
you are skeptical about it).

Regards,

[1]: 
https://www.postgresql.org/message-id/CA%2BTgmobptuUWo7X5zcQrWKh22qeAn4eL%2B%3Dwtb8_ajCOR%2B7_tcw%40mail.gmail.com

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

