Hi,

This is an area that genuinely lacks observability.  However, I have some 
concerns about this patch that I think we should review. 
The condition being logged is correct, intentional behavior.  The skip 
mechanism is designed exactly for this case: 
Worker B backs off, moves to the next table, and the system makes progress. 
Logging correct behavior as if it were a warning conflates a healthy scheduler 
decision with a fault condition.
On any busy OLTP system with autovacuum_max_workers > 1, workers will skip 
tables held by other workers in every vacuum cycle.  That is not a transient 
edge case; it is the steady state of a loaded database.

This means the GUC has two operating modes: on a quiet system it never fires 
(no value), and on a busy system it always fires (pure noise).
The checkpoint_warning analogy does not hold up.  checkpoint_warning fires when 
the system deviates from healthy behavior (checkpoints too frequent).  This GUC 
fires during an expected behavior.  Furthermore, checkpoint_warning is an 
integer (seconds) with built-in rate limiting via elapsed time comparison; a 
bare boolean offers none of that, so on a loaded system it would emit one log 
line per skipped table per vacuum cycle per worker.
If the goal is to detect genuine autovacuum saturation, I think a better case 
to report would be a worker that completes an entire vacuum cycle having done 
no work at all because every candidate table was already held by another 
worker.  That condition is already tracked, fires once per wasted cycle rather 
than once per table, and is a strong signal that a worker slot was completely 
wasted.  That is worth a single LOG.
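
To make the once-per-cycle idea concrete, here is a rough standalone C sketch. 
It is not PostgreSQL code: WorkerCycleStats, cycle_was_wasted() and 
report_cycle() are hypothetical names made up for illustration, and the 
message text is only a placeholder.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-cycle counters; not an actual PostgreSQL structure. */
typedef struct WorkerCycleStats
{
    int candidates;   /* tables the worker considered in this cycle */
    int skipped;      /* tables skipped because another worker held them */
} WorkerCycleStats;

/* True only if there were candidates and every one of them was skipped. */
static bool
cycle_was_wasted(const WorkerCycleStats *stats)
{
    return stats->candidates > 0 && stats->skipped == stats->candidates;
}

/*
 * Emit at most one line per cycle, and only for a fully wasted cycle.
 * Returns the number of lines written (0 or 1).
 */
static int
report_cycle(const WorkerCycleStats *stats, FILE *out)
{
    if (!cycle_was_wasted(stats))
        return 0;

    fprintf(out, "LOG:  autovacuum worker skipped all %d candidate tables\n",
            stats->candidates);
    return 1;
}
```

With this shape, a worker that skips three of five tables stays silent, and 
only a worker that skips every candidate produces one line for the cycle.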

Also, when it comes to the name of the GUC, shouldn't we follow the 
log_autovacuum_* pattern?

Regards,
Demir.
