Re: Vacuum statistics

Jim Nasby Mon, 28 Oct 2024 14:04:16 -0700


> On Oct 28, 2024, at 2:07 PM, Alexander Korotkov <aekorot...@gmail.com> wrote:
> 
>> I suppose that if we turn off statistics collection for a certain object, we 
>> can miss it. In addition, the user may not enable the parameter for the 
>> object in time, because he will forget about it.
> 
> I agree with this point.  Additionally, in order to benefit from
> gatherting vacuum statistics only for some relations in terms of
> space, we need to handle variable-size stat entries.  That would
> greatly increase the complexity.


Could vacuum stats be treated as a separate category instead of adding it to 
PgStat_TableCounts?

>> As for the second option, now I cannot say which statistics can be removed, 
>> to be honest. So far, they all seem necessary.
> 
> Yes, but as Masahiko-san pointed out, PgStat_TableCounts is almost
> tripled in space.  That a huge change from having no statistics on
> vacuum to have it in much more detail than everything else we
> currently have.  I think the feasible way might be to introduce some
> most demanded statistics first then see how it goes.

Looking at the stats I do think the WAL stats are probably not helpful. First, 
there’s nothing users can do to tune how much WAL is generated by vacuum. 
Second, this introduces the risk of users saying “Wow, vacuum is creating a lot 
of WAL! I’m going to turn it down!”, which is most likely to make matters 
worse. There’s already a lot of stuff that goes into WAL without any detailed 
logging; if we ever wanted to provide a comprehensive view of what data is in 
WAL that should be handled separately.

The rest of the stats all look important. In fact, I think there’s even more 
stats that could be included (such as all frozen/visible pages skipped) - even 
more reason to look at having separate controls for tracking vacuum stats. 
There’s also an argument to be made for tracking autovac separately from manual 
vacuum. So long-term we might want to look at other ways to handle these stats, 
not only because of the large number of stats, but because they would be 
updated very infrequently compared to other stats counters. Ironically, the old 
stats system would probably have been more than sufficient for these stats. 
Tracking them in a real table might also be an option.

Is there a reason some fields are omitted from pg_stat_vacuum_database? While 
some stats are certainly more interesting at the per-relation level, I can’t 
really think of any that don’t make sense at the database level as well.

Looking at the per table/index stats, I strongly dislike the use of the term 
“delete” - it is a recipe for confusion with row deletion.. A much better term 
is “remove” or “removed”. I realize the term “delete” is used in places in 
vacuum logging, but IMO we should fix that as well instead of doubling-down on 
it.

I think “interrupts” is also a very confusing name - those fields should just 
be called “errors”.

I realize “relname” is being used for consistency with 
pg_stat_all_(tables|indexes), but I’m not sure it makes sense to double-down on 
that. Especially in pg_stat_vacuum_indexes, where it’s not completely clear 
whether relname is referring to the table or the index. I’m also inclined to 
say that the name of the table should be included in pg_stat_vacuum_indexes.

For all the views the docs should clarify that total_blks_written means blocks 
written by vacuum, as opposed to the background writer. Similarly they should 
clarify the difference between rel_blks_(read|hit) and total_blks_(read|hit). 
In the case of pg_stat_vacuum_indexes it’d be better if rel_blks_(read|hit) 
were called index_blks_(read|hit). Although… if total_blks_* is actually the 
count across the table and all the indexes I don’t know that we even need that 
counter. I realize that not ever vacuum even looks at the indexes, but if we’re 
going to go into that level of detail then we would (at minimum) need to count 
the number of times a vacuum completely skipped scanning the indexes.

Having rev_all_(frozen|visible)_pages in the same view as vacuum stats will 
confuse users into thinking that vacuum is clearing the bits. Those fields 
really belong in pg_stat_all_tables.

Sadly index_vacuum_count is may not useful at all at present. At minimum you’d 
need to know the number of times vacuum had run in total. I realize that’s in 
pg_stat_all_tables, but that doesn’t help if vacuum stats are tracked or reset 
separately. At minimum the docs should mention them. They also need to clarify 
if index_vacuum_count is incremented per-index or per-pass (hopefully the 
later). Assuming it’s per-pass, a better name for the field would be 
index_vacuum_passes, index_passes, index_pass_count, or similar. But even with 
that we still need a counter for the number of vacuums where index processing 
was skipped.

Other items
First, thanks to everyone that’s put work into this patch - it’s a big step 
forward. I certainly don’t want the perfect to be the enemy of the good, but 
since the size of these stats entries has already come up as a concern I want 
to consider use cases that would still not be covered by this patch. I’m not 
suggesting these need to be added now, but IMHO they’re logical next steps 
(that would also mean more counters). The cases below would probably mean at 
least doubling the number of vacuum-related counters, at least at the table 
level.

First, there’s still gaps in trying to track HOT; most notably a counter for 
how many updates would never be HOT eligible because they modify indexes. 
pg_stat_all_tables.n_tup_newpage_upd is really limited without that info.

There should also be stats about unused line pointers - in degenerate cases the 
lp array can consume a significant portion of heap storage.

Monitoring bloat would be a lot more accurate if vacuum reported total tuple 
length for each run along with the total number of tuples it looked at. Having 
that info would make it trivial to calculate average tuple size, which could 
then be applied to reltuples and relpages to calculate how much space would 
being lost to bloat.

Autovacuum will self-terminate if it would block another process (unless it’s 
an aggressive vacuum) - that’s definitely something that should be tracked. Not 
just the number of times that happens, but also stats about how much work was 
lost because of this.

Shrinking a relation (what vacuum calls truncation, which is very confusing 
with the truncate command) is a rather complex process that currently has no 
visibility.

Tuning vacuum_freeze_min_age (and the MXID variant) is rather complicated. We 
maybe have enough stats on whether it could be set lower, but there’s no 
visibility on how the settings affect how often vacuum decides to be 
aggressive. At minimum, we should have stats on when vacuum is aggressive, 
especially since it significantly changes the behavior of autovac.

I saw someone else already mentioned tuning vacuum memory usage, but I’ll 
mention it again. Even if the issues with index_vacuum_count are fixed that 
still only tells you if you have a problem; it doesn’t give you a great idea of 
how much more memory you need. The best you can do is assuming you need (number 
of passes - 1) * current memory.

Speaking of which… there should be stats on any time vacuum decided on it’s own 
to skip index processing due to wraparound proximity.

I’m sure there’s some other use cases that I’m not thinking of.

Re: Vacuum statistics

Reply via email to