Re: shared-memory based stats collector

Andres Freund Thu, 01 Apr 2021 19:44:44 -0700

Hi,

I spent quite a bit more time working on the patch. There's are large
changes:


- postmaster/pgstats.c (which is an incorrect location now that it's not
  a subprocess anymore) is split into utils/activity/pgstat.c and
  utils/activity/pgstat_kind.c. I don't love the _kind name, but I
  couldn't come up with anything better.

- Implemented a new GUC, stats_fetch_consistency = {none, cache,
  snapshot}. I think the code overhead of it is pretty ok - most of the
  handling is entirely generalized.

- Most of the "per stats kind" handling is done in pgstat_kind.c. Nearly
  all the rest is done through an array with per-stats-kind information
  (extending what was already done with pgstat_sharedentsize etc).

- There is no separate "pending stats" hash anymore. If there are
  pending stats, they are referenced from 'pgStatSharedRefHash' (which
  used to be the "lookup cache" hash). All the entries with pending
  stats are in double linked list pgStatPending.

- A stat's entry's lwlock, refcount, .. are moved into the dshash
  entry. There is no need for them to be separate anymore. Also allows
  to avoid doing some dsa lookups while holding dshash locks.

- The dshash entries are not deleted until the refcount has reached
  0. That's an important building block to avoid constantly re-creating
  stats when flushing pending stats for a dropped object.

- The reference to the shared entry is established the first time stats
  for an object are reported. Together with the previous entry that
  avoids nearly all the avenues for re-creating already dropped stats
  (see below for the hole).

- I addded a bunch of pg_regress style tests, and a larger amount of
  isolationtester tests. The latter are possibly due to a new
  pg_stat_force_next_flush() function, avoiding the need to wait until
  stats are submitted.

- 2PC support for "precise" dropping of stats has been added, the
  collect_oids() based approach removed.

- lots of bugfixes, comments, etc...


I know of one nontrivial issue that can lead to dropped stats being
revived:

Within a transaction, a functions can be called even when another
transaction that dropped that function has already committed. I added a
spec test reproducing the issue:

# FIXME: this shows the bug that stats will be revived, because the
# shared stats in s2 is only referenced *after* the DROP FUNCTION
# committed. That's only possible because there is no locking (and
# thus no stats invalidation) around function calls.
permutation
  "s1_track_funcs_all" "s2_track_funcs_none"
  "s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" 
"s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"

I think the best fix here would be to associate an xid with the dropped
stats object, and only delete the dshash entry once there's no alive
with a horizon from before that xid...


There's also a second issue (stats for newly created objects surviving
the transaction), but that's pretty simple to resolve.

Here's all the gory details of my changes happening incrementally:

https://github.com/anarazel/postgres/compare/master...shmstat

I'll squash and split tomorrow. Too tired for today.


I think this is starting to look a lot better than what we have now. But
I'm getting less confident that it's realistic to get any of this into
PG14, given the state of the release cycle.



> I'm impressed that the way you resolved "who should load stats". Using
> static shared memory area to hold the point to existing DSA memory
> resolves the "first attacher problem".  However somewhat doubtful
> about the "who should write the stats file", I think it is reasonable
> in general.
>
> But the current place of calling pgstat_write_stats() is a bit to
> early.  Checkpointer reports some stats *after* calling
> ShutdownXLOG().  Perhaps we need to move it after pg_stat_report_*()
> calls in HandleCheckpointerInterrupts().

I now moved it into a before_shmem_exit(). I think that should avoid
that problem?



> https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
>
> +              * XXX: Why is this a good place to do this?
>
> Agreed. We don't need to so haste to clean up stats entries.  We could
> run that in pgstat_reporT_stat()?

I've not changed that yet, but I had the same thought.


> Agreed that it's better to move database stat entries to fixed
> pointers.

I actually ended up reverting that. My main motivation for it was that
it was problematic that new pending database stats entries could be
created at some random place in the hashtable. But with the linked list
of pending entries that's not a problem anymore. And I found it
nontrivial to manage the refcounts to the shared entry accurately this
way.

We could still add a cache for the two stats entries though...


> > - consider removing PgStatTypes and replacing it with the oid of the
> >   table the type of stats reside in. So PGSTAT_TYPE_DB would be
> >   DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
> >
> >   I think that'd make the system more cleanly extensible going forward?
>
> I'm not sure that works as expected.  We already separated repliation
> stats from the unified stats hash and pgstat_read/write_statsfile()
> needs have the corresponding specific code path.

I didn't quite go towards my proposal, but I think I got a lot closer
towards not needing much extra code for additional types of stats. I
even added an XXX to pgstat_read/write_statsfile() that show how they
now could be made generic.


> > - the replication slot stuff isn't quite right in my branch
>
> Ah, yeah. As I mentioned above I think it should be in the unified
> stats and should have a special means of shotcut.  And the global
> stats also should be the same.

The problem is that I use indexes for addressing, but that they can
change between restarts. I think we can fix that fairly easily, by
mapping names to indices once, pgstat_restore_stats().  At the point we
call pgstat_restore_stats() StartupReplicationSlots() already was
executed, so we can just inquire at that point...

Greetings,

Andres Freund

Re: shared-memory based stats collector

Reply via email to