Hi,

Thank you for the review.
cfbot found a few compiler warnings:

> https://cirrus-ci.com/task/6526903542087680
> [16:47:46.964] make -s -j${BUILD_JOBS} clean
> [16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
> [16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
> [16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
> [16:49:10.496]   467 |         bool found;
> [16:49:10.496]       |              ^~~~~
> [16:49:10.496] cc1: all warnings being treated as errors
> [16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
> [16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
> [16:49:10.496] make[3]: *** Waiting for unfinished jobs....
> [16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
> [16:49:11.881] make[2]: *** Waiting for unfinished jobs....
> [16:49:20.195] dynahash.c: In function ‘hash_create’:
> [16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
> [16:49:20.195]   643 |         curr_offset = (((char *)curr_offset) + (temp * elementSize));
> [16:49:20.195]       |         ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
> [16:49:20.195]   588 |         void *curr_offset;
> [16:49:20.195]       |               ^~~~~~~~~~~
> [16:49:20.195] cc1: all warnings being treated as errors
> [16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1

Fixed these.

> > diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> > index cd5a00132f..5203f5b30b 100644
> > --- a/src/backend/utils/hash/dynahash.c
> > +++ b/src/backend/utils/hash/dynahash.c
> > @@ -120,7 +120,6 @@
> >   * a good idea of the maximum number of entries!). For non-shared hash
> >   * tables, the initial directory size can be left at the default.
> >   */
> > -#define DEF_SEGSIZE 256
> >  #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */
> >  #define DEF_DIRSIZE 256
>
> Why did you move this to the header? Afaict it's just used in
> hash_get_shared_size(), which is also in dynahash.c?

Yes. This was accidentally left behind by the previous version of the patch,
so I undid the change.

> >  static void register_seq_scan(HTAB *hashp);
> >  static void deregister_seq_scan(HTAB *hashp);
> >  static bool has_seq_scans(HTAB *hashp);
> > -
> > +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
> >
> >  /*
> >   * memory allocation support
>
> You removed a newline here that probably shouldn't be removed.

Fixed this.

> > @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> >  	else
> >  		hashp->keycopy = memcpy;
> >
> > -	/* And select the entry allocation function, too. */
> > +	/*
> > +	 * And select the entry allocation function, too. XXX should this also
> > +	 * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> > +	 * currently only set with HASH_SHARED_MEM *
> > +	 */
> >  	if (flags & HASH_ALLOC)
> >  		hashp->alloc = info->alloc;
> >  	else
> > @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> >
> >  	hashp->frozen = false;
> >
> > +	/* Initializing the HASHHDR variables with default values */
> >  	hdefault(hashp);
> >
> >  	hctl = hashp->hctl;
>
> I assume these were just observations you made while looking into this? They
> seem unrelated to the change itself?

Yes. I removed the first one and left the second one as a code comment.

> > @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> >  		freelist_partitions,
> >  		nelem_alloc,
> >  		nelem_alloc_first;
> > -
> > +	void	*curr_offset;
> > +
> >  	/*
> >  	 * If hash table is partitioned, give each freelist an equal share of
> >  	 * the initial allocation. Otherwise only freeList[0] is used.
> > @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> >  	else
> >  		freelist_partitions = 1;
> >
> > +	/*
> > +	 * If table is shared, calculate the offset at which to find the
> > +	 * the first partition of elements
> > +	 */
> > +	if (hashp->isshared)
> > +	{
> > +		int nsegs;
> > +		int nbuckets;
> > +		nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> > +
> > +		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> > +					   + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> > +	}
> > +
>
> Why only do this for shared hashtables? Couldn't we allocate the elements
> together with the rest for non-shared hashtables too?

I think it is possible to consolidate the allocations for non-shared hash
tables too. However, the initial element allocation is much smaller for
non-shared hash tables, since they can be expanded easily, so there is
probably less benefit in doing the same there. In addition, the proposed
changes are targeted at improving the monitoring in pg_shmem_allocations,
which does not apply to non-shared hash tables. So while I believe it is
feasible, I am uncertain about the utility of such a change.
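To spell out the pointer arithmetic the hunk above relies on, here is a rough,
self-contained sketch of the assumed layout: header, then the segment
directory, then the segments' bucket arrays, then the pre-allocated elements.
The function and parameter names below are made up for illustration; the
actual code works with the HASHHDR, HASHSEGMENT and HASHBUCKET types private
to dynahash.c.

#include <stddef.h>

/*
 * Sketch only: where the pre-allocated element block starts inside a single
 * shared-memory chunk laid out as
 *
 *   [ header | directory: dsize segment pointers |
 *     nsegs segments of ssize bucket pointers each | elements ... ]
 *
 * hdr_size, seg_ptr_size and bucket_size stand in for sizeof(HASHHDR),
 * sizeof(HASHSEGMENT) and sizeof(HASHBUCKET).
 */
static char *
element_block_start(char *chunk,
					size_t hdr_size,
					size_t seg_ptr_size,
					size_t bucket_size,
					long dsize,
					long ssize,
					int nsegs)
{
	return chunk
		+ hdr_size
		+ (size_t) dsize * seg_ptr_size
		+ (size_t) bucket_size * ssize * nsegs;
}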
> Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
> it's the best thing we can do to avoid duplicating code, but it seems worth
> checking if we can do better. Perhaps we could split element_alloc() into
> element_alloc() and element_add() or such? With the latter doing everything
> after hashp->alloc().

Makes sense. I split element_alloc() into element_alloc() and element_add().

> > -
> > +
> >  	/*
> >  	 * initialize mutexes if it's a partitioned table
> >  	 */
>
> Spurious change.

Fixed.

> A function called find_num_of_segs() that also sets nbuckets seems a bit
> confusing. I also don't like "find_*", as that sounds like it's searching
> some datastructure, rather than just doing a bit of math.

I renamed it to compute_buckets_and_segs(). I am open to better suggestions.
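For reference, a standalone sketch of the arithmetic such a helper performs,
mirroring the nbuckets/nsegs computation in init_htab(). The names and exact
signature below are placeholders and may differ from the patch.

/* Round up to the next power of 2 (stand-in for dynahash's next_pow2_int). */
static int
next_pow2(long num)
{
	int			i = 1;

	while (i < num)
		i <<= 1;
	return i;
}

/*
 * Sketch of a compute_buckets_and_segs()-style helper: sets *nbuckets to the
 * power-of-2 bucket count needed for nelem entries and returns the number of
 * directory segments (ssize buckets each) needed to cover them.
 */
static int
compute_buckets_and_segs_sketch(long nelem, long num_partitions, long ssize,
								int *nbuckets)
{
	int			nsegs;

	/* one bucket per expected entry, rounded up to a power of 2 ... */
	*nbuckets = next_pow2(nelem);

	/* ... but a partitioned table needs at least one bucket per partition */
	while (*nbuckets < num_partitions)
		*nbuckets <<= 1;

	/* number of ssize-bucket segments needed, rounded up to a power of 2 */
	nsegs = (*nbuckets - 1) / ssize + 1;
	return next_pow2(nsegs);
}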
> > 		segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
> >
> > 		if (!segp)
>
> Spurious change.

Fixed.

> > -static int
> > +int
> >  next_pow2_int(long num)
> >  {
> >  	if (num > INT_MAX / 2)
> > @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
> >  	}
> >  }
> > }
>
> Why export this?

It was a stale change; I removed it now.

> > diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> > index 932cc4f34d..5e16bd4183 100644
> > --- a/src/include/utils/hsearch.h
> > +++ b/src/include/utils/hsearch.h
> > @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
> >  extern void hash_freeze(HTAB *hashp);
> >  extern Size hash_estimate_size(long num_entries, Size entrysize);
> >  extern long hash_select_dirsize(long num_entries);
> > -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> > +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
> >  extern void AtEOXact_HashTables(bool isCommit);
> >  extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
>
> It's imo a bit weird that we have very related logic in hash_estimate_size()
> and hash_get_shared_size(). Why do we need to duplicate it?

hash_estimate_size() estimates the size using default values, whereas
hash_get_shared_size() calculates it from the specific values selected by the
flags associated with the hash table. For instance, the segment size used by
the former is DEF_SEGSIZE, while the latter uses info->ssize, which is set
when the HASH_SEGMENT flag is passed. Hence, they can return different shared
memory sizes.

> These I'd just combine with the ShmemInitStruct("PredXactList"), by
> allocating the additional space. The pointer math is a bit annoying, but it
> makes much more sense to have one entry in pg_shmem_allocations.

Fixed accordingly.

> > -		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> > +		(TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
> >  	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> > -	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> > +	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
> >  	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> > -	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> > +	ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> >  	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
> >
> >  	/*
>
> Same.
>
> Although here I'd say it's worth padding the size of each separate
> "allocation" by PG_CACHE_LINE_SIZE.

Made this change.
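Roughly, the idea looks like the sketch below: a single ShmemInitStruct()
call sized for all three arrays, each sub-array rounded up to a cache line
with CACHELINEALIGN(), and the pointers carved out with pointer arithmetic.
The "Proc Arrays" name and the local variable names are placeholders, not
necessarily what the patch uses.

	Size		xidsSize = CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
	Size		subxidSize = CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
	Size		flagsSize = CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->statusFlags));
	char	   *ptr;
	bool		found;

	/* one pg_shmem_allocations entry instead of three anonymous ShmemAlloc() chunks */
	ptr = (char *) ShmemInitStruct("Proc Arrays",
								   xidsSize + subxidSize + flagsSize,
								   &found);
	MemSet(ptr, 0, xidsSize + subxidSize + flagsSize);

	ProcGlobal->xids = (TransactionId *) ptr;
	ptr += xidsSize;
	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
	ptr += subxidSize;
	ProcGlobal->statusFlags = (uint8 *) ptr;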
> > -	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> > +	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
> >  	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
> >
> >  	/* For asserts checking we did not overflow. */
>
> This one might actually make sense to keep separate; depending on the
> configuration it can be reasonably big (max_connections = 1k,
> max_locks_per_transaction = 1k results in ~5MB).

OK.

PFA the rebased patches with the above changes. Kindly let me know your views.

Thank you,
Rahila Syed

v2-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data
v2-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data