On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >

Thank you for the comments!

> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it 
> wasn't clear we should. After looking again at the node types, I must not 
> have thought through this at all. Storing one byte instead of four for the 
> full enum is a good step, but saving one more byte usually doesn't buy 
> anything because of padding, with a few exceptions like this example:
>
> node4:   4 +  4           +  4*8 =   40
> node4:   5 +  4+(7)       +  4*8 =   48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And 
> with my idea of decoupling size classes from kind, the variable-sized kinds 
> will require another byte to store "capacity". Then, even if the kind gets 
> encoded in a pointer tag, we'll still have 5 bytes in the base type. So I 
> think we should assume 5 bytes from the start. (Might be 6 temporarily if I 
> work on size decoupling first).

True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so for now I'm thinking of packing the node kind and node
capacity class into a uint8.

>
> (Side note, if you have occasion to use bitfields again in the future, C99 
> has syntactic support for them, so no need to write your own shifting/masking 
> code).

Thanks!

>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are 
> addressed, we can proceed independently with the above two items. Seeing the 
> DSA work will also inform me how invasive pointer tagging will be. There will 
> still be some performance tuning and cosmetic work, but it's getting closer.
>

I've made some progress on investigating DSA support. I've written a
draft patch for it, and the regression tests pass. I'll share it as a
separate patch for discussion along with the v8 radix tree patch.

While implementing DSA support, I realized that we may not need
pointer tagging to distinguish between a backend-local address and a
dsa_pointer. To get a backend-local address from a dsa_pointer, we
need to pass the dsa_area like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa area used by a shared radix tree is stored in
the radix_tree struct, so we can tell whether the radix tree is shared
by checking whether tree->dsa is NULL. That is, if the tree is shared
we treat a pointer to a radix tree node as a dsa_pointer, and
otherwise as a backend-local pointer. We don't need to encode anything
in the pointer itself.

> -------------------------
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)

Will fix.

>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and 
> potential non-SIMD callers will be better off writing their own byte-wise 
> loop rather than using this API. Especially since the "min" function exists 
> only as a workaround for lack of unsigned comparison in (at least) SSE2. 
> There is one existing function in this file with that idiom for non-assert 
> code (for completeness), but even there, inputs of current interest to us use 
> the uint64 algorithm.

Agreed. Will remove non-SIMD code.

>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << 
> sizeof(Vector8));
>
> Hmm?

That's an outdated memo of mine; I'll remove it.

>
> +/*
> + * Return index of the first element in chunks in the given node that is 
> greater
> + * than or equal to 'key'.  Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to the tail */
>
> It would be a bit more clear if node_*_search_ge() always returns the 
> position we need (see the prototype for example). In fact, these functions 
> are probably better named node*_get_insertpos().

Agreed.

>
> + if (likely(NODE_HAS_FREE_SLOT(n128)))
> + {
> + node_inner_128_insert(n128, chunk, child);
> + break;
> + }
> +
> + /* grow node from 128 to 256 */
>
> We want all the node-growing code to be pushed down to the bottom so that all 
> branches of the hot path are close together. This provides better locality 
> for the CPU frontend. Looking at the assembly, the above doesn't have the 
> desired effect, so we need to write like this (also see prototype):
>
> if (unlikely( ! has-free-slot))
>   grow-node;
> else
> {
>   ...;
>   break;
> }
> /* FALLTHROUGH */

Good point. Will change.

>
> + /* Descend the tree until a leaf node */
> + while (shift >= 0)
> + {
> +   rt_node    *child;
> +
> +   if (NODE_IS_LEAF(node))
> +     break;
> +
> +   if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
> +     child = rt_node_add_new_child(tree, parent, node, key);
> +
> +   Assert(child);
> +
> +   parent = node;
> +   node = child;
> +   shift -= RT_NODE_SPAN;
> + }
>
> Note that if we have to call rt_node_add_new_child(), each successive loop 
> iteration must search it and find nothing there (the prototype had a separate 
> function to handle this). Maybe it's not that critical yet, but something to 
> keep in mind as we proceed. Maybe a comment about it to remind us.

Agreed. Currently rt_extend() is used to add upper nodes, but we
probably need another function to add lower nodes for this case.

>
> + /* there is no key to delete */
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
> +   return false;
> +
> + /* Update the statistics */
> + tree->num_keys--;
> +
> + /*
> +  * Delete the key from the leaf node and recursively delete the key in
> +  * inner nodes if necessary.
> +  */
> + Assert(NODE_IS_LEAF(stack[level]));
> + while (level >= 0)
> + {
> +   rt_node    *node = stack[level--];
> +
> +   if (NODE_IS_LEAF(node))
> +     rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
> +   else
> +     rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
> +
> +   /* If the node didn't become empty, we stop deleting the key */
> +   if (!NODE_IS_EMPTY(node))
> +     break;
> +
> +   /* The node became empty */
> +   rt_free_node(tree, node);
> + }
>
> Here we call rt_node_search_leaf() twice -- once to check for existence, and 
> once to delete. All three search calls are inlined, so this wastes space. 
> Let's try to delete the leaf, return if not found, otherwise handle the leaf 
> bookkeeping and loop over the inner nodes. This might require some 
> duplication of code.

Agreed.

>
> +ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
>
> Spelling

Will fix.

>
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +             uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> gcc generates better code with something like this (but not hard-coded) at 
> the top:
>
>     if (count > 4)
>         pg_unreachable();

Agreed.

>
> This would have to change when we implement shrinking of nodes, but might 
> still be useful.
>
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
> +   return false;
> +
> + return true;
>
> Maybe just "return rt_node_search_leaf(...)" ?

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

