On Tue, Mar 5, 2024 at 3:28 PM Peter Smith <smithpb2...@gmail.com> wrote: > > Hi, here are some review comments for v7-0002 > > ====== > Commit Message > > 1. > This commit adds a hash table to binaryheap in order to track of > positions of each nodes in the binaryheap. That way, by using newly > added functions such as binaryheap_update_up() etc., both updating a > key and removing a node can be done in O(1) on an average and O(log n) > in worst case. This is known as the indexed binary heap. The caller > can specify to use the indexed binaryheap by passing indexed = true. > > ~ > > /to track of positions of each nodes/to track the position of each node/ > > ~~~ > > 2. > There is no user of it but it will be used by a upcoming patch. > > ~ > > The current code does not use the new indexing logic, but it will be > used by an upcoming patch.
Fixed. > > ====== > src/common/binaryheap.c > > 3. > +/* > + * Define parameters for hash table code generation. The interface is *also*" > + * declared in binaryheaph.h (to generate the types, which are externally > + * visible). > + */ > > Typo: *also*" Fixed. > > ~~~ > > 4. > +#define SH_PREFIX bh_nodeidx > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > +#define SH_KEY_TYPE bh_node_type > +#define SH_KEY key > +#define SH_HASH_KEY(tb, key) \ > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > +#define SH_SCOPE extern > +#ifdef FRONTEND > +#define SH_RAW_ALLOCATOR pg_malloc0 > +#endif > +#define SH_DEFINE > +#include "lib/simplehash.h" > > 4a. > The comment in simplehash.h says > * The following parameters are only relevant when SH_DEFINE is defined: > * - SH_KEY - ... > * - SH_EQUAL(table, a, b) - ... > * - SH_HASH_KEY(table, key) - ... > * - SH_STORE_HASH - ... > * - SH_GET_HASH(tb, a) - ... > > So maybe it is nicer to reorder the #defines in that same order? > > SUGGESTION: > +#define SH_PREFIX bh_nodeidx > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > +#define SH_KEY_TYPE bh_node_type > +#define SH_SCOPE extern > +#ifdef FRONTEND > +#define SH_RAW_ALLOCATOR pg_malloc0 > +#endif > > +#define SH_DEFINE > +#define SH_KEY key > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > +#define SH_HASH_KEY(tb, key) \ > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > +#include "lib/simplehash.h" I'm really not sure it helps increase readability. For instance, for me it's readable if SH_DEFINE and SH_DECLARE come to the last before #include since it's more obvious whether we want to declare, define or both. Other simplehash.h users also do so. > > ~~ > > 4b. > The comment in simplehash.h says that "it's preferable, if possible, > to store the element's hash in the element's data type", so should > SH_STORE_HASH and SH_GET_HASH also be defined here? Good catch. I've used these macros. > > ~~~ > > 5. > + * > + * If 'indexed' is true, we create a hash table to track of each node's > + * index in the heap, enabling to perform some operations such as removing > + * the node from the heap. > */ > binaryheap * > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > + bool indexed, void *arg) > > BEFORE > ... enabling to perform some operations such as removing the node from the > heap. > > SUGGESTION > ... to help make operations such as removing nodes more efficient. > But these operations literally require the indexed binary heap as we have an assertion: void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) { bh_nodeidx_entry *ent; Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); Assert(heap->bh_indexed); > ~~~ > > 6. > + heap->bh_indexed = indexed; > + if (heap->bh_indexed) > + { > +#ifdef FRONTEND > + heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL); > +#else > + heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity, > + NULL); > +#endif > + } > + > > The heap allocation just uses palloc instead of palloc0 so it might be > better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will > never get a situation where bh_indexed is false but bh_nodeidx has > some (garbage) value. Fixed. > > ~~~ > > 7. > +/* > + * Set the given node at the 'index', and updates its position accordingly. > + * > + * Return true if the node's index is already tracked. > + */ > +static bool > +bh_set_node(binaryheap *heap, bh_node_type node, int index) > > 7a. > I felt the 1st sentence should be more like: > > SUGGESTION > Set the given node at the 'index' and track it if required. Fixed. > > ~ > > 7b. > IMO the parameters would be better the other way around (e.g. 'index' > before the 'node') because that's what the assignments look like: > > > heap->bh_nodes[heap->bh_size] = d; > > becomes: > bh_set_node(heap, heap->bh_size, d); > I think it assumes heap->bh_nodes is an array. But if we change it in the future, it will no longer make sense. I think it would make more sense if we define the parameters in an order like "we set the 'node' at 'index'". What do you think? > ~~~ > > 8. > +static bool > +bh_set_node(binaryheap *heap, bh_node_type node, int index) > +{ > + bh_nodeidx_entry *ent; > + bool found = false; > + > + /* Set the node to the nodes array */ > + heap->bh_nodes[index] = node; > + > + if (heap->bh_indexed) > + { > + /* Remember its index in the nodes array */ > + ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found); > + ent->idx = index; > + } > + > + return found; > +} > > 8a. > That 'ent' declaration can be moved to the inner block scope, so it is > closer to where it is needed. > > ~ > > 8b. > + /* Remember its index in the nodes array */ > > The comment is worded a bit ambiguously. IMO a simpler comment would > be: "/* Keep track of the node index. */" > > ~~~ Fixed. > > 9. > +static void > +bh_delete_nodeidx(binaryheap *heap, bh_node_type node) > +{ > + if (!heap->bh_indexed) > + return; > + > + (void) bh_nodeidx_delete(heap->bh_nodeidx, node); > +} > > Since there is only 1 statement IMO it is simpler to write this > function like below: > > if (heap->bh_indexed) > (void) bh_nodeidx_delete(heap->bh_nodeidx, node); Fixed. > > ~~~ > > 10. > +/* > + * Replace the node at 'idx' with the given node 'replaced_by'. Also > + * update their positions accordingly. > + */ > +static void > +bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by) > > 10a. > Would 'node' or 'new_node' or 'replacement' be a better name than > 'replaced_by'? Fixed. > > ~ > > 10b. > I noticed that the index param is called 'idx' here but in other > functions, it is called 'index'. I think either is good (I prefer > 'idx') but at least everywhere should use the same name for > consistency. Fixed. > > ~~~ > > 11. > +static void > +bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by) > +{ > + /* Remove overwritten node's index */ > + bh_delete_nodeidx(heap, heap->bh_nodes[idx]); > + > + /* Replace it with the given new node */ > + if (idx < heap->bh_size) > + { > + bool found PG_USED_FOR_ASSERTS_ONLY; > + > + found = bh_set_node(heap, replaced_by, idx); > + > + /* The overwritten node's index must already be tracked */ > + Assert(!heap->bh_indexed || found); > + } > +} > > I did not understand the condition. > e.g. Can you explain when is idx NOT less than heap->bh_size? > e.g. If this condition failed then nothing gets replaced (??) It was for a case like where we call binaryheap_remote_node(heap, 0) where the heap has only one entry, resulting in setting the root node again. I updated the bh_replace_node() to return if the node doesn't not need to be moved. > > ~~~ > > ====== > src/include/lib/binaryheap.h > > 12. > +/* > + * Struct for A hash table element to store the node's index in the bh_nodes > + * array. > + */ > +typedef struct bh_nodeidx_entry > > /for A hash table/for a hash table/ > > ~~~ > > 13. > +/* define parameters necessary to generate the hash table interface */ > > Suggest uppercase "Define" and add a period. Fixed. > > ~~~ > > 14. > + > + /* > + * If bh_indexed is true, the bh_nodeidx is used to track of each node's > + * index in bh_nodes. This enables the caller to perform > + * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n). > + */ > + bool bh_indexed; > + bh_nodeidx_hash *bh_nodeidx; > } binaryheap; > > I'm wondering why the separate 'bh_indexed' is necessary at all. Can't > you just use the bh_nodeidx value? E.g. If bh_nodeidx == NULL then it > means there is no index tracking, otherwise there is. > Good point. I added a macro binaryheap_indexed() to check it for better readability. The above comments are incorporated into the latest v8 patch set that I've just submitted[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoBYjJmz7q_%3DZ%2BeXJgm0FScyu3_iGFshPAvnq78B2KL3qQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com