On Thu, Jun 8, 2023, at 12:19, Tomas Vondra wrote: > Would you be interested in helping with / working on some of that? I > don't have immediate need for this stuff, so it's not very high on my > TODO list.
Sure, I'm willing to help! I've attached a patch that works on some of the items on your list, including some additions to the README.md. There were a bunch of places where `maxelements / 8` caused bugs, that had to be changed to do proper integer ceiling division: - values = (int32 *) (set->data + set->maxelements / 8); + values = (int32 *) (set->data + (set->maxelements + 7) / 8); Side note: I wonder if it would be good to add CEIL_DIV and FLOOR_DIV macros to the PostgreSQL source code in general, since it's easy to make this mistake, and quite verbose/error-prone to write it out manually everywhere. Such macros could simplify code in e.g. numeric.c. > There's a bunch of stuff that needs to be improved to make this properly > usable, like: > > 1) better hash table implementation TODO > 2) input/output functions I've attempted to implement these. I thought comma separated values wrapped around curly braces felt as the most natural format, example: SELECT '{1,2,3}'::hashset; > 3) support for other types (now it only works with int32) TODO > 4) I wonder if this might be done as an array-like polymorphic type. That would be nice! I guess the work-around would be to store the actual value of non-int type in a lookup table, and then hash the int-based primary key in such table. Do you think later implementing polymorphic type support would mean a more or less complete rewrite, or can we carry on with int32-support and add it later on? > 5) more efficient storage format, with versioning etc. TODO > 6) regression tests I've added some regression tests. > Right. IMHO the query language is a separate thing, you still need to > evaluate the query somehow - which is where hashset applies. Good point, I fully agree. /Joel
hashset-1.0.0-joel-0001.patch
Description: Binary data