On Thu, Nov 23, 2023 at 5:34 AM Andres Freund <and...@anarazel.de> wrote:
> It's worth noting that the limited range of the input values means that > there's a lot of bias toward some bits being set ('a' to 'z' all start with > 0b011). We can take advantage of the limited range with a single additional instruction: After "ch |= 0x20", do "ch -= ('a' - 1)". That'll shrink letters and underscores to the range [1,31], which fits in 5 bits. (Other characters are much less common in a guc name). That increases randomness and allows 12 chars to be xor'd in before the first bits rotate around. > If, which I mildly doubt, we can't afford to call murmurhash32 for every > character, we could just call it for 32/5 input characters together. Or we > could just load up to 8 characters into an 64bit integer, can call > murmurhash64. I'll play around with this idea, as well. > The fact that string_hash() is slow due to the strlen(), which causes us to > process the input twice and which is optimized to also handle very long > strings which typically string_hash() doesn't encounter, seems problematic far > beyond this case. We use string_hash() in a *lot* of places, and that strlen() > does regularly show up in profiles. We should fix that. +1 > I think we ought to adjust our APIs around this: > 1) The accumulator state of the hash functions should be exposed, so one can > accumulate values into the hash state, without reducing the internal state > to a single 32/64 bit variable. If so, it might make sense to vendor a small, suitably licensed hash function that already has these APIs. While on the subject, it'd be good to have a clear separation between in-memory and on-disk usage, so we can make breaking changes in the former.