On Mon, 2023-12-04 at 12:12 +0700, John Naylor wrote: > That's a good thing to clear up. This thread has taken simplehash as > a > starting point from the very beginning. It initially showed no > improvement, and then we identified problems with the hashing and > equality computations. The latter seem like independently commitable > improvements, so I'm curious if they help on their own, even if we > still need to switch to simplehash as a last step to meet your > performance goals.
There's already a patch to use simplehash, and the API is a bit cleaner, and there's a minor performance improvement. It seems fairly non-controversial -- should I just proceed with that patch? > > If I understood what Andres was saying, the exposed hash state > > would be > > useful for writing a hash function like guc_name_hash(). > > From my point of view, it would at least be useful for C-strings, > where we don't have the length available up front. That's good news. By the way, is there any reason that we would need hash_bytes(s, strlen(s)) == cstring_hash(s)? > > Also, while the |= 0x20 is a nice trick for lowercasing, did we > > decide > > that it's better than my approach in patch 0004 here: > > > > https://www.postgresql.org/message-id/27a7a289d5b8f42e1b1e79b1bcaeef3a40583bd2.ca...@j-davis.com > > > > which optimizes exact hits (most GUC names are already folded) > > before > > trying case folding? > > Note there were two aspects there: hashing and equality. I > demonstrated in > > https://www.postgresql.org/message-id/CANWCAZbQ30O9j-bEZ_1zVCyKPpSjwbE4u19cSDDBJ%3DTYrHvPig%40mail.gmail.com > > ... in v4-0003 that the equality function can be optimized for > already-folded names (and in fact measured almost equally) using way, > way, way less code. Thinking in terms of API layers, there are two approaches: (a) make the hash and equality functions aware of the case-insensitivity, as we currently do; or (b) make it the caller's responsibility to do case folding, and the hash and equality functions are based on exact equality. Each approach has its own optimization techniques. In (a), we can use the |= 0x20 trick, and for equality do a memcmp() check first. In (b), the caller can first try lookup of the key in whatever form is provided, and only if that fails, case-fold it and try again. As a tangential point, we may eventually want to provide a more internationalized definition of "case insensitive" for GUC names. That would be slightly easier with (b) than with (a), but we can cross that bridge if and when we come to it. It seems you are moving toward (a) whereas my patches moved toward (b). I am fine with either approach but I wanted to clarify which approach we are using. In the abstract, I kind of like approach (b) because we don't need to be as special/clever with the hash functions. We would still want the faster hash for C-strings, but that's general and helps all callers. But you're right that it's more code, and that's not great. Regards, Jeff Davis