On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote: > John Naylor <jcnay...@gmail.com> writes: > > On 1/3/19, Joerg Sonnenberger <jo...@bec.de> wrote: > >> I was pointed at your patch on IRC and decided to look into adding my > >> own pieces. What I can provide you is a fast perfect hash function > >> generator. I've attached a sample hash function based on the current > >> main keyword list. hash() essentially gives you the number of the only > >> possible match, a final strcmp/memcmp is still necessary to verify that > >> it is an actual keyword though. The |0x20 can be dropped if all cases > >> have pre-lower-cased the input already. This would replace the binary > >> search in the lookup functions. Returning offsets directly would be easy > >> as well. That allows writing a single string where each entry is prefixed > >> with a type mask, the token id, the length of the keyword and the actual > >> keyword text. Does that sound useful to you? > > > Judging by previous responses, there is still interest in using > > perfect hash functions, so thanks for this. I'm not knowledgeable > > enough to judge its implementation, so I'll leave that for others. > > We haven't actually seen the implementation, so it's hard to judge ;-).
It's a temporary hacked version of nbperf in the NetBSD tree. > The sample hash function certainly looks great. I'm not terribly on board > with using |0x20 as a substitute for lower-casing, but that's a minor > detail. Yeah, I've included that part more because I don't know the current use cases enough. If all instances are already doing lower-casing in advance, it is trivial to drop. > The foremost questions in my mind are: > > * What's the generator written in? (if the answer's not "Perl", wedging > it into our build processes might be painful) Plain C, nothing really fancy in it. > * What license is it under? Two clause BSD license. > * Does it always suceed in producing a single-level lookup table? This question is a bit tricky. The short answer is: yes. The longer answer: The choosen hash function in the example is very simple (e.g. just two variations of DJB-style hash), so with that: no, not without potentially fiddling a bit with the hash function if things ever get nasty like having two keywords that hit a funnel for both variants. The main concern for the choice was to be fast. When using two families of independent hash functions, the generator requires a probalistic linear time in the number of keys. That means with a strong enough hash function like the Jenkins hash used in PG elsewhere, it will succeed very fast. So if it fails on new keywords, making the mixing a bit stronger should be enough. Joerg