On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote: > > On 2020-03-26 18:41, John Naylor wrote: > > We don't have a trie implementation in Postgres, but we do have a > > perfect hash implementation. Doing that would bring the tables back to > > 64 bits per entry, but would likely be noticeably faster than binary > > search. Since v4 has left out the biggest tables entirely, I think > > this might be worth a look for the smaller tables remaining. > > > > In the attached v5, when building the hash tables, we sort the code > > points by NO/MAYBE, and store the index of the beginning of the NO > > block: > > This is a valuable idea, but I fear it's a bit late now in this cycle. > I have questions about some details. For example, you mention that you > had to fiddle with the hash seed. How does that affect other users of > PerfectHash?
They would still try the same multipliers they use now, so no effect on them. > What happens when we update Unicode data and the hash > doesn't work anymore? The script would choose different multipliers and/or seeds automatically. Only if you're unlucky would you have to fiddle with the hash parameters again. The last-resort multipliers in the v2 patch in the other thread [1] seem very effective and easily build both the quick check D tables, which I tried for amusement's sake. That said, we could reduce the chances of that happening this way: After trying all the shift-and-add multipliers, we could add another loop to try a bunch of numbers in a range. We'd need a quick check to weed out multiples of small primes so the number has a decent chance of being prime. To save time, just try a few seeds and move on to the next number. Maybe emit a warning that it exhausted the shift-and-add multipliers in case the developer wanted to intervene. If I resubmit this, I would split the build up into two steps: have the current manual script build the quick check array for later commit into the tree, and build the hash function separately from that as a Makefile distprep target. There's no reason to have the hash functions checked in as I did in v5, like we don't check in the keyword hash functions. I would also consider splitting it into two patches: 1. Keep binary search but with a more abstract array representation (think PG_RMGR). This patch would be large but mostly mechanical. 2. Switch to hash lookup. A smaller one for ease of review. [1] https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services