Re: Unicode normalization SQL functions

John Naylor Thu, 02 Apr 2020 21:03:15 -0700

On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably faster than binary
> > search. Since v4 has left out the biggest tables entirely, I think
> > this might be worth a look for the smaller tables remaining.
> >
> > In the attached v5, when building the hash tables, we sort the code
> > points by NO/MAYBE, and store the index of the beginning of the NO
> > block:
>
> This is a valuable idea, but I fear it's a bit late now in this cycle.
> I have questions about some details.  For example, you mention that you
> had to fiddle with the hash seed.  How does that affect other users of
> PerfectHash?

They would still try the same multipliers they use now, so no effect on them.

> What happens when we update Unicode data and the hash
> doesn't work anymore?

The script would choose different multipliers and/or seeds
automatically. Only if you're unlucky would you have to fiddle with
the hash parameters again. The last-resort multipliers in the v2 patch
in the other thread [1] seem very effective and easily build both the
quick check D tables, which I tried for amusement's sake.

That said, we could reduce the chances of that happening this way:
After trying all the shift-and-add multipliers, we could add another
loop to try a bunch of numbers in a range. We'd need a quick check to
weed out multiples of small primes so the number has a decent chance
of being prime. To save time, just try a few seeds and move on to the
next number. Maybe emit a warning that it exhausted the shift-and-add
multipliers in case the developer wanted to intervene.

If I resubmit this, I would split the build up into two steps: have
the current manual script build the quick check array for later commit
into the tree, and build the hash function separately from that as a
Makefile distprep target. There's no reason to have the hash functions
checked in as I did in v5, like we don't check in the keyword hash
functions.

I would also consider splitting it into two patches:

1. Keep binary search but with a more abstract array representation
(think PG_RMGR). This patch would be large but mostly mechanical.
2. Switch to hash lookup. A smaller one for ease of review.

[1]
https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com

--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Unicode normalization SQL functions

Reply via email to