speed up unicode normalization quick check

John Naylor Thu, 21 May 2020 00:13:09 -0700

Hi,

Attached is a patch to use perfect hashing to speed up Unicode
normalization quick check.


0001 changes the set of multipliers attempted when generating the hash
function. The set in HEAD works for the current set of NFC codepoints,
but not for the other normalization forms. Also, the updated multipliers
now all compile to shift-and-add on most platform/compiler combinations
available on godbolt.org (see the earlier experiments in [1]). The
existing keyword lists work fine with the new set, and don't seem to be
very picky in general. As a test, the generator also successfully finds a
function for the OS "words" file, the "D" sets of codepoints, and
sets of the first n built-in OIDs, where n > 5.
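
As a rough illustration of what "compile to shift-and-add" means: a
multiplier of the form 2^k + 1 or 2^k - 1 can be computed with one shift
plus one add or subtract, which is what a compiler will typically emit.
The values 31 and 17 below are examples only (this email does not list
the actual multipliers), and the function names are hypothetical.

```c
#include <stdint.h>

/* Multiplication by 31 as it would appear in the hash function source. */
static inline uint32_t
mul31(uint32_t h)
{
	return h * 31;
}

/* The same product as one shift and one subtract: 31 = 32 - 1. */
static inline uint32_t
mul31_shift(uint32_t h)
{
	return (h << 5) - h;
}

/* Multiplication by 17 in source form. */
static inline uint32_t
mul17(uint32_t h)
{
	return h * 17;
}

/* The same product as one shift and one add: 17 = 16 + 1. */
static inline uint32_t
mul17_shift(uint32_t h)
{
	return (h << 4) + h;
}
```

Both forms agree for every 32-bit input, since unsigned arithmetic wraps
modulo 2^32 in either case.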

0002 builds on top of the existing normprops infrastructure to use a
hash function for NFC quick check. Below are typical numbers in a
non-assert build:

select count(*) from (select md5(i::text) as t from
generate_series(1,100000) as i) s where t is nfc normalized;

HEAD  411ms 413ms 409ms
patch 296ms 297ms 299ms

The addition of "const" was to silence a compiler warning. Also, I
changed the formatting of the output file slightly to match pgindent.
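
To make the idea concrete, here is a minimal sketch of a hash-based
membership test for a handful of codepoints. It is deliberately simpler
than what the patch does: it uses a single hash function and builds the
table at run time, whereas the real tables are generated at build time.
The sample keys, the candidate multipliers, the table size, and all
function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NKEYS		6
#define TABLE_SIZE	11				/* prime, a little larger than NKEYS */
#define EMPTY_SLOT	0xFFFFFFFFu		/* not a valid codepoint */

/* Small hand-picked sample of codepoints with NFC_Quick_Check = No. */
static const uint32_t sample_keys[NKEYS] = {
	0x0340, 0x0341, 0x0958, 0x0F43, 0x2126, 0x212B
};

/* Illustrative candidates of the form 2^k +/- 1; not the patch's list. */
static const uint32_t candidate_mults[] = {17, 31, 65, 127};

static uint32_t slots[TABLE_SIZE];
static uint32_t chosen_mult;

/* Hash the four bytes of a codepoint, most significant byte first. */
static uint32_t
hash_codepoint(uint32_t cp, uint32_t mult)
{
	uint32_t	h = 0;

	for (int shift = 24; shift >= 0; shift -= 8)
		h = h * mult + ((cp >> shift) & 0xFF);
	return h % TABLE_SIZE;
}

/* Try each multiplier until every key lands in a slot of its own. */
static bool
build_table(void)
{
	size_t		ncand = sizeof(candidate_mults) / sizeof(candidate_mults[0]);

	for (size_t m = 0; m < ncand; m++)
	{
		bool		ok = true;

		for (int i = 0; i < TABLE_SIZE; i++)
			slots[i] = EMPTY_SLOT;

		for (int i = 0; i < NKEYS && ok; i++)
		{
			uint32_t	s = hash_codepoint(sample_keys[i], candidate_mults[m]);

			if (slots[s] != EMPTY_SLOT)
				ok = false;		/* collision: reject this multiplier */
			else
				slots[s] = sample_keys[i];
		}

		if (ok)
		{
			chosen_mult = candidate_mults[m];
			return true;
		}
	}
	return false;				/* no candidate gave a collision-free table */
}

/* One hash and one comparison, instead of a binary search. */
static bool
in_sample_set(uint32_t cp)
{
	return slots[hash_codepoint(cp, chosen_mult)] == cp;
}
```

With this sample, the search rejects 17 and 31 because two keys collide
and settles on 65. After build_table() succeeds, in_sample_set(0x0340)
is true and in_sample_set(0x0041) is false. The final comparison against
the stored key is what makes the lookup safe for codepoints outside the
set, which hash to an arbitrary slot.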

0003 uses hashing for NFKC and removes binary search. This is split
out for readability. I gather NFKC is a less common use case, so this
could technically be left out. Since this set is larger, the
performance gains are a bit larger as well, at the cost of 19kB of
binary space:

HEAD  439ms 440ms 442ms
patch 299ms 301ms 301ms

I'll add this to the July commitfest.

[1] https://www.postgresql.org/message-id/cacpnzcuvtilhxazxp9ucehguyhma59h6_pmp+_w-szxg0uy...@mail.gmail.com

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v1-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch
Description: Binary data

v1-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch
Description: Binary data

v1-0003-Use-perfect-hashing-for-NFKC-Unicode-normalizatio.patch
Description: Binary data
