Hi,

Attached is a patch to use perfect hashing to speed up Unicode normalization quick check.

0001 changes the set of multipliers attempted when generating the hash function. The set in HEAD works for the current set of NFC codepoints, but not for the other types. Also, the updated multipliers now all compile to shift-and-add on most platform/compiler combinations available on godbolt.org (earlier experiments found in [1]). The existing keyword lists are fine with the new set, and don't seem to be very picky in general. As a test, it also successfully finds a function for the OS "words" file, the "D" sets of codepoints, and for sets of the first n built-in OIDs, where n > 5.

0002 builds on top of the existing normprops infrastructure to use a hash function for NFC quick check. Below are typical numbers in a non-assert build:

select count(*) from (select md5(i::text) as t from generate_series(1,100000) as i) s where t is nfc normalized;

HEAD   411ms  413ms  409ms
patch  296ms  297ms  299ms

The addition of "const" was to silence a compiler warning. Also, I changed the formatting of the output file slightly to match pgindent.

0003 uses hashing for NFKC and removes binary search. This is split out for readability. I gather NFKC is a less common use case, so this could technically be left out. Since this set is larger, the performance gains are a bit larger as well, at the cost of 19kB of binary space:

HEAD   439ms  440ms  442ms
patch  299ms  301ms  301ms

I'll add this to the July commitfest.

[1] https://www.postgresql.org/message-id/cacpnzcuvtilhxazxp9ucehguyhma59h6_pmp+_w-szxg0uy...@mail.gmail.com

--
John Naylor
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

v1-0001-Tweak-the-set-of-candidate-multipliers-for-genera.patch
v1-0002-Use-perfect-hashing-for-NFC-Unicode-normalization.patch
v1-0003-Use-perfect-hashing-for-NFKC-Unicode-normalizatio.patch
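
For readers unfamiliar with the technique, here is a minimal sketch of the lookup idea behind using a perfect hash for a quick check. It is not the code in the patches and not the generator's algorithm. The key set, the table size, the single multiplier 31, and all the toy_* names are assumptions made up for illustration, and the table is filled at run time here only to keep the example self-contained. The points it tries to show are that a small multiplier can be written as a shift plus an add or subtract, and that a lookup is one probe followed by one comparison, since a codepoint outside the set still hashes to some slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 8            /* power of two, so a mask replaces modulo */
#define EMPTY  0xFFFFFFFFu  /* sentinel; larger than any valid codepoint */

/* Hypothetical key set for illustration only, not a real quick-check table. */
static const uint32_t toy_keys[] = {
    0x0340, 0x0341, 0x0343, 0x0344, 0x037E, 0x0387
};
#define NKEYS (sizeof(toy_keys) / sizeof(toy_keys[0]))

static uint32_t toy_slots[NSLOTS];

/* Multiply by 31, written as shift-and-subtract, then keep the low bits. */
static uint32_t
toy_hash(uint32_t cp)
{
    uint32_t h = (cp << 5) - cp;    /* equals cp * 31 in uint32_t arithmetic */

    return h & (NSLOTS - 1);
}

/* Fill the table. Returns false if two keys collide (hash is not perfect). */
static bool
toy_build(void)
{
    assert(NKEYS <= NSLOTS);

    for (size_t i = 0; i < NSLOTS; i++)
        toy_slots[i] = EMPTY;

    for (size_t i = 0; i < NKEYS; i++)
    {
        uint32_t slot = toy_hash(toy_keys[i]);

        if (toy_slots[slot] != EMPTY)
            return false;           /* collision: try another multiplier */
        toy_slots[slot] = toy_keys[i];
    }
    return true;
}

/*
 * One probe, one comparison. The comparison is required because any
 * codepoint, member or not, hashes to some slot. Assumes cp is a valid
 * codepoint, so it can never equal EMPTY.
 */
static bool
toy_is_member(uint32_t cp)
{
    return toy_slots[toy_hash(cp)] == cp;
}
```

After toy_build() succeeds, toy_is_member() does a constant amount of work per codepoint, in contrast to a binary search whose cost grows with the logarithm of the table size. If toy_build() returned false for some key set, a generator would move on to the next candidate multiplier.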