At Wed, 14 Oct 2020 23:06:28 -0400, Tom Lane <t...@sss.pgh.pa.us> wrote in
> John Naylor <john.nay...@enterprisedb.com> writes:
> > With those points in mind and thinking more broadly, I'd like to try harder
> > on recomposition. Even several times faster, recomposition is still orders
> > of magnitude slower than ICU, as measured by Daniel Verite [1].
>
> Huh.  Has anyone looked into how they do it?
I'm not sure this is exactly what you're after, but it seems to be something like the following. ICU uses separate tables for decomposition and composition, reached through a trie, and those tables are consulted only after trying algorithmic decomposition/composition for, for example, Hangul. I didn't look into it any further, but just for information, icu4c/source/common/normalizer2impl.cpp seems to do that, and icu4c/source/common/norm2_nfc_data.h defines the static data. icu4c/source/common/normalizer2impl.h:244 points to a design document on normalization:

http://site.icu-project.org/design/normalization/custom

> Old and New Implementation Details
>
> The old normalization data format (unorm.icu, ca. 2001..2009) uses
> three data structures for normalization: A trie for looking up 32-bit
> values for every code point, a 16-bit-unit array with decompositions
> and some other data, and a composition table (16-bit-unit array,
> linear search list per starter). The data is combined for all 4
> standard normalization forms: NFC, NFD, NFKC and NFKD.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
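
P.S. To make the quoted scheme concrete, here is a rough sketch in C of composing one starter/combining pair: algorithmic Hangul composition first, then a table fallback. This is not ICU's actual code; the Hangul constants are the ones from the Unicode standard, but compose_pair() and the CompPair/CompList per-starter list layout are made up for illustration (ICU keeps its composition lists in the 16-bit extra-data array that the trie values point into).

#include <stdint.h>

/* Hangul constants from the Unicode standard (chapter 3.12 / UAX #15) */
#define SBASE   0xAC00
#define LBASE   0x1100
#define VBASE   0x1161
#define TBASE   0x11A7
#define LCOUNT  19
#define VCOUNT  21
#define TCOUNT  28
#define NCOUNT  (VCOUNT * TCOUNT)       /* 588 */
#define SCOUNT  (LCOUNT * NCOUNT)       /* 11172 */

/* Hypothetical per-starter composition list, searched linearly. */
typedef struct
{
    uint32_t    combining;      /* second code point of the pair */
    uint32_t    composite;      /* resulting precomposed code point */
} CompPair;

typedef struct
{
    uint32_t    starter;        /* first code point of the pair */
    int         npairs;
    const CompPair *pairs;
} CompList;

/* Returns the composed code point, or 0 if the pair doesn't compose. */
static uint32_t
compose_pair(uint32_t first, uint32_t second,
             const CompList *lists, int nlists)
{
    int         i,
                j;

    /* 1. Algorithmic Hangul composition: L + V -> LV syllable */
    if (first >= LBASE && first < LBASE + LCOUNT &&
        second >= VBASE && second < VBASE + VCOUNT)
        return SBASE + ((first - LBASE) * VCOUNT + (second - VBASE)) * TCOUNT;

    /* 2. Algorithmic Hangul composition: LV + T -> LVT syllable */
    if (first >= SBASE && first < SBASE + SCOUNT &&
        (first - SBASE) % TCOUNT == 0 &&
        second > TBASE && second < TBASE + TCOUNT)
        return first + (second - TBASE);

    /* 3. Fall back to the (hypothetical) per-starter composition table */
    for (i = 0; i < nlists; i++)
    {
        if (lists[i].starter != first)
            continue;
        for (j = 0; j < lists[i].npairs; j++)
        {
            if (lists[i].pairs[j].combining == second)
                return lists[i].pairs[j].composite;
        }
        break;
    }
    return 0;                   /* no composition for this pair */
}

The point of the layout is that the common Hangul case never touches the table at all, and each table lookup only scans the (short) list of pairs for one starter instead of the whole composition data.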