On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit.
> >
> > Done.
> >
> > I combined your patches and mine into the attached v10 series.
>
> Here's v12 after committing a few of the earlier patches.

I collected some performance numbers for a worst case on UTF8. Each row
is about a million characters wide, and every character is greater than
MAX_SIMPLE_CHAR (U+07FF):

  create table wide (t text);
  insert into wide
    select repeat('カ', 1048576) from generate_series(1,1000) g;
  select 1 from wide
    where t ~ '([[:punct:]]|[[:lower:]])' collate "the_collation";

results:

                 master   patched
  C                3736      3589
  pg_c_utf8       19500     23404
  en_US           10251     12396
  en-US-x-icu     10264     11963

And a separate test for ILIKE on en_US.iso885915, where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:

  create table sb (t text);
  insert into sb
    select repeat('É', 1048576) from generate_series(1, 3000) g;
  select 1 from sb where t ilike '%á%';

results:

                 master   patched
  C                2900      2812
  en_US            2203      3702
  en-US-x-icu     17483     18123

The numbers from both tests show a slowdown for every collation other
than C. The worst one is probably tolower() for libc in LATIN9, which
appears to be heavily optimized, so the extra indirection for a method
call slows things down quite a bit.

This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great.

Are the above numbers bad enough that we need to give up on this
method-ization approach? Or should we say that the above cases don't
represent reality, and a moderate regression there is OK?

Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.

Regards,
	Jeff Davis