On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit. 
> > 
> > Done.
> > 
> > I combined your patches and mine into the attached v10 series.
> 
> Here's v12 after committing a few of the earlier patches.

I collected some performance numbers for a worst case on UTF8. This is
where each row is about a million characters wide and every character is
greater than MAX_SIMPLE_CHAR (U+07FF):

  create table wide (t text);
  insert into wide
    select repeat('カ', 1048576)
    from generate_series(1,1000) g;

  select 1 from wide where t ~ '([[:punct:]]|[[:lower:]])'
    collate "the_collation";

results:
                   master     patched
  C                  3736        3589
  pg_c_utf8         19500       23404
  en_US             10251       12396
  en-US-x-icu       10264       11963


And a separate test for ILIKE on en_US.iso885915 where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:

  create table sb (t text);
  insert into sb
    select repeat('É', 1048576)
    from generate_series(1, 3000) g;

  select 1 from sb where t ilike '%á%';

results:

                   master     patched
  C                  2900        2812
  en_US              2203        3702
  en-US-x-icu       17483       18123


The numbers from both tests show a slowdown for every collation except
C. The worst case is probably tolower() for libc in LATIN9 (en_US in the
ILIKE test, 2203 vs. 3702): libc's tolower() appears to be heavily
optimized there, so the extra indirection for a method call slows
things down quite a bit.
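
To make the indirection concrete, here is a minimal sketch of the two
call shapes being compared. It is not the code from the patch series:
ctype_methods_sketch, libc_char_tolower, lower_direct and
lower_via_methods are hypothetical names, and the real method table has
more members. The only point is that the second loop reaches tolower()
through a function pointer and a wrapper on every byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical method table; a stand-in, not the struct from the patches. */
typedef struct ctype_methods_sketch
{
	int			(*char_tolower) (int c);
} ctype_methods_sketch;

/* Wrapper so that libc's tolower() can be reached through the table. */
static int
libc_char_tolower(int c)
{
	return tolower(c);
}

static const ctype_methods_sketch libc_methods_sketch = {
	.char_tolower = libc_char_tolower
};

/* Direct style: the call site invokes tolower() itself for each byte. */
static void
lower_direct(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char) tolower(buf[i]);
}

/* Method style: each byte costs one indirect call plus the wrapper. */
static void
lower_via_methods(const ctype_methods_sketch *m,
				  unsigned char *buf, size_t len)
{
	assert(m != NULL && m->char_tolower != NULL);
	for (size_t i = 0; i < len; i++)
		buf[i] = (unsigned char) m->char_tolower(buf[i]);
}
```

Both loops produce the same bytes; whether the compiler can inline or
otherwise optimize the direct call is what differs, and that is
presumably where the time goes.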

This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great. Are the above numbers bad enough that we need
to give up on this method-ization approach? Or should we say that the
above cases don't represent reality, and a moderate regression there is
OK?

Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.
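
As a rough illustration of what such a cache could look like for the
single-byte case only, here is a sketch. Everything in it is an
assumption for discussion: sb_lower_cache_sketch, sb_lower_cache_build
and sb_lower_cached are hypothetical names, it caches only the lowercase
mapping, and it does not address UTF8, where the range of code points is
far too large for a flat table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-locale cache of lowercase mappings for all 256 bytes. */
typedef struct sb_lower_cache_sketch
{
	bool		valid;
	unsigned char lower[256];
} sb_lower_cache_sketch;

/*
 * Fill the cache once, paying the method-call cost 256 times instead of
 * once per input byte.  char_tolower stands in for whatever method the
 * locale provider supplies.
 */
static void
sb_lower_cache_build(sb_lower_cache_sketch *cache,
					 int (*char_tolower) (int))
{
	for (int c = 0; c < 256; c++)
		cache->lower[c] = (unsigned char) char_tolower(c);
	cache->valid = true;
}

/* Hot path: a table lookup per byte, no function call at all. */
static void
sb_lower_cached(const sb_lower_cache_sketch *cache,
				unsigned char *buf, size_t len)
{
	assert(cache->valid);
	for (size_t i = 0; i < len; i++)
		buf[i] = cache->lower[buf[i]];
}
```

The cost is an extra 256 bytes per cached locale plus the logic to build
and invalidate the table, which is the added complexity in question.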

Regards,
        Jeff Davis


