Hi Heikki,

On 31.01.2025 01:43, Heikki Linnakangas wrote:
> Did you consider using a radix tree? We use that method in
> src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's
> better or worse than what's proposed here, but it would seem like a
> more standard technique at least. Or if this is clearly better, then
> maybe we should switch to this technique in convutils.pm too. A
> perfect hash would be another alternative, we use that in
> src/common/unicode_norm.c.

I looked at the radix tree implementation, and judging by the number of
branches and arithmetic operations involved, I don't think a radix tree
would be faster than the proposed approach.
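
Just to show how I'm counting the operations, here is a generic
two-level radix lookup. The table names and the 8-bit split are
invented for illustration; this is not the layout generated by
convutils.pm:

    #include <stdint.h>

    extern const uint16_t radix_l1[];   /* indexed by cp >> 8 */
    extern const uint32_t radix_l2[];   /* blocks of 256 mapped codepoints */

    static inline uint32_t
    radix_lookup(uint32_t cp)
    {
        uint16_t    block;

        if (cp > 0x10FFFF)
            return 0;               /* outside the Unicode range */

        block = radix_l1[cp >> 8];  /* first level: high bits */
        if (block == 0)
            return 0;               /* no mappings in this block */

        /* second level: dependent load indexed by the low byte */
        return radix_l2[(uint32_t) block * 256 + (cp & 0xFF)];
    }

Every lookup pays for two dependent table loads plus the shifts, masks
and range checks around them.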

As for the perfect hash: the problem is that each entry has to store
the Unicode codepoint itself so that a lookup can verify the match.
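
In other words, each slot ends up looking roughly like this
(case_hash() and case_hash_table[] are placeholders, not actual
PostgreSQL symbols). A perfect hash is collision-free only for keys in
the set, so any other codepoint still hashes to *some* slot and the
stored key is needed to reject it:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct
    {
        uint32_t    codepoint;      /* the key, kept only for verification */
        uint32_t    lower;
        uint32_t    upper;
    } CaseHashEntry;

    extern const CaseHashEntry case_hash_table[];
    extern uint32_t case_hash(uint32_t cp);     /* placeholder perfect hash */

    static const CaseHashEntry *
    case_hash_lookup(uint32_t cp)
    {
        const CaseHashEntry *e = &case_hash_table[case_hash(cp)];

        return (e->codepoint == cp) ? e : NULL; /* verify the stored key */
    }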

Originally I set out to optimize the Unicode Normalization Forms in
Postgres, but I decided to "practice" on case mapping first so as not
to scare anyone with a big patch all at once. Ultimately I want to work
on Unicode in Postgres more broadly: optimizations and improvements.

> Did you check that these optimizations still win with Unicode version
> 16 (https://www.postgresql.org/message-id/146349e4-4687-4321-91af-f23557249...@eisentraut.org)?
> We haven't updated to that yet, but sooner or later we will.

Yes, everything works just as well with Unicode version 16 data.

> The way you're defining 'pg_unicode_case_index' as a function in the
> header file won't work. It needs to be a static inline function if
> it's in the header. Or put it in a .c file.

I agree, it needs to be moved to a .c file.
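
To spell out the two options (the body below is only a stand-in, not
the real pg_unicode_case_index()):

    #include <stdint.h>

    /*
     * Option 1: keep it in the header, but as static inline, so every
     * translation unit gets its own copy and there are no duplicate
     * symbols at link time.
     */
    static inline uint32_t
    pg_unicode_case_index(uint32_t cp)
    {
        return cp;              /* placeholder body */
    }

    /*
     * Option 2 (what I'll do): only declare it in the header ...
     *
     *     extern uint32_t pg_unicode_case_index(uint32_t cp);
     *
     * ... and define it once in a .c file.
     */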

> Some ideas on how to squeeze this further:
>
> - Instead of having one table that contains Lower/Title/Upper/Fold for
> every character, it might be better to have four separate tables. I
> think that would be more cache-friendly: you typically run one of the
> functions for many different characters in a loop, rather than all of
> the functions for the same character. You could deduplicate between
> the tables too: for many ranges of characters, Title=Upper and
> Lower=Fold.

I'll try to experiment with that; theoretically, performance should
improve.
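
As a starting point for the experiment, something along these lines
(names and layout are hypothetical, not what the patch currently does):
each range points at per-function blocks, and where Title == Upper or
Fold == Lower the generator can simply point two members at the same
block.

    #include <stdint.h>

    typedef struct
    {
        uint32_t    first;          /* first codepoint of the range */
        uint32_t    last;           /* last codepoint of the range */
        const uint32_t *lower;      /* mapped codepoints, indexed by cp - first */
        const uint32_t *title;      /* often the same block as 'upper' */
        const uint32_t *upper;
        const uint32_t *fold;       /* often the same block as 'lower' */
    } CaseRange;

    static inline uint32_t
    range_to_lower(const CaseRange *r, uint32_t cp)
    {
        return r->lower[cp - r->first];
    }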

> - The characters are stored as 4-byte integers, but the high byte is
> always 0. Could squeeze those out. Not sure if that would be a win if
> it makes the accesses unaligned, but you could benchmark that.
> Alternatively, use that empty byte to store the 'special_case' index,
> instead of having a separate field for it.

I thought about that, but it seems it would make the code much more
complicated, and we wouldn't gain much.
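
For reference, here is the high-byte packing spelled out (the macro
names are made up): Unicode codepoints need at most 21 bits, so the top
byte of a uint32 mapping entry is free to carry a special_case index
without growing the entry.

    #include <stdint.h>

    #define CASEMAP_CP_MASK         0x00FFFFFFU /* low 24 bits: mapped codepoint */
    #define CASEMAP_SPECIAL_SHIFT   24          /* high byte: special_case index */

    #define CASEMAP_PACK(cp, special) \
        (((uint32_t) (special) << CASEMAP_SPECIAL_SHIFT) | ((cp) & CASEMAP_CP_MASK))

    static inline uint32_t
    casemap_codepoint(uint32_t entry)
    {
        return entry & CASEMAP_CP_MASK;
    }

    static inline uint8_t
    casemap_special(uint32_t entry)
    {
        return (uint8_t) (entry >> CASEMAP_SPECIAL_SHIFT);
    }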

> - Many characters that have a special case only need the special case
> for some of the functions, not all. If you stored the special_case
> separately for each function (as the high byte in the 'simplemap'
> field perhaps, like I suggested on previous point), you could avoid
> having those "dummy" special cases.

That's a good idea. The main table would then shrink by uint8 * n (one
byte per entry).
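
Roughly, combining this with the high-byte packing above (again a
hypothetical layout, not the current patch): each of the four simple
mappings carries its own special-case index in its high byte, so a
character that needs a special case only for upper() no longer forces
dummy special cases on the other functions, and the separate
special_case field disappears.

    #include <stdint.h>

    typedef struct
    {
        uint32_t    lower;      /* packed: special-case index | codepoint */
        uint32_t    title;
        uint32_t    upper;
        uint32_t    fold;
    } pg_case_map_entry;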

Thanks, after the weekend I'll send an updated patch that takes the
comments and advice into account.

--
SberTech
Alexander Borisov