On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
> Since I started to improve Unicode Case, I used the same approach,
> essentially a binary search, only not by individual values, but by
> ranges.
I considered it a 4th approach because of the generated branches in
case_index(). case_index() is essentially a part of the data structure --
you can't navigate the table without it.

By the way, could we just represent those ranges as another tiny table
instead of representing them as branches? In other words, a table
something like:

  {{.start=0x0000,  .end=0x0588,  .offset=0},
   {.start=0x10A0,  .end=0x1100,  .offset=0x10A0 - 0x0588},
   ...
   {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}

How would that perform?

> I plan to use the same approach for Unicode Normalization. Acting
> sequentially and suggesting improvements.

> Unfortunately I'm new here and I can't say why one place uses one
> approach and another another - I just don't know.

We have 3 or 4 different solutions (naive binary search, range binary
search, perfect hashing, and radix tree) to 3 or 4 similar problems
(normalization, character properties, case mapping, and encoding
conversion). We should consolidate these approaches somehow.

IIUC, out of the solutions, either your modified binary search or the
radix tree is best, because both are fast and both allow compact tables.
My questions are: is there a clear winner between the radix tree and the
range binary search for all four problems? If not, what are the
trade-offs?

Regards,
	Jeff Davis