On 16 August 2024 23:11:18 BST, Nick Lockheart <li...@ageofdream.com> wrote:
>I used the rather broad title "Should All String Functions Become
>Multi-Byte Safe" because there are many smaller related topics, but my
>intention was to discuss multi-byte in general

I think it was probably not the best choice, because it seems like what you're 
specifically interested in is mostly not about existing functions, and not 
particularly about encodings being more than one byte wide.

For instance, even good old 7-bit-per-character ASCII contains control 
characters you might want help sanitising out; and plenty of 
8-bit-per-character encodings include more than one script, even more than one 
writing direction (e.g. ISO 8859-8 Latin/Hebrew).

But the specific topic of safe input handling is definitely an interesting 
one. And focussing on Unicode, rather than every possible encoding (multibyte 
or not), makes sense in modern usage.


>There's a lot of potential pitfalls for dealing with Unicode input, and
>there are some best practices per the Unicode Consortium

It's worth looking into whether the ICU library has explicit functions to help 
with those recommendations (if you can navigate its slightly patchy 
documentation). Since most of ext/intl is just a thin wrapper on that library, 
that could make our lives a lot easier.


>For example, there should be a function that removes unassigned code
>points.
>
>There should also be a function that removes "scripts" (as defined by
>Unicode).
>
>We should have an easy way to remove private use code points (unless
>you're running a Star Trek fan site and really do need Klingon).

These all seem like good ideas. I think you can do at least some of it with 
regular expressions, but dedicated functions have the potential to be both 
easier to use and more efficient.
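
As a rough sketch of the regular-expression route: PCRE's Unicode property 
escapes can match general categories such as Cn (unassigned) and Co (private 
use), as well as script names such as Cyrillic. The function names below are 
made up for illustration; they are not existing or proposed APIs. Note that 
which code points count as "unassigned" depends on the Unicode version PCRE 
was built against.

```php
<?php
// Hypothetical helpers for illustration only; not part of PHP or any extension.

// Remove unassigned (Cn) and private-use (Co) code points from a UTF-8 string.
function strip_unassigned_and_private(string $text): string
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('/[\p{Cn}\p{Co}]+/u', '', $text);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// Remove every code point assigned to one Unicode script, e.g. "Cyrillic".
function strip_script(string $text, string $script): string
{
    // Only allow plain property names, so the pattern cannot be injected into.
    if (preg_match('/^[A-Za-z_]+$/', $script) !== 1) {
        throw new InvalidArgumentException('Invalid script name');
    }
    // An unknown script name fails to compile; @ hides the warning and the
    // null return value is turned into an exception instead.
    $result = @preg_replace('/\p{' . $script . '}+/u', '', $text);
    if ($result === null) {
        throw new InvalidArgumentException('Unknown script or invalid UTF-8');
    }
    return $result;
}
```

Something like strip_script($input, 'Cyrillic') works, but the caller has to 
know the PCRE property names and handle the error cases by hand, which is 
where a dedicated function could be easier to use.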


>And the default replacement character for `mb_scrub` shouldn't be `?`.

This is trickier, and where mixing the terms "multibyte" and "Unicode" actually 
matters. The mbstring extension supports a number of different text encodings, 
most of which don't have a dedicated replacement character to use. It also has 
the ability to set the default in global state with mb_substitute_character(), 
so it's not immediately obvious how a different default could be applied based 
on the specified encoding. (I'm not a fan of that API design, but it's what 
we've got!)
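
To illustrate the global-state problem, here is a minimal sketch of what a 
caller has to do today to scrub one string with U+FFFD without changing the 
setting for everything else. The scrub_with() name is hypothetical, and the 
sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical wrapper for illustration; not an existing or proposed API.
// Scrubs $text using $substitute (a Unicode code point) for invalid bytes,
// then restores whatever substitute setting was in effect before the call.
function scrub_with(string $text, string $encoding, int $substitute): string
{
    // With no argument, this returns the current setting: either an int
    // code point or one of the strings "none", "long" or "entity".
    $previous = mb_substitute_character();

    mb_substitute_character($substitute);
    try {
        return mb_scrub($text, $encoding);
    } finally {
        // Always put the global setting back, even if mb_scrub() throws.
        mb_substitute_character($previous);
    }
}

// Usage: replace the invalid byte 0xFF with U+FFFD REPLACEMENT CHARACTER.
$clean = scrub_with("abc\xFFdef", 'UTF-8', 0xFFFD);
```

For an encoding with no equivalent of U+FFFD, the caller would still have to 
pick some other code point, which is exactly the difficulty with changing the 
default.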


>Each of these and other ideas could be part of an RFC, or we could
>brainstorm a Unicode built-in class that handles lots of the common use
>cases.

I don't think a single class that tries to "do Unicode" makes sense; it would 
be like having a "maths class" that contains methods for anything dealing with 
numbers.

In fact, I think the group of functions you're suggesting are a great 
illustration of what I was saying in my last message to Rob: they make perfect 
sense as standalone features, and don't need any grand plan to "have Unicode in 
core" before we proceed with them.

Regards,
Rowan Tommins
[IMSoP]
