On 16 August 2024 23:11:18 BST, Nick Lockheart <li...@ageofdream.com> wrote:
>I used the rather broad title "Should All String Functions Become
>Multi-Byte Safe" because there are many smaller related topics, but my
>intention was to discuss multi-byte in general

I think it was probably not the best choice, because it seems like what you're
specifically interested in is mostly not about existing functions, and not
particularly about encodings being more than one byte wide.

For instance, even good old 7-bit-per-character ASCII contains control
characters you might want help sanitising out; and plenty of
8-bit-per-character encodings include more than one script, and even more than
one writing direction (e.g. ISO 8859-8 Latin/Hebrew).

But the specific topic of safe input handling is definitely an interesting
one. And focussing on Unicode, rather than every possible encoding (multibyte
or not), makes sense in modern usage.

>There's a lot of potential pitfalls for dealing with Unicode input, and
>there are some best practices per the Unicode Consortium

It's worth looking into whether the ICU library has explicit functions to help
with those recommendations (if you can navigate its slightly patchy
documentation). Since most of ext/intl is just a thin wrapper around that
library, that could make our lives a lot easier.

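As a rough illustration of what ext/intl already exposes from ICU, the IntlChar
class offers per-code-point property lookups. The sketch below is only that: the
helper name classify_code_point() is made up for this example, and it makes no
claim that these calls cover the Unicode Consortium's recommendations.

```php
<?php
// Requires ext/intl. classify_code_point() is a hypothetical helper name.
function classify_code_point(int $cp): string
{
    // IntlChar::isdefined() is false for code points with no assigned character
    // in the Unicode version that the linked ICU library ships with.
    if (!IntlChar::isdefined($cp)) {
        return 'unassigned';
    }

    // IntlChar::charType() returns the general category as one of the
    // IntlChar::CHAR_CATEGORY_* constants.
    if (IntlChar::charType($cp) === IntlChar::CHAR_CATEGORY_PRIVATE_USE_CHAR) {
        return 'private use';
    }

    return 'assigned';
}

echo classify_code_point(0x0041), "\n"; // U+0041 LATIN CAPITAL LETTER A
echo classify_code_point(0xE000), "\n"; // first code point of the BMP Private Use Area
echo classify_code_point(0x0378), "\n"; // reserved (unassigned) at the time of writing
```

Whether ICU also has higher-level helpers for whole strings is exactly the part
that would need checking in its documentation.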
>For example, there should be a function that removes unassigned code
>points.
>
>There should also be a function that removes "scripts" (as defined by
>Unicode).
>
>We should have an easy way to remove private use code points (unless
>you're running a Star Trek fan site and really do need Klingon).

These all seem like good ideas. I think you can do at least some of it with
regular expressions, but dedicated functions have the potential to be both
easier to use and more efficient.

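For reference, a minimal sketch of the regular-expression route, assuming valid
UTF-8 input and a PCRE build with Unicode property support. The three function
names are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helpers. The /u modifier makes PCRE treat the pattern and the
// subject as UTF-8; preg_replace() returns null if the subject is not valid UTF-8.

function strip_unassigned(string $utf8): ?string
{
    // \p{Cn} is the "unassigned" general category, as known to the Unicode
    // tables bundled with the PCRE library in use.
    return preg_replace('/\p{Cn}+/u', '', $utf8);
}

function strip_private_use(string $utf8): ?string
{
    // \p{Co} is the "private use" general category.
    return preg_replace('/\p{Co}+/u', '', $utf8);
}

function strip_script(string $utf8, string $script): ?string
{
    // $script must be a script name PCRE recognises, e.g. "Cyrillic" or "Hebrew".
    if (!preg_match('/^[A-Za-z_]+$/', $script)) {
        throw new InvalidArgumentException('Unexpected script name');
    }
    return preg_replace('/\p{' . $script . '}+/u', '', $utf8);
}

var_dump(strip_private_use("a\u{E000}b"));
var_dump(strip_unassigned("a\u{0378}b"));
var_dump(strip_script("abc где", 'Cyrillic'));
```

One drawback this makes visible: what counts as "unassigned" depends on the
Unicode version compiled into PCRE, and callers need to know the property
names, which is part of why dedicated functions could be easier to use.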
>And the default replacement character for `mb_scrub` shouldn't be `?`.

This is trickier, and where mixing the terms "multibyte" and "Unicode" actually
matters. The mbstring extension supports a number of different text encodings,
most of which, unlike Unicode with U+FFFD, don't have a dedicated replacement
character to use. It also lets you set the default in global state with
mb_substitute_character(), so it's not immediately obvious how a different
default could be applied based on the specified encoding. (I'm not a fan of
that API design, but it's what we've got!)

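To make the global-state problem concrete, here is a sketch of what a caller has
to do today to scrub with U+FFFD without changing the setting for the rest of
the request. The wrapper name scrub_with() is hypothetical, and the example
assumes the internal encoding is UTF-8.

```php
<?php
// Requires ext/mbstring. scrub_with() is a hypothetical helper name.
function scrub_with(string $text, int $substitute, string $encoding): string
{
    // With no argument, mb_substitute_character() returns the current global
    // setting: a code point as int, or a string such as "none" or "long".
    $previous = mb_substitute_character();

    // This changes the substitute for every later mbstring call in the request.
    mb_substitute_character($substitute);

    try {
        return mb_scrub($text, $encoding);
    } finally {
        // Restore the previous global setting, even if mb_scrub() throws.
        mb_substitute_character($previous);
    }
}

// "\xFF" can never appear in well-formed UTF-8, so it gets substituted.
$clean = scrub_with("abc\xFFdef", 0xFFFD, 'UTF-8');
var_dump(mb_check_encoding($clean, 'UTF-8'));
```

The save-and-restore dance is the price of the setting being global rather than
a parameter, and nothing here selects a sensible substitute per encoding.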
>Each of these and other ideas could be part of an RFC, or we could
>brainstorm a Unicode built-in class that handles lots of the common use
>cases.

I don't think a single class that tries to "do Unicode" makes sense; it would
be like having a "maths class" that contains methods for anything dealing with
numbers.

In fact, I think the group of functions you're suggesting are a great
illustration of what I was saying in my last message to Rob: they make perfect
sense as standalone features, and don't need any grand plan to "have Unicode in
core" before we proceed with them.

Regards,
Rowan Tommins
[IMSoP]