> There's a lot of pitfalls here, and I don't think the documentation
> clearly calls out which functions are OK to use with UTF-8 and which
> ones may cause unexpected surprises.
>
> The compatibility between ASCII and UTF-8 for Latin characters is both
> a curse and a blessing. An application may work fine in testing, but
> then break when a user submits an emoji.
>
> [snip]
>
> (1) All string functions should state in the official man page if they
> are safe for UTF-8 or not.

https://github.com/php/doc-en is where our official documentation source lives. It is open source, and towards the end of the year, before the PHP major version release, the team and contributors often put a tremendous amount of work into updating the documentation to match the latest features, deprecations, etc. Contributions are always welcome, including ones that warn about certain functions not being multi-byte safe.

> (2) Functions intended for working with text should be made UTF-8 safe.

Generally speaking, all functions that deal with strings are in fact UTF-8 safe, because UTF-8 strings are a sequence of bytes just like any other string. Problems occur only when you try to modify or inspect the text in a way that depends on how it is read as human-readable text.

Take the _text_ "å" for example. What is the length of the string?

```php
strlen('å'); // 3
mb_strlen('å'); // 2
grapheme_strlen('å'); // 1
```

The correct length of the string above (`a\xCC\x8A`) is... well, all of them:

- `strlen` counts bytes. It is useful if you validate the length of user input before saving it to a database field with a byte-based `varchar` limit, or to avoid exceeding an index length.
- `mb_strlen` counts Unicode code points. The mbstring extension recognizes that "\xCC\x8A" encodes a single code point. It will only consider up to 4 bytes per code point, because that is the longest sequence UTF-8 allows.
- `grapheme_strlen` counts the actual human-perceived characters (grapheme clusters), which is what you should really be using if you are formatting text to a specific length.

It's also important to understand and appreciate that a lot of PHP functionality has been there for a very long time. You can't simply change a critical function like `strpos` this late in a programming language's life. See the excellent reply Larry made about what happened the time PHP tried to do exactly what you are suggesting.

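
To make the three lengths above concrete, here is a minimal sketch. It assumes the mbstring and intl extensions are loaded, and it spells the decomposed "å" with an explicit escape so the bytes are unambiguous. The variable names and the trailing "b" are only for illustration.

```php
<?php
// Decomposed "å": the letter "a" followed by U+030A COMBINING RING ABOVE,
// which UTF-8 encodes as the two bytes CC 8A.
$text = "a\u{030A}";

$bytes      = strlen($text);             // bytes
$codePoints = mb_strlen($text, 'UTF-8'); // code points (mbstring)
$graphemes  = grapheme_strlen($text);    // grapheme clusters (intl)

// The same distinction matters when cutting a string.
// Cutting by bytes can stop in the middle of a code point...
$byteCut = substr($text, 0, 2);
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');

// ...cutting by code points can separate a letter from its combining mark...
$codePointCut = mb_substr($text . 'b', 0, 1, 'UTF-8');

// ...and cutting by graphemes keeps the perceived character intact.
$graphemeCut = grapheme_substr($text . 'b', 0, 1);

printf("%d bytes, %d code points, %d grapheme(s)\n", $bytes, $codePoints, $graphemes);
```

None of the three cuts is wrong in itself; each answers a different question, which is why the choice has to be made on purpose.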
Replacing all `strlen` calls in a code base with `mb_strlen` or `grapheme_strlen` is not a good idea, because they serve a different requirement from `strlen` and should only be used intentionally, where necessary. The latter functions also have to inspect the strings sequentially, because UTF-8 is not a fixed-length encoding. This is quite slow, and it adds up when you process thousands of strings.

> (3) Functions intended for processing binary should be added if
> necessary, and should be named something like "binary" or "byte".

We are already doing it, just the other way around. See the `mb_*` and `grapheme_*` functions: all of them are purpose-built to support those features, and are clearly named as such. The rest of the functions consistently treat all strings as a sequence of bytes.

This naming pattern is arguably the correct way, because the majority of functions do not need to care whether the strings they deal with contain human-perceived characters or not. For example, the `base64_encode`/`decode` functions, `file_(get|put)_contents`, `pack`/`unpack`, etc. will work with any string regardless of whether it is valid UTF-8. Why should those strings need to be UTF-8 encoded in the first place?
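
As a small illustration of the point about byte-oriented functions, the sketch below pushes a string that is deliberately not valid UTF-8 through some of them. The byte values and the temporary file are arbitrary choices for the example, and `mb_check_encoding` assumes the mbstring extension is loaded.

```php
<?php
// Four arbitrary bytes that do not form valid UTF-8.
$binary = "\xFF\xFE\x00\x80";

// mbstring confirms this is not UTF-8 text...
$isUtf8 = mb_check_encoding($binary, 'UTF-8');

// ...yet the byte-oriented functions handle it without complaint.
$byteCount = strlen($binary);
$roundTrip = base64_decode(base64_encode($binary), true);

// Write the bytes to a temporary file and read them back unchanged.
$path = tempnam(sys_get_temp_dir(), 'bin');
file_put_contents($path, $binary);
$fromDisk = file_get_contents($path);
unlink($path);

var_dump($isUtf8, $byteCount, $roundTrip === $binary, $fromDisk === $binary);
```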