On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals <internals@lists.php.net> wrote: > > On Sat, Dec 2, 2023 at 6:13 AM Alex <alexinbeij...@gmail.com> wrote: > > > > Dear Stefan, and Dear Gina, > > > > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk > > of the mbstring library. I have been aware of this for a long time, and > > other mbstring developers have too. It dates back to the origin of the > > library; actually, even before the origin of mbstring, since mbstring was > > based on another library called libmbfl, and this behavior originates from > > libmbfl. > > > > Pull your chair up around the fire and let me tell you the tale of libmbfl. > > Once upon a time, there was a text-processing library called libmbfl. > > libmbfl was based on a collection of text-decoding routines (which > > converted bytes to codepoints) and text-encoding routines (which converted > > codepoints to bytes). Each such routine was structured as a stateful > > "filter". These filters could be assembled into "chains", whereby the > > output values generated by one routine would automatically be passed to the > > next. libmbfl could perform many wonderful text-processing tasks by > > substituting a different final filter at the end of the chain. > > > > But all was not well. Since libmbfl's filters processed text only one byte > > or codepoint at a time, and each routine had to save its state before > > returning, and restore its state upon entry, libmbfl was slow. Slow as a > > turtle, slow as a snail, slow as > > whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? > > A clever plan was hatched: give libmbfl a 256-entry table called a > > "mblen_table" for each supported text encoding with the property that the > > byte length of a character can be determined from its first byte. Then, > > text-processing tasks which were not dependent on the actual content of a > > string, but only on the number of codepoints, could be performed without > > ever invoking those wonderful, but painfully slow filters! libmbfl could > > skip through a string while just examining the first byte of each > > character. (Of course, this only worked for text encodings with an > > mblen_table.) For valid strings, the new method worked identically to the > > previous one. For invalid strings, there were significant differences in > > behavior, but libmbfl tried to ignore these and bravely pressed on. > > > > The story ends with an ironic twist. Many years later, I became interested > > in mbstring and reimplemented its internals, replacing the libmbfl code > > with fresh new code which ran many times faster. The new code was so much > > faster that in some cases, the mblen_table optimization actually became a > > pessimization! In other cases, the mblen_table-based code is still faster, > > but not by a large amount. But now mbstring was haunted by the spectre of > > Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code > > relying on mbstring, almost any observable behavior change runs a > > significant risk of breaking someone's code. And when this happens, they > > will not hesitate to vent their rage on the hapless maintainers. > > > > Notwithstanding the rage of the users, about a year ago, I did remove the > > mblen_table-based code in one place where benchmarks clearly showed it was > > acting as a pessimization. I don't remember which mbstring function was > > affected and would need to check the commit log to confirm. > > Hi Alex, > > Thank you very much for sharing this background context. > > > > > Personally, I think the real issue here is not the inconsistency between > > mbstring functions which are based on the mblen_tables and those which are > > not. I think a lot of mbstring operations should not be used on invalid > > strings at all, and that for such operations, mbstring would do well to > > throw an exception if it receives invalid input. (Like mb_strpos; how do > > you define the "position of a UTF-8 substring" when the parent string is > > not UTF-8 at all?) But that would be a huge BC break. > > > > My biggest concern is that this quirk can cause security issues in > user code. I came across this in the first place when discovering an > exploitable security vulnerability in an application. From my point of > view, this is not only about inconsistent behavior but also violates > the documentation for specific functions like mb_strstr. I agree that > a lot of mbstring operations should not be used on invalid strings, > and an exception seems to be an appropriate answer despite the huge BC > impact.
I think it is only a security issue when people accidentally think mb_* functions should be used if it is available. I've seen people do mb_strlen() on binary data, for example, not realizing the differences between mb_strlen and strlen. Or using mb_* functions and then passing them off to cryptographic functions. Robert Landers Software Engineer Utrecht NL -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php