On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals
<internals@lists.php.net> wrote:
>
> On Sat, Dec 2, 2023 at 6:13 AM Alex <alexinbeij...@gmail.com> wrote:
> >
> > Dear Stefan, and Dear Gina,
> >
> > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk 
> > of the mbstring library. I have been aware of this for a long time, and 
> > other mbstring developers have too. It dates back to the origin of the 
> > library; actually, even before the origin of mbstring, since mbstring was 
> > based on another library called libmbfl, and this behavior originates from 
> > libmbfl.
> >
> > Pull your chair up around the fire and let me tell you the tale of libmbfl. 
> > Once upon a time, there was a text-processing library called libmbfl. 
> > libmbfl was based on a collection of text-decoding routines (which 
> > converted bytes to codepoints) and text-encoding routines (which converted 
> > codepoints to bytes). Each such routine was structured as a stateful 
> > "filter". These filters could be assembled into "chains", whereby the 
> > output values generated by one routine would automatically be passed to the 
> > next. libmbfl could perform many wonderful text-processing tasks by 
> > substituting a different final filter at the end of the chain.
> >
> > But all was not well. Since libmbfl's filters processed text only one byte 
> > or codepoint at a time, and each routine had to save its state before 
> > returning, and restore its state upon entry, libmbfl was slow. Slow as a 
> > turtle, slow as a snail, slow as 
> > whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to do? 
> > A clever plan was hatched: give libmbfl a 256-entry table called a 
> > "mblen_table" for each supported text encoding with the property that the 
> > byte length of a character can be determined from its first byte. Then, 
> > text-processing tasks which were not dependent on the actual content of a 
> > string, but only on the number of codepoints, could be performed without 
> > ever invoking those wonderful, but painfully slow filters! libmbfl could 
> > skip through a string while just examining the first byte of each 
> > character. (Of course, this only worked for text encodings with an 
> > mblen_table.) For valid strings, the new method worked identically to the 
> > previous one. For invalid strings, there were significant differences in 
> > behavior, but libmbfl tried to ignore these and bravely pressed on.
> >
> > The story ends with an ironic twist. Many years later, I became interested 
> > in mbstring and reimplemented its internals, replacing the libmbfl code 
> > with fresh new code which ran many times faster. The new code was so much 
> > faster that in some cases, the mblen_table optimization actually became a 
> > pessimization! In other cases, the mblen_table-based code is still faster, 
> > but not by a large amount. But now mbstring was haunted by the spectre of 
> > Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code 
> > relying on mbstring, almost any observable behavior change runs a 
> > significant risk of breaking someone's code. And when this happens, they 
> > will not hesitate to vent their rage on the hapless maintainers.
> >
> > Notwithstanding the rage of the users, about a year ago, I did remove the 
> > mblen_table-based code in one place where benchmarks clearly showed it was 
> > acting as a pessimization. I don't remember which mbstring function was 
> > affected and would need to check the commit log to confirm.
>
> Hi Alex,
>
> Thank you very much for sharing this background context.
>
> >
> > Personally, I think the real issue here is not the inconsistency between 
> > mbstring functions which are based on the mblen_tables and those which are 
> > not. I think a lot of mbstring operations should not be used on invalid 
> > strings at all, and that for such operations, mbstring would do well to 
> > throw an exception if it receives invalid input. (Like mb_strpos; how do 
> > you define the "position of a UTF-8 substring" when the parent string is 
> > not UTF-8 at all?) But that would be a huge BC break.
> >
>
> My biggest concern is that this quirk can cause security issues in
> user code. I came across this in the first place when discovering an
> exploitable security vulnerability in an application. From my point of
> view, this is not only about inconsistent behavior but also violates
> the documentation for specific functions like mb_strstr. I agree that
> a lot of mbstring operations should not be used on invalid strings,
> and an exception seems to be an appropriate answer despite the huge BC
> impact.

I think it is only a security issue when people accidentally think
mb_* functions should be used if it is available. I've seen people do
mb_strlen() on binary data, for example, not realizing the differences
between mb_strlen and strlen. Or using mb_* functions and then passing
them off to cryptographic functions.

Robert Landers
Software Engineer
Utrecht NL

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Reply via email to