On Sat, Dec 2, 2023 at 6:13 AM Alex <alexinbeij...@gmail.com> wrote:
>
> Dear Stefan, and Dear Gina,
>
> Thanks for the message. Yes, Stefan has rediscovered an interesting quirk of 
> the mbstring library. I have been aware of this for a long time, and other 
> mbstring developers have too. It dates back to the origin of the library; 
> actually, even before the origin of mbstring, since mbstring was based on 
> another library called libmbfl, and this behavior originates from libmbfl.
>
> Pull your chair up around the fire and let me tell you the tale of libmbfl. 
> Once upon a time, there was a text-processing library called libmbfl. libmbfl 
> was based on a collection of text-decoding routines (which converted bytes to 
> codepoints) and text-encoding routines (which converted codepoints to bytes). 
> Each such routine was structured as a stateful "filter". These filters could 
> be assembled into "chains", whereby the output values generated by one 
> routine would automatically be passed to the next. libmbfl could perform many 
> wonderful text-processing tasks by substituting a different final filter at 
> the end of the chain.
>
> But all was not well. Since libmbfl's filters processed text only one byte or 
> codepoint at a time, and each routine had to save its state before returning, 
> and restore its state upon entry, libmbfl was slow. Slow as a turtle, slow as 
> a snail, slow as whatever-slowly-moving-thing-you-can-think-of. Oh, what was 
> libmbfl to do? A clever plan was hatched: give libmbfl a 256-entry table 
> called a "mblen_table" for each supported text encoding with the property 
> that the byte length of a character can be determined from its first byte. 
> Then, text-processing tasks which were not dependent on the actual content of 
> a string, but only on the number of codepoints, could be performed without 
> ever invoking those wonderful, but painfully slow filters! libmbfl could skip 
> through a string while just examining the first byte of each character. (Of 
> course, this only worked for text encodings with an mblen_table.) For valid 
> strings, the new method worked identically to the previous one. For invalid 
> strings, there were significant differences in behavior, but libmbfl tried to 
> ignore these and bravely pressed on.
>
> The story ends with an ironic twist. Many years later, I became interested in 
> mbstring and reimplemented its internals, replacing the libmbfl code with 
> fresh new code which ran many times faster. The new code was so much faster 
> that in some cases, the mblen_table optimization actually became a 
> pessimization! In other cases, the mblen_table-based code is still faster, 
> but not by a large amount. But now mbstring was haunted by the spectre of 
> Hyrum's Law (https://www.hyrumslaw.com/). With a huge body of legacy code 
> relying on mbstring, almost any observable behavior change runs a significant 
> risk of breaking someone's code. And when this happens, they will not 
> hesitate to vent their rage on the hapless maintainers.
>
> Notwithstanding the rage of the users, about a year ago, I did remove the 
> mblen_table-based code in one place where benchmarks clearly showed it was 
> acting as a pessimization. I don't remember which mbstring function was 
> affected and would need to check the commit log to confirm.

Hi Alex,

Thank you very much for sharing this background context.

>
> Personally, I think the real issue here is not the inconsistency between 
> mbstring functions which are based on the mblen_tables and those which are 
> not. I think a lot of mbstring operations should not be used on invalid 
> strings at all, and that for such operations, mbstring would do well to throw 
> an exception if it receives invalid input. (Like mb_strpos; how do you define 
> the "position of a UTF-8 substring" when the parent string is not UTF-8 at 
> all?) But that would be a huge BC break.
>

My biggest concern is that this quirk can cause security issues in
user code. I came across this in the first place when discovering an
exploitable security vulnerability in an application. From my point of
view, this is not only about inconsistent behavior but also violates
the documentation for specific functions like mb_strstr. I agree that
a lot of mbstring operations should not be used on invalid strings,
and an exception seems to be an appropriate answer despite the huge BC
impact.

Best Regards,
Stefan Schiller

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Reply via email to