On Mon, Dec 4, 2023 at 8:45 PM Alex <alexinbeij...@gmail.com> wrote: > > Stefan, > >> >> My biggest concern is that this quirk can cause security issues in >> user code. I came across this in the first place when discovering an >> exploitable security vulnerability in an application. From my point of >> view, this is not only about inconsistent behavior but also violates >> the documentation for specific functions like mb_strstr. I agree that >> a lot of mbstring operations should not be used on invalid strings, >> and an exception seems to be an appropriate answer despite the huge BC >> impact. > > > 1) I'm sure the vulnerable application is proprietary, but can you at least > tell us which mbstring functions were directly involved in this > vulnerability, give an overview of the mechanism by which the exploit worked > (scrubbed of any specific details about the application), and let us know the > general nature of the resulting compromise? (i.e. denial-of-service, > information disclosure, allowing users to impersonate other users, etc) > > Real-life examples of actual security impact provide far more compelling > grounds for a BC break than hypothetical security impact.
The case I am referring to involves mb_strpos, which is used to determine the index of a specific character, and mb_substr, which is used to extract the substring until the first occurrence of this character. For this specific case, this results in a Cross-Site Scripting (XSS) vulnerability, which would allow an attacker to perform actions on behalf of an authenticated user. The actual code is more complex, but the following lines should be a suitable representation of the issue: $data = $_GET['data']; $pos = mb_strpos($data, "<"); $out = $pos ? mb_substr($data, 0, $pos) : $data; echo $out; The developer's assumption here is that $out should never contain an opening HTML tag, and it is thus safe to echo it back in the response. Despite the fact that e.g. htmlspecialchars should be used to encode the output, this assumption seems reasonable to me. However, due to the inconsistent behavior, an attacker can break this assumption. If $data, e.g., contains the sequence "\xf0AAA<b>", str_pos will consider the index of "<" to be 4. However, mb_substr assumes the full sequence to have a length of 4 (1 x 4-byte Unicode character + 3 x 1-byte Unicode characters). Accordingly, the full sequence, including the "<b>" tag, is reflected in the response. > > 2) Your point about documentation is appreciated. However, you should > consider the context. We are dealing with a 20+-year old library, with a > large installed base, which has (historically) always had inconsistent, > poorly defined semantics in edge cases, and which (historically) has always > had poor documentation. Don't forget Hyrum's Law, Stefan. Never forget > Hyrum's Law. > > In view of all this, we are usually much more likely to amend the mbstring > documentation to match behavior than to modify behavior to match > documentation. Now, that's not to say we can never modify mbstring behavior > to match documentation (I have done it myself several times), but there > should always be good reasons to prefer the new behavior (not just "because > that's what the documentation says"). I can fully understand your concerns, especially taking into account the long-standing history of the library and Hyrum's Law. While the above-mentioned example might even be arguable since two different functions are involved, the behavior of, e.g., mb_strstr just feels wrong to me: $data = "\xf0start"; var_dump(mb_strstr($data, "start")); // string(2) "rt" > > 3) Modifying mbstring behavior to make the handling of invalid strings more > consistent is a smaller BC break than raising an exception. However, > completely refusing to operate on invalid strings (by raising an exception) > will be more effective in guarding against potential security vulnerabilities. Considering BC, completely refusing to operate on invalid strings seems like a huge break. From a preventing-security-issues-in-user-code point of view, either of these solutions seems appropriate. > > 4) Hamada-san is absolutely correct that all mbstring users should check > strings derived from user input using mb_check_encoding() before operating on > them. Unfortunately, "should" doesn't count for much. Most users will not do > what they "should". > > 5) Amending the documentation to call out these dangers more clearly would be > a positive step, though probably not enough. > In general, I like the idea of warning against pitfalls like this in the documentation, although I estimate the effect, especially on existing applications to be low. Best Regards, Stefan Schiller -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php