2025-04-20 17:31:56 -0400, Chet Ramey:
[...]
> This has been fixed since last July, and the fix is in bash-5.3.
[...]

Thanks, though as Greg says, there seem to be a few more
related issues still affecting 5.3. I repost a message sent
privately below now that the discussion has been extended to the
mailing list.

> The bug concerns unicode combining characters introducing
> invalid unicode character sequences that happen to contain the
> delimiter, and was reported privately.
[...]

That sentence doesn't seem to make sense to me.

Per
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G1708

> All combining characters can be applied to any base character
> and can, in principle, be used with any script. As with other
> characters, the allocation of a combining character to one
> block or another identifies only its primary usage; it is not
> intended to define or limit the range of characters to which
> it may be applied. In the Unicode Standard, all sequences of
> character codes are permitted.

While you could still say that applying a combining acute accent
to a control character such as backspace makes little sense, in
any case, "read" does not and should not care about the meaning
of characters once they have been decoded from their raw byte
encoding.

Here, in my understanding, the problem occurs (in 5.0 .. 5.2)
when a record ends in the truncated encoding of a multibyte
character, whether it's a combining character or not. That is
the case of that \315 byte in the example, which in UTF-8
encoded text is invalid if not followed by another byte in the
range \200 to \277, as in the encoding of characters U+0340 to
U+037F (a range which happens to include several combining
characters, but not only those: it also has the Greek letter Ͱ
for instance). The same would apply to any byte in the range
\300 to \364, at least in UTF-8, and to other byte values in
locales using multibyte character encodings other than UTF-8
(like the BIG5-HKSCS one mentioned below).

[reposted (and corrected) from an email sent privately]:

2025-04-20 16:48:48 +0100, Stephane Chazelas:
> 2025-04-20 07:27:52 -0400, Greg Wooledge:
> [...]
> > I'm not aware of any security implications.
>
> Well, I'd say it has quite a large potential for causing
> vulnerabilities.
[...]

The 5.3 bug as well I'd say:

$ printf '%b\0' winter spring 'summer\0200apple\0200banana\0200cherry' automn | ./bash -c 'while IFS= read -rd "" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<spring>
<summer>
<apple>
<banana>
<cherry>
<automn>

That allows an "attacker" to insert an arbitrary number of
records if they control the contents of one such record.

Delimiter bytes from 0x80 to 0xbf (continuation bytes of UTF-8
encoded characters) in UTF-8 locales are also a problem:

$ printf '%b\200' winter 'spring\0375' summer automn | ./bash -c $'while IFS= read -rd "\200" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<$'spring\375\200summer'>
<automn>

(though those are a lot less likely to be used in practice and,
as those bytes, contrary to NUL or NL, can occur in the encoding
of multibyte characters, it seems reasonable to expect the user
to use LC_ALL=C).

I can reproduce similar problems in locales using other
multibyte encodings:

$ printf '%b\243' winter 'spring\0375' '\0277summer' '\0277' automn | LANG=zh_HK.big5hkscs ./bash -c $'while IFS= read -rd "\243" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<$'spring\375\243\277summer'>
<$'\277\243automn'>

In any case, when the bugs are fixed, I'd say it would be worth
backporting the fixes to 5.0, 5.1 and 5.2 as security patches.

As mentioned at https://mywiki.wooledge.org/BashPitfalls#pf65,
for -d '', it's also a POSIX conformance bug as POSIX states:

> If the -d delim option is specified and delim is the null
> string, the standard input shall contain zero or more bytes
> (which need not form valid characters).

-d was added to POSIX alongside find -print0 and xargs -r0
especially so as to be able to process the output of find
-print0 safely with

IFS= read -rd '' pathname

That text above however is likely incomplete as that leaves it
unclear how backslashes and IFS characters would be identified
in non-text input if -r is not passed or IFS is not empty or is
unset.

-- 
Stephane
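
As a minimal sketch of the LC_ALL=C suggestion above: running the read loop in the C locale means no byte is decoded as part of a multibyte character, so a record ending in a stray \315 lead byte should not swallow the NUL delimiter that follows it. The count_records name is hypothetical, and this assumes a bash found in $PATH rather than the ./bash build used in the transcripts.

```shell
# Sketch only: count_records is a hypothetical helper, not part of the thread.
# It counts the NUL-delimited records found on stdin.
count_records() {
  # LC_ALL=C makes bash treat input as single bytes, so a truncated
  # multibyte sequence at the end of a record cannot absorb the delimiter.
  LC_ALL=C bash -c '
    n=0
    while IFS= read -rd "" record; do
      n=$((n + 1))
    done
    printf "%s\n" "$n"
  '
}

# Three records; the second one ends in \315, a UTF-8 lead byte that is
# not followed by a continuation byte.
printf '%b\0' winter 'summer\0315' autumn | count_records
```

If each record is delimited correctly the count is 3. The same input can be fed to the loops shown earlier in a UTF-8 locale to compare against an affected version.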