2025-04-20 17:31:56 -0400, Chet Ramey:
[...]
> This has been fixed since last July, and the fix is in bash-5.3.
[...]

Thanks, though as Greg says, there seem to be a few more
related issues still affecting 5.3. I am reposting below a
message sent privately, now that the discussion has been
extended to the mailing list.

> The bug concerns unicode combining characters introducing
> invalid unicode character sequences that happen to contain the
> delimiter, and was reported privately.
[...]

That sentence doesn't seem to make sense to me.

Per https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G1708

> All combining characters can be applied to any base character
> and can, in principle, be used with any script. As with other
> characters, the allocation of a combining character to one
> block or another identifies only its primary usage; it is not
> intended to define or limit the range of characters to which
> it may be applied. In the Unicode Standard, all sequences of
> character codes are permitted.

While you could still say that applying a combining acute accent
to a control character such as backspace makes little sense, in
any case, "read" does not and should not care about the meaning
of characters once they have been decoded from their raw byte
encoding.

Here, in my understanding, the problem occurs (in 5.0 .. 5.2)
when a record ends in the truncated encoding of a multibyte
character, whether it's a combining character or not. The \315
byte in the example is one such case: in UTF-8 encoded text it
is invalid unless followed by a byte in the range \200 to \277,
as in the encodings of characters U+0340 to U+037F (a range
which includes several combining characters, but also others
such as the Greek letter Ͱ). The same would apply to any byte
in the range \300 to \364, at least in UTF-8, and to other byte
values in locales using multibyte character encodings other
than UTF-8 (like the BIG5-HKSCS one mentioned below).
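
To make the "truncated encoding" point concrete, here is a minimal
sketch that checks whether a byte string is well-formed UTF-8. The
is_valid_utf8 helper is hypothetical (not part of bash or of the
original report), and it assumes an iconv that rejects incomplete
sequences at end of input, as glibc's does.

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
# Assumption: iconv exits non-zero on a truncated multibyte sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \315\200 is the complete 2-byte encoding of U+0340.
if printf 'summer\315\200' | is_valid_utf8; then
  echo 'complete sequence: valid'
fi

# A record ending in a bare \315 ends in a truncated character.
if ! printf 'summer\315' | is_valid_utf8; then
  echo 'truncated sequence: invalid'
fi
```

Nothing here involves combining characters as such: the check only
looks at whether the lead byte is followed by the continuation bytes
it calls for.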

[reposted (and corrected) from an email sent privately]:

2025-04-20 16:48:48 +0100, Stephane Chazelas:
> 2025-04-20 07:27:52 -0400, Greg Wooledge:
> [...]
> > I'm not aware of any security implications.
> 
> Well, I'd say it has quite a large potential for causing
> vulnerabilities.
[...]

The 5.3 bug as well I'd say:

$ printf '%b\0' winter spring 'summer\0200apple\0200banana\0200cherry' automn |
   ./bash -c 'while IFS= read -rd "" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<spring>
<summer>
<apple>
<banana>
<cherry>
<automn>

Allowing an "attacker" to insert an arbitrary number of records
if they control the contents of one such record.
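
One rough way to check whether a given bash is affected is to compare
the number of NUL bytes in the input with the number of records the
read loop returns. This is only a sketch: the function names are
hypothetical, and it tests whichever bash comes first in PATH, in the
current locale.

```shell
# Sample input modelled on the example above: four NUL-terminated
# records, one of which contains stray \200 bytes.
sample() {
  printf '%b\0' winter spring 'summer\0200apple\0200banana' autumn
}

# Ground truth: one record per NUL byte, counted without any decoding.
count_delimiters() {
  tr -cd '\000' | wc -c | tr -d ' '
}

# Number of records as seen by bash's read in the current locale.
count_read_records() {
  bash -c 'n=0; while IFS= read -rd "" rec; do n=$((n + 1)); done; echo "$n"'
}

expected=$(sample | count_delimiters)
actual=$(sample | count_read_records)
if [ "$expected" -eq "$actual" ]; then
  echo "ok: $actual records"
else
  echo "mismatch: $expected delimiters, but read returned $actual records"
fi
```

A mismatch means that read split a record somewhere other than at a
delimiter, which is what makes the record injection possible.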

Delimiter bytes from 0x80 to 0xbf (continuation bytes of UTF-8 encoded
characters) in UTF-8 locales are also a problem:

$ printf '%b\200' winter 'spring\0375' summer automn |
  ./bash -c $'while IFS= read -rd "\200" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<$'spring\375\200summer'>
<automn>

(though such delimiters are a lot less likely to be used in
practice, and as those bytes, contrary to NUL or NL, can occur
in the encoding of multibyte characters, it seems reasonable to
expect the user to use LC_ALL=C).
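
As a sketch of that LC_ALL=C approach, the same input can be split
with the whole reader running in the C locale, where every byte is a
character of its own. The split_on_0x80 wrapper and the script
variable are hypothetical names, and the example assumes a bash in
PATH rather than the ./bash build tested above.

```shell
# The reader is kept in a here-document so that $'\200' is parsed by
# bash itself, whatever shell runs the outer commands.
script=$(cat <<'EOF'
while IFS= read -rd $'\200' season; do
  printf '<%q>\n' "$season"
done
EOF
)

# Hypothetical wrapper: split stdin on byte 0x80 with bash in the C locale.
split_on_0x80() {
  LC_ALL=C bash -c "$script"
}

printf '%b\200' winter 'spring\0375' summer autumn | split_on_0x80
```

In the C locale this should give four records, with the \375 byte
kept at the end of the second one instead of swallowing the delimiter
that follows it.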

I can reproduce similar problems in locales using other multibyte encodings:

$ printf '%b\243' winter 'spring\0375' '\0277summer' '\0277' automn |
   LANG=zh_HK.big5hkscs ./bash -c $'while IFS= read -rd "\243" season; do LC_ALL=C printf "<%q>\n" "$season"; done'
<winter>
<$'spring\375\243\277summer'>
<$'\277\243automn'>

In any case, when the bugs are fixed, I'd say it would be worth
backporting to 5.0, 5.1 and 5.2 as security patches.

As mentioned at https://mywiki.wooledge.org/BashPitfalls#pf65,
for -d '', it's also a POSIX conformance bug as POSIX states:

> If the -d delim option is specified and delim is the null
> string, the standard input shall contain zero or more bytes
> (which need not form valid characters).

-d was added to POSIX alongside find -print0 and xargs -r0
specifically so as to be able to process the output of find
-print0 safely with IFS= read -rd '' pathname.
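
For reference, that idiom can be sketched as follows. The list_files
wrapper and the temporary directory are illustrative only. Running the
loop under LC_ALL=C is an assumption made here to sidestep the
decoding problems described above; it is not something the POSIX text
asks for.

```shell
# Reader kept in a here-document so it is parsed by bash, not by the
# outer shell. $1 is the directory to walk.
script=$(cat <<'EOF'
find "$1" -type f -print0 |
  while IFS= read -rd '' pathname; do
    printf '%q\n' "$pathname"
  done
EOF
)

# Hypothetical wrapper: print one shell-quoted file path per line.
list_files() {
  LC_ALL=C bash -c "$script" bash "$1"
}

dir=$(mktemp -d)
# One ordinary name and one name containing a newline.
touch "$dir/plain" "$dir/$(printf 'two\nlines')"
list_files "$dir"
rm -rf "$dir"
```

Each file comes out as exactly one record even when its name contains
a newline, since only NUL ends a record.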

The text above is however likely incomplete, as it leaves it
unclear how backslashes and IFS characters would be identified
in non-text input if -r is not passed, or if IFS is not empty or
is unset.
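
For comparison, this is how backslash and IFS processing behave on
ordinary text input, which is the well-defined case. The short sketch
below only uses the standard read utility with newline-delimited
records; it says nothing about how non-text input should be treated.

```shell
# Without -r, backslash acts as an escape character and is removed.
printf '%s\n' 'a\b' | { read v; printf '<%s>\n' "$v"; }               # <ab>

# With -r, the backslash is kept as data.
printf '%s\n' 'a\b' | { read -r v; printf '<%s>\n' "$v"; }            # <a\b>

# With the default IFS, leading and trailing blanks are stripped.
printf '%s\n' '  a b  ' | { read -r v; printf '<%s>\n' "$v"; }        # <a b>

# With an empty IFS, the record is stored unchanged.
printf '%s\n' '  a b  ' | { IFS= read -r v; printf '<%s>\n' "$v"; }   # <  a b  >
```

Both steps rely on recognising specific characters in the input,
which is where the ambiguity for byte strings that are not valid text
comes from.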

-- 
Stephane






