Re: Unexpected (?) segfault after unset LANG

Chet Ramey Sun, 11 Feb 2018 15:29:39 -0800

On 2/9/18 11:53 AM, mike b wrote:
> Indeed, with build from devel it doesn't segfault anymore. Just out of pure
> curiosity, which commit introduced a fix for that? Aaand, there's one more
> thing that puzzles me a bit:
> 
> # echo $BASH_VERSION
> 5.0.12(1)-alpha
> # echo ${LANG:-bleh}
> bleh


LANG has the lowest priority of the various locale environment variables.

Without LANG set, the locale is either determined by the LC_ variables,
which may or may not be set, or the system's default locale at program
start (probably "C").
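As a rough sketch of that precedence (LC_ALL first, then the category's own
LC_ variable, then LANG), the lookup for a single category can be written
as below. The function name effective_locale is hypothetical, and this only
mirrors the documented POSIX ordering, not bash's internal code. An empty
result means nothing is set and the implementation default applies.

```shell
# Hypothetical helper: report which value governs one locale category.
# Precedence: LC_ALL, then the category's own LC_* variable, then LANG.
# An empty value counts the same as an unset one.
effective_locale() {
    category=$1                      # e.g. LC_CTYPE
    eval "specific=\${$category:-}"  # value of the category's own variable
    printf '%s\n' "${LC_ALL:-${specific:-${LANG:-}}}"
}

# Show what currently governs LC_CTYPE in this shell.
effective_locale LC_CTYPE
```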

> # LANG=UTF-8

If you don't have a "UTF-8" locale on your system (Mac OS X happens to, but
Linux does not), this fails and leaves the locale unchanged from whatever
the default happens to be. Strictly speaking, that's not a valid locale
specification; it's an encoding (codeset).
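One way to check whether a name is an installed locale is to look for it in
the output of `locale -a'. The helper name locale_in_list below is
hypothetical, and the match is exact, so it does not account for systems
that list codesets with a different spelling (for example `utf8' instead of
`UTF-8').

```shell
# Hypothetical helper: succeed if the name in $1 is one of the lines on stdin.
locale_in_list() {
    grep -Fqx -- "$1"
}

# Intended use: feed it the system's list of installed locales.
if locale -a 2>/dev/null | locale_in_list "UTF-8"; then
    echo "UTF-8 is an installed locale name"
else
    echo "UTF-8 is not an installed locale name"
fi
```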

> # printf '%s\n' $'\u013b'

Bash takes `013b', converts it to a numeric Unicode value (315), and tries
to convert it to a character. Since your system probably defines
__STDC_ISO_10646__, as most Linux systems seem to, that value can be
directly used as a wchar_t and converted to a multibyte character using
wctomb().
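To make the numbers concrete, here is a sketch of the hex-to-value step and
of the byte sequence a UTF-8 locale would produce for it. The function name
utf8_bytes is hypothetical, the arithmetic is done by hand rather than by
calling wctomb(), and it only covers code points below U+10000.

```shell
# Hypothetical helper: print the UTF-8 bytes (in hex) for a code point
# given as hex digits. Handles only code points below U+10000 and does
# not reject surrogates.
utf8_bytes() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then
        # one byte: 0xxxxxxx
        printf '%02X\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # two bytes: 110xxxxx 10xxxxxx
        printf '%02X %02X\n' $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    else
        # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02X %02X %02X\n' $((0xE0 | (cp >> 12))) \
            $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
    fi
}

echo $((0x013b))   # 315
utf8_bytes 013b    # C4 BB
```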

> Ļ

wctomb() returns the multibyte character sequence you see here.

> # unset LANG
> # : # \o/ no segfaults

Assuming the absence of LC_ALL or any other LC_ variables, this explicitly
sets the locale to the system default (""). Bash does a little more work
than it probably needs to, and explicitly sets all the different parts of
the locale (LC_CTYPE, LC_MESSAGES, etc.) to "" ("C").

> # printf '%s\n' $'\u013b'
> \u013B

The same path through wctomb(), but this time wctomb() returns -1/EILSEQ
(illegal byte sequence), and bash attempts to convert it using iconv().
That fails, so bash chooses to handle the multiple errors by returning
a C99-style escape sequence regenerated from the numeric value. That's
why the lower-case `b' gets converted to `B'.
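The case change falls out of formatting the number again with upper-case
hex digits. A minimal sketch of that kind of fallback, assuming a
four-digit \u form below U+10000 and an eight-digit \U form otherwise (the
name c99_escape is hypothetical and this is not bash's own code):

```shell
# Hypothetical helper: format a code point (given as hex digits) as a
# C99-style universal character name, using upper-case hex digits.
c99_escape() {
    cp=$((0x$1))
    if [ "$cp" -lt 65536 ]; then
        printf '\\u%04X\n' "$cp"
    else
        printf '\\U%08X\n' "$cp"
    fi
}

c99_escape 013b   # \u013B
```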

> # LANG=UTF-8

Since this isn't a valid locale, nothing changes.

> # printf '%s\n' $'\u013b'
> \u013B # why it returns just the code?

wctomb() returns -1/EILSEQ.

> When LANG is set to UTF-8, printf returns actual character which corresponds
> to given code after first call, however, after LANG is toggled, printf
> keeps returning just the code. I guess my question here is: why that
> happens? I mean, I would expect it to decode it whenever LANG is set back
> to UTF-8 in this case. Am I missing something here?

The fact that setting LANG=UTF-8 is actually a no-op. I think the real
difference is between whatever the default value for LC_CTYPE is at program
startup, and bash setting it to "" (system default: "C" or "POSIX") when
LANG is unset.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
