On 2/9/18 11:53 AM, mike b wrote:
> Indeed, with build from devel it doesn't segfault anymore. Just out of pure
> curiosity, which commit introduced a fix for that? Aaand, there's one more
> thing that puzzles me a bit:
>
> # echo $BASH_VERSION
> 5.0.12(1)-alpha
> # echo ${LANG:-bleh}
> bleh

LANG has the lowest priority of the various locale environment variables.
Without LANG set, the locale is determined either by the LC_ variables,
which may or may not be set, or by the system's default locale at program
start (probably "C").

> # LANG=UTF-8

If you don't have a "UTF-8" locale on your system (Mac OS X happens to, but
Linux does not), this fails and leaves the locale unchanged from whatever
the default happens to be. Strictly speaking, that's not a valid locale
specification; it's an encoding (codeset).

> # printf '%s\n' $'\u013b'

Bash takes `013b', converts it to a numeric Unicode value (315), and tries
to convert it to a character. Since your system probably defines
__STDC_ISO_10646__, as most Linux systems seem to, that value can be used
directly as a wchar_t and converted to a multibyte character using wctomb().

> Ļ

wctomb() returns the multibyte character sequence you see here.

> # unset LANG
> # : # \o/ no segfaults

Assuming the absence of LC_ALL or any other LC_ variables, this explicitly
sets the locale to the system default (""). Bash does a little more work
than it probably needs to, and explicitly sets all the different parts of
the locale (LC_CTYPE, LC_MESSAGES, etc.) to "" ("C").

> # printf '%s\n' $'\u013b'
> \u013B

The same path through wctomb(), but this time wctomb() returns -1/EILSEQ
(illegal byte sequence), and bash attempts to convert it using iconv().
That fails, so bash chooses to handle the multiple errors by returning a
C99-style escape sequence. That's why the lower-case `b' gets converted
to `B'.

> # LANG=UTF-8

Since this isn't a valid locale, nothing changes.

> # printf '%s\n' $'\u013b'
> \u013B # why it returns just the code?

wctomb() returns -1/EILSEQ.

> When LANG is set to UTF-8, printf returns actual character which corresponds
> to given code after first call, however, after LANG is toggled, printf
> keeps returning just the code. I guess my question here is: why that
> happens?
> I mean, I would expect it to decode it whenever LANG is set back
> to UTF-8 in this case. Am I missing something here?

The fact that setting LANG=UTF-8 is actually a no-op. I think the real
difference is between whatever the default value for LC_CTYPE is at program
startup, and bash setting it to "" (system default: "C" or "POSIX") when
LANG is unset.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
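
P.S. The variable precedence described in this message (LC_ALL first, then
LC_CTYPE, then LANG last) can be sketched as a small shell function. This is
only an illustration: ctype_source is a hypothetical helper, not something
bash provides, and it looks at the environment variables only. It does not
ask setlocale() whether the value names a locale that exists on the system,
which is exactly the gap that makes LANG=UTF-8 a silent no-op.

```shell
# Hypothetical helper (not part of bash): report which variable would
# decide LC_CTYPE, using the POSIX precedence LC_ALL > LC_CTYPE > LANG.
# An empty value is treated the same as an unset one.
ctype_source() {
    if [ -n "${LC_ALL:-}" ]; then
        printf 'LC_ALL=%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf 'LC_CTYPE=%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf 'LANG=%s\n' "$LANG"
    else
        # Nothing set: the implementation default applies (usually "C").
        printf 'default\n'
    fi
}
```

With all three variables unset it prints `default'; with only LANG set it
names LANG. It will name LANG even for a value the system rejects, so a
check against `locale -a' is still needed before trusting the result.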