On 10/16/18 3:57 AM, Mark H Weaver wrote:
> retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
> thanks
>
> Hi Tom,
>
> Thanks for the report, analysis and patch. I agree with your analysis,
> and the patch looks good.
>

If so, can the patch be committed?

I'm running into this problem in the context of gdb, which fails like
this:
...
$ LC_CTYPE=ja_JP.sjis gdb".
Segmentation fault (core dumped)
...

So, gdb (which has a dependency on libguile) aborts because of guile
initialization, without gdb actually using the guile functionality, and
the patch fixes this.

> However, there's also a much deeper problem here. You found and fixed
> one occurrence of Guile assuming that the locale encoding is
> ASCII-compatible. In fact, this assumption is widespread in Guile, and I
> would guess that it's widespread throughout the POSIX world.
>
> I admit that before I saw your message, I believed that it was
> legitimate to assume that the locale encoding was ASCII-compatible. Now
> I'm unsure, although I'll note that according to the 'localedef' utility
> from GNU libc, this locale is "not ISO C compliant". It printed the
> following message when I asked it to generate the "ja_JP.sjis" locale:
>
> [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO
> C compliant [--no-warnings=ascii]
>
> Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
> 0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
> to the Yen sign (¥) and overline (‾) in Shift_JIS. Backslash (\) and
> tilde (~) are multibyte characters in Shift_JIS.
>
> One common problem is that Guile often uses 'scm_from_locale_string' to
> create Scheme strings from ASCII-only C string literals. These should
> all be changed to use either 'scm_from_latin1_string' or
> 'scm_from_utf8_string'. I prefer the latter because modern C compilers
> typically use UTF-8 as the default execution character set, i.e. the
> character set used to encode string and character constants, regardless
> of the locale settings. GCC uses UTF-8 by default unless
> -fexec-charset=CHARSET is given at compile time.
> I'd prefer to promote writing code that works for arbitrary string
> literals, so that code needn't be adjusted if non-ASCII characters are
> later added.
>
> A related set of problems is that Guile often applies
> 'scm_from_locale_string' to char* arguments passed in from the user, or
> produced by third-party libraries. These issues are more difficult to
> address. We provide several C APIs that accept C strings without
> specifying what encoding is expected. If the string ultimately derives
> from a C string constant, we probably want UTF-8, whereas if the string
> came from I/O, or program arguments, then we probably want the locale
> encoding.
>
> For example, consider 'scm_c_eval_string'. This has been a public API
> function since 2002, but we did not specify the encoding of its C string
> argument until 2011. We chose the locale encoding in this case, which I
> think is reasonable, but I also expect that code exists in the wild that
> passes a C string literal to 'scm_c_eval_string'.
>
> Until now, problems like this have been mostly harmless, since the C
> string literals are typically ASCII-only. However, if we wish to
> support non-ASCII-compatible encodings such as Shift_JIS, we can no
> longer consider these problems harmless. For example, programs which
> pass C string literals to 'scm_c_eval_string' will fail when using the
> "ja_JP.sjis" locale, if any tildes or backslashes are present.
> Backslashes are fairly common in Scheme code.
>
> There's various other code scattered in Guile that assumes ASCII
> characters can be searched for, and sometimes replaced with other ASCII
> characters. For example, several functions in load.c, including
> 'search_path' and 'load_thunk_from_path', scan through file names in the
> locale encoding, scanning the bytes looking for particular ASCII codes
> such as '.', '/', and '\'.
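
To make the byte-scanning hazard above concrete, here is a minimal
sketch. It is not Guile's code: the helper names (sjis_is_lead_byte,
find_byte_naive, find_byte_sjis) are hypothetical, and the lead-byte
ranges 0x81-0x9F and 0xE0-0xFC are the usual Shift_JIS ones, assumed
here without checking what a given libc accepts. It contrasts a plain
byte search with one that skips the trail byte of two-byte characters.

```c
#include <stddef.h>
#include <string.h>

/* Return nonzero if BYTE starts a two-byte Shift_JIS character.
   Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.  */
static int
sjis_is_lead_byte (unsigned char byte)
{
  return (byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC);
}

/* Byte-wise search, as code assuming an ASCII-compatible encoding
   would do it.  Returns the byte offset of the first match, or -1.  */
static long
find_byte_naive (const char *s, unsigned char target)
{
  const char *p = strchr (s, target);
  return p ? (long) (p - s) : -1;
}

/* Encoding-aware search: step over both bytes of every two-byte
   character, so only single-byte occurrences of TARGET can match.  */
static long
find_byte_sjis (const char *s, unsigned char target)
{
  size_t i = 0;

  while (s[i] != '\0')
    {
      unsigned char c = (unsigned char) s[i];

      if (sjis_is_lead_byte (c) && s[i + 1] != '\0')
        i += 2;                 /* skip lead byte and trail byte */
      else
        {
          if (c == target)
            return (long) i;
          i++;
        }
    }
  return -1;
}
```

For the file name bytes 0x95 0x5C followed by ".scm" (0x95 0x5C is a
single two-byte character), find_byte_naive reports a 0x5C at offset 1,
while find_byte_sjis reports no match, because that 0x5C is only a
trail byte.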
>
> On MinGW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
> forward slashes byte-wise, assuming ASCII-compatibility, and this
> transformation is applied to file names in several places.
>
> While looking into this, I also discovered that Guile's S-expression
> reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
> encoding, despite the fact that it is meant to support arbitrary
> encodings such as UTF-16 and UTF-32. I just filed a related bug
> <https://bug.gnu.org/33057> to track this problem.
>
> These are some of the problems that I'm currently aware of. I expect
> that this bug report will remain open for a while.
>
> To begin, I've started working on a patch to change many occurrences of
> 'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
> string clearly originates from a C string literal.
>

Thanks for the elaboration here, that's helpful for me.

Thanks,
- Tom