bug#33044: Guile misbehaves in the "ja_JP.sjis" locale

Tom de Vries Tue, 16 Oct 2018 16:28:24 -0700

On 10/16/18 3:57 AM, Mark H Weaver wrote:
> retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
> thanks
> 
> Hi Tom,
> 
> Thanks for the report, analysis and patch.  I agree with your analysis,
> and the patch looks good.
>


If so, can the patch be committed?

I'm running into this problem in the context of gdb, which fails like this:
...
$ LC_CTYPE=ja_JP.sjis gdb
Segmentation fault (core dumped)
...

So gdb (which depends on libguile) crashes during guile initialization,
even though gdb is not actually using any guile functionality.  The
patch fixes this.

> However, there's also a much deeper problem here.  You found and fixed
> one occurrence of Guile assuming that the locale encoding is ASCII-
> compatible.  In fact, this assumption is widespread in Guile, and I
> would guess that it's widespread throughout the POSIX world.
> 
> I admit that before I saw your message, I believed that it was
> legitimate to assume that the locale encoding was ASCII-compatible.  Now
> I'm unsure, although I'll note that according to the 'localedef' utility
> from GNU libc, this locale is "not ISO C compliant".  It printed the
> following message when I asked it to generate the "ja_JP.sjis" locale:
> 
>   [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]
> 
> Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
> 0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
> to the Yen sign (¥) and overline (‾) in Shift_JIS.  Backslash (\) and
> tilde (~) are multibyte characters in Shift_JIS.
> 
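To make the two differing code points concrete, here is a minimal sketch
of a decoder for the 7-bit single-byte range only, following the mapping
described above.  The function name sjis_single_byte_to_ucs is
hypothetical (it is not a Guile or libc API), and returning U+FFFD for
anything at or above 0x80 is an assumption made to keep the example
short.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Guile or libc: decode one byte in
   the 7-bit range under the Shift_JIS mapping described in the quoted
   text.  Bytes >= 0x80 are out of scope for this sketch and yield
   U+FFFD (an assumption, not real decoder behaviour). */
static uint32_t
sjis_single_byte_to_ucs (unsigned char b)
{
  if (b == 0x5C)
    return 0x00A5;              /* YEN SIGN, not backslash */
  if (b == 0x7E)
    return 0x203E;              /* OVERLINE, not tilde */
  if (b < 0x80)
    return b;                   /* identical to ASCII */
  return 0xFFFD;                /* lead bytes, katakana: not handled */
}
```

Every other 7-bit byte decodes to itself, which is why code that assumes
ASCII compatibility appears to work until a backslash or tilde shows up.
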
> One common problem is that Guile often uses 'scm_from_locale_string' to
> create Scheme strings from ASCII-only C string literals.  These should
> all be changed to use either 'scm_from_latin1_string' or
> 'scm_from_utf8_string'.  I prefer the latter because modern C compilers
> typically use UTF-8 as the default execution character set, i.e. the
> character set used to encode string and character constants, regardless
> of the locale settings.  GCC uses UTF-8 by default unless
> -fexec-charset=CHARSET is given at compile time.  I'd prefer to promote
> writing code that works for arbitrary string literals, so that code
> needn't be adjusted if non-ASCII characters are later added.
> 
> A related set of problems is that Guile often applies
> 'scm_from_locale_string' to char* arguments passed in from the user, or
> produced by third-party libraries.  These issues are more difficult to
> address.  We provide several C APIs that accept C strings without
> specifying what encoding is expected.  If the string ultimately derives
> from a C string constant, we probably want UTF-8, whereas if the string
> came from I/O, or program arguments, then we probably want the locale
> encoding.
> 
> For example, consider 'scm_c_eval_string'.  This has been a public API
> function since 2002, but we did not specify the encoding of its C string
> argument until 2011.  We chose the locale encoding in this case, which I
> think is reasonable, but I also expect that code exists in the wild that
> passes a C string literal to 'scm_c_eval_string'.
> 
> Until now, problems like this have been mostly harmless, since the C
> string literals are typically ASCII-only.  However, if we wish to
> support non-ASCII-compatible encodings such as Shift_JIS, we can no
> longer consider these problems harmless.  For example, programs which
> pass C string literals to 'scm_c_eval_string' will fail when using the
> "ja_JP.sjis" locale, if any tildes or backslashes are present.
> Backslashes are fairly common in Scheme code.
> 
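As a rough illustration of which C string literals are affected, the
sketch below flags a literal if it contains either of the two bytes that
decode differently under Shift_JIS.  The name
literal_is_sjis_ascii_safe is hypothetical, and the check assumes the
literal is plain ASCII to begin with; it is not something Guile
provides.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the ASCII literal S contains neither
   0x5C (backslash) nor 0x7E (tilde), i.e. it would decode to the same
   characters in a "ja_JP.sjis" locale as in an ASCII-compatible one.
   Assumes S holds only 7-bit ASCII bytes. */
static int
literal_is_sjis_ascii_safe (const char *s)
{
  size_t i;

  for (i = 0; s[i] != '\0'; i++)
    if (s[i] == '\\' || s[i] == '~')
      return 0;
  return 1;
}
```

For instance, "(display 1)" passes this check, while "(display #\\a)"
does not, because the Scheme character syntax needs a backslash.
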
> There's various other code scattered in Guile that assumes ASCII
> characters can be searched for, and sometimes replaced with other ASCII
> characters.  For example, several functions in load.c, including
> 'search_path' and 'load_thunk_from_path', scan through file names in
> the locale encoding byte by byte, looking for particular ASCII codes
> such as '.', '/', and '\'.
> 
> On MingW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
> forward slashes byte-wise, assuming ASCII-compatibility, and this
> transformation is applied to file names in several places.
> 
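The hazard with byte-wise scanning can be sketched as follows.  In
Shift_JIS, byte 0x5C can also appear as the second byte of a two-byte
character (for example, 0x95 0x5C encodes a single kanji), so replacing
every 0x5C byte corrupts such characters.  The functions below are
hypothetical and are not Guile's actual code; the lead-byte ranges
0x81-0x9F and 0xE0-0xFC are the usual Shift_JIS ranges, stated here as
an assumption.

```c
#include <stddef.h>

/* Assumed Shift_JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC. */
static int
sjis_is_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive version: replaces every 0x5C byte, including trail bytes of
   two-byte characters. */
static void
mirror_bytewise (char *s)
{
  for (; *s != '\0'; s++)
    if (*s == '\\')
      *s = '/';
}

/* Lead-byte-aware version: skips both bytes of a two-byte character,
   so only standalone 0x5C bytes are replaced. */
static void
mirror_sjis_aware (char *s)
{
  while (*s != '\0')
    {
      if (sjis_is_lead_byte ((unsigned char) *s) && s[1] != '\0')
        s += 2;
      else
        {
          if (*s == '\\')
            *s = '/';
          s++;
        }
    }
}
```

Given the bytes 'd', 0x5C, 0x95, 0x5C, the naive version rewrites both
0x5C bytes, whereas the lead-byte-aware version rewrites only the first
and leaves the two-byte character intact.
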
> While looking into this, I also discovered that Guile's S-expression
> reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
> encoding, despite the fact that it is meant to support arbitrary
> encodings such as UTF-16 and UTF-32.  I just filed a related bug
> <https://bug.gnu.org/33057> to track this problem.
> 
> These are some of the problems that I'm currently aware of.  I expect
> that this bug report will remain open for a while.
> 
> To begin, I've started working on a patch to change many occurrences of
> 'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
> string clearly originates from a C string literal.
> 

Thanks for the elaboration here, that's helpful for me.

Thanks,
- Tom
