Mike Gran <spk...@yahoo.com> writes:

> I could fix the test by testing only characters 0 to 127 in a C locale
> if a Latin-1 locale can't be found.

Yes, that'd be nice.

> I can also fix the test by using the 'setbinary' function

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (help setbinary)
`setbinary' is a primitive procedure in the (guile) module.

 -- Scheme Procedure: setbinary
     Sets the encoding for the current input, output, and error ports
     to ISO-8859-1.  That character encoding allows ports to operate on
     binary data.

     It also sets the default encoding for newly created ports to
     ISO-8859-1.

     The previous default encoding for new ports is returned
--8<---------------cut here---------------end--------------->8---

It seems to do a lot of things, which aren't clear from the name.  ;-)

What can be done about it?  At least it should be renamed, to
`set-port-binary-mode!' or similar.

Then it'd be nice if that functionality could be split in several
functions, some operating on a per-port basis.  After all, one can
already do:

  (for-each (lambda (p)
              (set-port-encoding! p "ISO-8859-1"))
            (list (current-input-port) (current-output-port)
                  (current-error-port)))

So we just lack:

  ;; encoding for newly created ports
  (set-default-port-encoding! "ISO-8859-1")

With that `setbinary' can be implemented in Scheme.

> to force the encodings on stdin and stdout to a default value that
> will pass through binary data, instead of calling 'setlocale'.

Hmm, I think I'd still prefer `setlocale'.

regexec(3) doesn't say anything about the string encoding.  Do libc
implementations actually expect plain ASCII or Latin-1?  Or do they
adapt to the current locale's encoding?

> I looked in the POSIX spec on Regex for specific advice using 128-255 in
> regex in the C locale.  I didn't see anything offhand.  The spec does
> spend a lot of time talking about the interaction between the locale and
> regular expressions.  I get the impression from the spec that using
> regex on 128-255 in the C locale is an unexpected use of regular
> expressions.

http://www.opengroup.org/onlinepubs/9699919799/functions/regexec.html
reads:

  If, when regexec() is called, the locale is different from when the
  regular expression was compiled, the result is undefined.

It makes me think that, if a process runs with a UTF-8 locale and passes
raw UTF-8 bytes to regcomp(3) and regexec(3), it may work.

Hmm, the program below, with UTF-8-encoded source, works both with a
Latin-1 and a UTF-8 locale:

#include <stdlib.h>
#include <regex.h>
#include <locale.h>

int
main (int argc, char *argv[])
{
  regex_t rx;
  regmatch_t match;

  setlocale (LC_ALL, "fr_FR.utf8");

  regcomp (&rx, "ça", REG_EXTENDED);

  return regexec (&rx, "ça va ?", 1, &match, 0) == 0
    ? EXIT_SUCCESS : EXIT_FAILURE;
}

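
For repeating this experiment with several locales, a slightly more
defensive variant may be useful.  What follows is only a sketch:
`match_in_locale' is a hypothetical helper (it exists neither in libc
nor in Guile), and it assumes that a missing locale or a pattern that
fails to compile should be reported separately from "no match".  It
compiles and executes the regexp under the same locale, which is what
the POSIX wording quoted above requires.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if PATTERN matches SUBJECT, 0 if it does not (or if
   regexec fails for another reason), and -1 if LOCALE cannot be
   installed or PATTERN does not compile.
   Note: this changes the process-wide locale and does not restore it.  */
int
match_in_locale (const char *locale, const char *pattern,
                 const char *subject)
{
  regex_t rx;
  int status;

  /* setlocale returns NULL when the requested locale is unavailable.  */
  if (setlocale (LC_ALL, locale) == NULL)
    return -1;

  /* Compile under the locale that regexec will also run under.  */
  if (regcomp (&rx, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  status = regexec (&rx, subject, 0, NULL, 0);
  regfree (&rx);

  return status == 0 ? 1 : 0;
}
```

Calling it as `match_in_locale ("fr_FR.utf8", "ça", "ça va ?")' would
correspond to the program above; whether that locale is installed
depends on the system, hence the separate -1 result.
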
Do you think it would work to just leave `regexp.test' as it is in 1.8?

Thanks,
Ludo'.