Greetings,

While debugging[0] an issue with Bobot++ (poor sneek!) aborting after
calling scm_regexp_exec on any UTF-8 string, I eventually realized
that... the string was actually single-byte encoded internally. After
following that down the wrong path, I eventually tested `regexp-exec'
with a *valid* Latin-1 string, and that too aborted in
`fixup_multibyte_match'.

I have attached a patch that I think is correct. Instead of
unconditionally calling `fixup_multibyte_match' when wchar_t is
available, it checks whether the Scheme string being matched is
actually a multibyte string. This lets applications that set no
string encoding still match non-ASCII strings.

If you call `setlocale' with any locale, things sort of work. In the
case of "C", non-ASCII characters are escaped upon read, and in the case
of "latin1", `mbrlen' will not reject the char code (AFAICT, I'm not an
expert in this area).
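
For illustration, here is a rough sketch of the kind of byte-to-character
walk that depends on `mbrlen' accepting every byte sequence in the
string. It is not the libguile code: `byte_to_char_offset' is a made-up
helper, and it returns -1 where the real fixup would abort. Whether
`mbrlen' rejects a given non-ASCII byte depends on the current locale
and the C library, so no output is claimed for that case.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libguile: convert a byte offset into
   STR to a character offset under the current locale.  Returns -1 if
   mbrlen rejects a byte sequence, if the string ends before BYTE_OFF,
   or if BYTE_OFF does not fall on a character boundary.  */
static long
byte_to_char_offset (const char *str, size_t byte_off)
{
  mbstate_t state;
  size_t byte_idx = 0;
  long char_idx = 0;

  memset (&state, 0, sizeof state);

  while (byte_idx < byte_off)
    {
      size_t nbytes = mbrlen (str + byte_idx, MB_LEN_MAX, &state);

      /* (size_t) -1: invalid sequence; (size_t) -2: incomplete one.  */
      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        return -1;
      /* 0: reached the terminating NUL before BYTE_OFF.  */
      if (nbytes == 0)
        return -1;

      byte_idx += nbytes;
      char_idx++;
    }

  return byte_idx == byte_off ? char_idx : -1;
}
```

For a pure ASCII string, byte and character offsets coincide, so
byte_to_char_offset ("hello", 3) gives 3 whatever the locale. The
interesting case is a string with bytes >= 0x80 when no suitable locale
has been set, which is where the failure return can be hit.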

Unfortunately this means I don't see an easy way to write a test for the
suite--it only happens in the case where the locale is "C" and no port
encoder is set. <http://paste.lisp.org/display/120245#5> is what I was
going for and will show the bug if run by hand.

I'm not entirely certain this is the *correct* solution, but I think it
should be--it seems bad to abort() applications that use regexps but
haven't set their locale yet!

(My papers for Guile are on file AFAIK FWIW)

[0] http://paste.lisp.org/display/120245

From 61900d7e93780dd9d7d6db02fe3ad07a72a8a45b Mon Sep 17 00:00:00 2001
From: Clinton Ebadi <clin...@unknownlamer.org>
Date: Sat, 5 Mar 2011 23:44:23 -0500
Subject: [PATCH] 2011-03-05  Clinton Ebadi  <clin...@unknownlamer.org>

	* libguile/regex-posix.c (scm_regexp_exec): Only fixup byte to
	character offset when the string is actually multibyte encoded.
---
 libguile/regex-posix.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libguile/regex-posix.c b/libguile/regex-posix.c
index 3423099..db76e36 100644
--- a/libguile/regex-posix.c
+++ b/libguile/regex-posix.c
@@ -305,7 +305,7 @@ SCM_DEFINE (scm_regexp_exec, "regexp-exec", 2, 2, 0,
 		    scm_to_int (flags));
 
 #ifdef HAVE_WCHAR_H
-  if (!status)
+  if ((!status) && (scm_to_int (scm_string_bytes_per_char (substr)) > 1))
     fixup_multibyte_match (matches, nmatches, c_str);
 #endif
 
-- 
1.6.6.1

-- 
Jessie: but today i was a nerd
Jessie: i even read slashdot.