Constant folding

2009-09-08 Thread Ludovic Courtès
Hello!

We should implement constant folding in the tree-il->glil pass.  A naive
implementation looks like this:

diff --git a/module/language/tree-il/compile-glil.scm b/module/language/tree-il/compile-glil.scm
index 86b610f..57a46c8 100644
--- a/module/language/tree-il/compile-glil.scm
+++ b/module/language/tree-il/compile-glil.scm
@@ -22,6 +22,7 @@
   #:use-module (system base syntax)
   #:use-module (system base pmatch)
   #:use-module (system base message)
+  #:use-module (srfi srfi-1)
   #:use-module (ice-9 receive)
   #:use-module (language glil)
   #:use-module (system vm instruction)
@@ -394,20 +395,28 @@
                            (cons (primitive-ref-name proc) (length args)))
                  (hash-ref *primcall-ops* (primitive-ref-name proc))))
         => (lambda (op)
-             (for-each comp-push args)
-             (emit-code src (make-glil-call op (length args)))
-             (case (instruction-pushes op)
-               ((0)
-                (case context
-                  ((tail push vals) (emit-code #f (make-glil-void))))
-                (maybe-emit-return))
-               ((1)
-                (case context
-                  ((drop) (emit-code #f (make-glil-call 'drop 1))))
-                (maybe-emit-return))
-               (else
-                (error "bad primitive op: too many pushes"
-                       op (instruction-pushes op))))))
+             (if (every const? args)
+                 (let* ((proc (module-ref the-scm-module
+                                          (primitive-ref-name proc)))
+                        (args (map const-exp args)))
+                   ;; constant folding
+                   (emit-code src
+                              (make-glil-const (apply proc args))))
+                 (begin
+                   (for-each comp-push args)
+                   (emit-code src (make-glil-call op (length args)))
+                   (case (instruction-pushes op)
+                     ((0)
+                      (case context
+                        ((tail push vals) (emit-code #f (make-glil-void))))
+                      (maybe-emit-return))
+                     ((1)
+                      (case context
+                        ((drop) (emit-code #f (make-glil-call 'drop 1))))
+                      (maybe-emit-return))
+                     (else
+                      (error "bad primitive op: too many pushes"
+                             op (instruction-pushes op))))))))
 
        ;; da capo al fine
        ((and (lexical-ref? proc)

With that we get:

--8<---cut here---start->8---
scheme@(guile-user)> ,c (+ 2 3)
Disassembly of #:

   0    (make-int8 5)    ;; 5

--8<---cut here---end--->8---

instead of:

--8<---cut here---start->8---
scheme@(guile-user)> ,c (+ 2 3)
Disassembly of #:

   0    (make-int8 2)    ;; 2
   2    (make-int8 3)    ;; 3
   4    (add)
   5    (return)

--8<---cut here---end--->8---

but:

--8<---cut here---start->8---
scheme@(guile-user)> ,c (+ 2 (+ 2 3) 4)
Disassembly of #:

   0    (make-int8 2)    ;; 2
   2    (make-int8 5)    ;; 5
   4    (make-int8 4)    ;; 4
   6    (add)
   7    (add)
   8    (return)

--8<---cut here---end--->8---

Thanks,
Ludo’.
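
A possible way to handle the nested case in the last transcript, sketched
in C rather than in the compiler's own Scheme: fold bottom-up, so that a
sub-expression that has already been reduced counts as a constant for its
parent.  The `expr' type, the `K_' kinds and `fold' are hypothetical
stand-ins (they are not tree-il or GLIL), only addition is modelled, and
integer overflow is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression node, not Guile's tree-il: an integer
   constant, an opaque variable, or an n-ary addition.  */
enum kind { K_CONST, K_VAR, K_ADD };

struct expr
{
  enum kind kind;
  long value;           /* meaningful when kind == K_CONST */
  struct expr **args;   /* meaningful when kind == K_ADD */
  size_t nargs;
};

/* Fold E in place, children first.  Return 1 if E is a constant
   afterwards, 0 otherwise.  */
int
fold (struct expr *e)
{
  size_t i;
  int all_const = 1;
  long sum = 0;

  assert (e != NULL);
  if (e->kind == K_CONST)
    return 1;
  if (e->kind == K_VAR)
    return 0;

  for (i = 0; i < e->nargs; i++)
    {
      /* Keep folding the remaining children even after a
         non-constant one has been seen.  */
      if (fold (e->args[i]))
        sum += e->args[i]->value;
      else
        all_const = 0;
    }

  if (!all_const)
    return 0;

  /* Every argument is constant: replace the call by its value.  */
  e->kind = K_CONST;
  e->value = sum;
  e->args = NULL;
  e->nargs = 0;
  return 1;
}
```

For (+ 2 (+ 2 3) 4), the inner node is reduced to 5 first, after which the
outer node sees three constants and is reduced to 11.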


[BDW-GC] Performance comparison for the two GCs

2009-09-08 Thread Ludovic Courtès
Hello!

I finally took the time to re-run the benchmarks used in Hansen's PhD
dissertation [0] and which are under `gc-benchmarks' in the repo.  The
methodology is still the same as before [1].  This corresponds to commits
0e0d97c477b160f193b289b4aabfa73bbaf52e9b (boehm-demers-weiser-gc) and
ce3ed0125fcfb9ad09da815f133a2320102d164c (master).

`run-benchmarks' now produces bars on the right, which show whether (and
how much) BDW-GC is better/worse than Guile's current GC: `+' means
"better" and `-' means "worse".  "Better", here, means one of the
following scenarios:

  A. BDW-GC uses less heap and is faster.
  B. BDW-GC uses less heap and is slower but the outcome is positive
 (e.g., it is twice as slow and the heap is 4 times smaller).
  C. BDW-GC uses more heap and is faster but the outcome is positive
 (e.g., it uses twice as much heap and is 4 times faster).

For details on how the bar length is computed, see [2].

Note that heap size and execution time are the only criteria here.
Other criteria, such as pause time [3], may be of interest depending on
the application, but presumably memory and speed are those most people
care about.
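
As a rough illustration of scenarios A to C, the classification could be
sketched as below.  The actual bar computation is the one referenced in
[2] and is not reproduced here; the rule used for mixed results (the
product of the two ratios must be below 1) is only an assumption, chosen
because it agrees with the two worked examples given under B and C.  The
names `classify' and `outcome' are hypothetical.

```c
#include <assert.h>

/* Hypothetical labels; A, B and C follow the scenarios listed above.  */
enum outcome { WORSE, SCENARIO_A, SCENARIO_B, SCENARIO_C };

/* HEAP_RATIO and TIME_RATIO are the "(N.NNx)" figures: the BDW-GC
   measurement divided by the corresponding Guile measurement.

   ASSUMPTION: a mixed result counts as positive when the product of
   the two ratios is below 1.  This is not the formula from [2].  */
enum outcome
classify (double heap_ratio, double time_ratio)
{
  assert (heap_ratio > 0.0 && time_ratio > 0.0);

  if (heap_ratio < 1.0 && time_ratio < 1.0)
    return SCENARIO_A;          /* less heap and faster */
  if (heap_ratio * time_ratio >= 1.0)
    return WORSE;               /* the loss outweighs the gain */
  return heap_ratio < 1.0 ? SCENARIO_B : SCENARIO_C;
}
```

With this rule, "twice as slow and the heap is 4 times smaller" is
classify (0.25, 2.0), which yields scenario B, and "twice as much heap and
4 times faster" is classify (2.0, 0.25), which yields scenario C.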


1. Large-Heap Benchmarks


benchmark: `./string.scm'
                      heap size (MiB)  execution time (s.)
Guile                 707.12 (1.00x)     7.247 (1.00x)
BDW-GC, FSD=3         258.84 (0.37x)     3.761 (0.52x)  ++
BDW-GC, FSD=6         256.21 (0.36x)     3.101 (0.43x)  ++
BDW-GC, FSD=9         311.58 (0.44x)     3.153 (0.44x)
BDW-GC, FSD=3 incr.   468.72 (0.66x)     3.646 (0.50x)  +
BDW-GC, FSD=3 gene.   488.08 (0.69x)     3.526 (0.49x)  +

benchmark: `./larceny/dynamic.sch'
                      heap size (MiB)  execution time (s.)
Guile                  87.19 (1.00x)    16.578 (1.00x)
BDW-GC, FSD=3         100.35 (1.15x)    15.422 (0.93x)  --
BDW-GC, FSD=6          82.96 (0.95x)    16.608 (1.00x)  +
BDW-GC, FSD=9          74.90 (0.86x)    18.012 (1.09x)  ++
BDW-GC, FSD=3 incr.   108.25 (1.24x)    16.590 (1.00x)
BDW-GC, FSD=3 gene.    97.34 (1.12x)    15.548 (0.94x)  --

benchmark: `./larceny/perm.sch'
                      heap size (MiB)  execution time (s.)
Guile                  28.88 (1.00x)     0.934 (1.00x)
BDW-GC, FSD=3          27.75 (0.96x)     0.824 (0.88x)  +
BDW-GC, FSD=6          30.74 (1.06x)     1.021 (1.09x)  -
BDW-GC, FSD=9          34.77 (1.20x)     1.151 (1.23x)  ---
BDW-GC, FSD=3 incr.    33.55 (1.16x)     0.891 (0.95x)  --
BDW-GC, FSD=3 gene.    28.03 (0.97x)     0.949 (1.02x)

benchmark: `./larceny/graphs.sch'
                      heap size (MiB)  execution time (s.)
Guile                  30.75 (1.00x)   425.026 (1.00x)
BDW-GC, FSD=3          60.08 (1.95x)   159.148 (0.37x)  +
BDW-GC, FSD=6          48.46 (1.58x)   202.383 (0.48x)
BDW-GC, FSD=9          42.31 (1.38x)   249.425 (0.59x)  ++
BDW-GC, FSD=3 incr.   124.86 (4.06x)   165.998 (0.39x)
BDW-GC, FSD=3 gene.    58.80 (1.91x)   156.682 (0.37x)  +

benchmark: `./larceny/gcold.scm'
                      heap size (MiB)  execution time (s.)
Guile                 278.39 (1.00x)   103.479 (1.00x)
BDW-GC, FSD=3         108.50 (0.39x)   258.024 (2.49x)  +
BDW-GC, FSD=6          89.84 (0.32x)   391.099 (3.78x)  -
BDW-GC, FSD=9          76.31 (0.27x)   531.270 (5.13x)
BDW-GC, FSD=3 incr.    91.18 (0.33x)   211.965 (2.05x)  +++
BDW-GC, FSD=3 gene.   107.51 (0.39x)   195.811 (1.89x)

benchmark: `./gcbench.scm'
                      heap size (MiB)  execution time (s.)
Guile                  52.27 (1.00x)    20.900 (1.00x)
BDW-GC, FSD=3          50.75 (0.97x)    14.956 (0.72x)  +
BDW-GC, FSD=6          44.32 (0.85x)    14.742 (0.71x)  +++
BDW-GC, FSD=9          45.05 (0.86x)    15.189 (0.73x)  ++
BDW-GC, FSD=3 incr.    95.84 (1.83x)    18.585 (0.89x)  ---
BDW-GC, FSD=3 gene.    81.64 (1.56x)    17.074 (0.82x)  ---


Scenario A (wins on both criteria) is uncommon (gcbench, perm, string).
Scenario C (faster) is the most common, with the exception of gcold.
BDW-GC is often worse in incremental and generational modes.


2. Small-Heap Benchmarks (< 10 MiB)
---

benchmark: `./larceny/lattice.sch'
                      heap size (MiB)  execution time (s.)
Guile                   3.51 (1.00x)   147.189 (1.00x)
BDW-GC, FSD=3           7.05 (2.01x)    88.755 (0.60x)  ++
BDW-GC, FSD=6           5.14 (1.47x)   100.515 (0.68x)  +
BDW-GC, FSD=9           4.64 (1.32x)   115.124 (0.78x)  +++
BDW-GC, FSD=3 incr.     5.55 (1.58x)   103.257 (0.70x)
BDW-GC, FSD=3 gene.     6.96 (1.99x)    96.383 (0.65x)  +

benchmark: `./larceny/nucleic2.sch'
                      heap size (MiB)  execution time (s.)
Guile                   6.43 (1.00x)    33.663 (1.00x)
BDW-GC, FSD=3           9.04 (1.41x)    23.043 (0.68x)
B

Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7

2009-09-08 Thread Ludovic Courtès
Hello!

"Michael Gran"  writes:

> http://git.savannah.gnu.org/cgit/guile.git/commit/?id=0d05ae7c4b1eddf6257f99f44eaf5cb7b11191be

[...]

> -  return scm_getc (input_port);
> +  return scm_get_byte_or_eof (input_port);

This is actually an earlier change, but the prototype of scm_getc is now
different from that in 1.8.  Presumably, this means that it’s not
source-compatible with 1.8, e.g., on platforms where
sizeof (int) < sizeof (scm_t_wchar), right?

> --- a/libguile/strings.h
> +++ b/libguile/strings.h
> @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM start, SCM end);
>  SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
>  SCM_API SCM scm_string_append (SCM args);
>  
> -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
> +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, 
>   const char *encoding,
>   scm_t_string_failed_conversion_handler 
>   handler);
> @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar *scm_i_string_wide_chars (SCM str);
>  SCM_INTERNAL SCM scm_i_string_start_writing (SCM str);
>  SCM_INTERNAL void scm_i_string_stop_writing (void);
>  SCM_INTERNAL int scm_i_is_narrow_string (SCM str);
> -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x);
> +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x);

Were these changes intended?

> +  (with-locale "en_US.iso88591"
> +(pass-if-exception "no args" exception:wrong-num-args
> +  (regexp-quote))

Is the locale part of the API?  That is, should programs that use
regexps explicitly ask for a locale with 8-bit encoding?

Thanks,
Ludo’.




Re: more compilation failures: -DSCM_DEBUG_TYPING_STRICTNESS=2

2009-09-08 Thread Neil Jerram
l...@gnu.org (Ludovic Courtès) writes:

>>> Anyway, in the meantime, we can conditionalize static initialization
>>> stuff from bdw-gc-static-alloc on STRICTNESS == 0 and keep everyone
>>> happy.
>>>
>>> Does that sound reasonable?
>>
>> Sure.  Actually, STRICTNESS=1 is the default -- 0 makes SCM an
>> integer, 1 makes it a pointer to a struct, which adds a little more
>> type safety, and 2 makes it a union, which breaks casting,
>> initialization, etc.
>
> Oh, right.

I've never used the non-default STRICTNESS either, and static
initialization sounds like a nice feature; so I'm also happy with
blowing away STRICTNESS 2.

Then, given that you (Ken) think that STRICTNESS 0 doesn't work
either, I'd favour hardcoding the STRICTNESS 1 macros and then
discarding the whole STRICTNESS concept.

 Neil
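
For readers who have not met the three levels described in the quoted
text, they can be pictured with the simplified stand-ins below.  These
are not the libguile definitions: `bits_t', `obj0', `obj1', `obj2' and
the pack/unpack helpers are hypothetical names used only to show why a
union representation rules out casts and cast-based static initializers.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t bits_t;       /* stand-in for the raw object word */

/* Level 0: the object is a plain integer, so it mixes silently with
   any other integer.  */
typedef bits_t obj0;

/* Level 1: a pointer to an incomplete struct.  Still a scalar, so
   casts and static initializers work, but using it as an integer
   without a cast draws a diagnostic.  */
struct unused_struct;
typedef struct unused_struct *obj1;

#define PACK1(x)   ((obj1) (bits_t) (x))
#define UNPACK1(x) ((bits_t) (x))

/* Level 2: a union.  C has no cast to or from a union type, so every
   conversion has to go through helpers like these.  */
typedef union { bits_t n; } obj2;

obj2
pack2 (bits_t x)
{
  obj2 o;
  o.n = x;
  return o;
}

bits_t
unpack2 (obj2 o)
{
  return o.n;
}

/* A cast of an integer constant is a valid static initializer for the
   pointer representation.  The union version cannot be initialized
   this way: `pack2 (42)' is a function call, not a constant
   expression.  */
obj1 preinitialized = PACK1 (42);
```

Code that statically initializes objects through a cast-style macro
therefore compiles at levels 0 and 1 but not at level 2, which is the
conflict with static allocation discussed above.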




Re: [BDW-GC] "Inlined" storage; `scm_take_' functions

2009-09-08 Thread Neil Jerram
l...@gnu.org (Ludovic Courtès) writes:

> Hello!

Hi!

> Stringbufs and bytevectors are now always "inlined" in the BDW-GC
> branch [0, 1], which means that there's no cell->buffer indirection,
> which greatly simplifies code (it also takes less room and may slightly
> improve performance).
>
> The `scm_take_' functions for strings/symbols/bytevectors are now
> essentially aliases to the corresponding `scm_from_' because we cannot
> advantageously reuse the provided storage.

That seems a bit of a shame.  (i.e. that we can't advantageously keep
the caller's string or vector data)

Did you consider the option of

- always having an indirection from the stringbuf/bytevector object to
the underlying data

- optimizing the scm_from_... case by doing a single
  scm_gc_malloc_pointerless (), and making the "underlying data
  pointer" point into the same malloc'd block.

The first point should allow a similar simplification of the code as
you have in your commits - by not having to handle both the inline and
indirected cases everywhere - but the indirection would allow us to
keep meaningful scm_take_... functions.

Neil
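
The two options above could look roughly like the following sketch.  It
assumes plain `malloc' as a stand-in for scm_gc_malloc_pointerless (), and
the `strbuf' type and the `strbuf_' functions are hypothetical names, not
libguile's.  Every access goes through `data'; the "from" constructor
still needs only one allocation, while the "take" constructor keeps the
caller's storage without copying.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer object: the characters are always reached
   through DATA, wherever they live.  */
struct strbuf
{
  size_t len;
  char *data;
  int data_is_inline;   /* 1 if DATA points into this same block */
};

/* "from": copy SRC.  A single allocation holds header and characters,
   and DATA points just past the header.  */
struct strbuf *
strbuf_from (const char *src, size_t len)
{
  struct strbuf *sb = malloc (sizeof *sb + len);

  if (sb == NULL)
    return NULL;
  sb->len = len;
  sb->data = (char *) (sb + 1);
  sb->data_is_inline = 1;
  if (len > 0)
    memcpy (sb->data, src, len);
  return sb;
}

/* "take": adopt BUF, which must come from malloc, without copying.  */
struct strbuf *
strbuf_take (char *buf, size_t len)
{
  struct strbuf *sb = malloc (sizeof *sb);

  if (sb == NULL)
    return NULL;
  sb->len = len;
  sb->data = buf;
  sb->data_is_inline = 0;
  return sb;
}

/* Only needed in this malloc-based sketch; a collector would reclaim
   both blocks by itself.  */
void
strbuf_free (struct strbuf *sb)
{
  if (sb == NULL)
    return;
  if (!sb->data_is_inline)
    free (sb->data);
  free (sb);
}
```

Readers of the buffer never need to know which constructor was used,
which is the simplification the first point aims at.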




make check fails if no en_US.iso88591 locale

2009-09-08 Thread Neil Jerram
make check fails for me in regexp.test:

  ...
  Running regexp.test
  guile: uncaught throw to unresolved: ()

because I don't have an en_US.iso88591 locale installed, and so

  (with-locale "en_US.iso88591" ...)

throws an 'unresolved exception.

I can allow make check to complete by changing that line to

  (false-if-exception (with-locale "en_US.iso88591"

but I doubt that's the best fix.  Is the "en_US.iso88591" locale
actually important for the enclosed tests?

Thanks,
Neil
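
One way for a test to find out beforehand whether a locale is installed,
shown here at the C level as a minimal sketch: try to switch to it and
restore the previous setting.  `locale_available' is a hypothetical
helper, not part of Guile or of its test suite; only the standard
setlocale () call is relied upon.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if NAME can be installed as the LC_ALL locale, 0 if not.
   The locale in effect on entry is restored before returning.  */
int
locale_available (const char *name)
{
  const char *current;
  char *saved;
  int ok;

  assert (name != NULL);

  current = setlocale (LC_ALL, NULL);   /* query only, changes nothing */
  if (current == NULL)
    return 0;

  /* Copy the name: a later setlocale call may overwrite the string.  */
  saved = malloc (strlen (current) + 1);
  if (saved == NULL)
    return 0;
  strcpy (saved, current);

  ok = setlocale (LC_ALL, name) != NULL;
  setlocale (LC_ALL, saved);
  free (saved);
  return ok;
}
```

A test that needs "en_US.iso88591" could call locale_available () first
and report itself as unresolved, instead of aborting the whole run, when
the answer is 0.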





GNU Guile meeting notes

2009-09-08 Thread Neil Jerram
Andy, Ludo and I actually met in Real Life (tm) a couple of weeks ago,
and spent a day talking through lots of Guile stuff.  It was fun.

With apologies for the delay (which reflects nothing but my lack of
time), below are my notes from that meeting.

 Neil


* 2.0 release milestones

We discussed and agreed the following items as things that we really
want to get done before the 2.0 release.

Bearing in mind all of the following, we also agreed to move the
tentative 2.0 date from 15th October to 15th December.

** GC

i.e. switching to using Boehm-Demers-Weiser libgc, instead of Guile's
own.

Our plan is to merge this work to master before the next 1.9.x
prerelease.

** Module name spaces

i.e. we'd really like to get rid of the `%app' and
`%module-public-interface' hacks before 2.0.

** Scheme eval

Implementing eval in Scheme, and compiling it, will solve current
problems of bouncing between compiled and interpreted code, and of not
being able to tail-call between them.

** Fix problems with read options and current-reader

The fact that a whole file is read and compiled together - as opposed
to each top level expression being read and then evaluated in turn -
means that these no longer work as they used to.

Plan to fix this by making read-set! etc. a macro.

** GOOPS dispatch in the VM

Look at using polymorphic inline caches here.

Andy adds: "I ran into this issue on wip-eval-cleanup; it
self-compiles until reaching occam-channel, because it's the first
thing that uses GOOPS.

I think this is going to be an early priority for me, along with
cleaning up subr dispatch (so it's easy to get into the VM)."

** Merge most of Andy's array refactoring

Most of this branch has since been merged.  We're still not quite
agreed on the bytevector/SRFI-4 unification; Andy's going to prepare
an updated patch for that.

** ECMAscript test and documentation

We have another working language, so we should show it off!

** Testing the main Guile applications

As much as we can; by encouraging the application developers to test
against the prereleases and/or doing this ourselves.

** Multiple value handling

Largely eliminate the #<values> object.  Allow passing multiple
values to a single value continuation, by using just the first value.

(The compiler does this already; this change is about making the
interpreter do the same thing.)

** Import Guile-Lib

Subject to legal details.


* 2.0 release optional items

We also discussed the following as things that would be nice for 2.0,
but not required.

** Function inlining in the compiler

e.g. inlining (map ... ) calls

** Support for using R6RS libraries

** An FFI

** Completing the MOP for generic functions

** ECMAscript - completing the implementation

We don't yet support the whole standard, and it would be nice (or at
least neater) to do so.

** Enlarge the space of SMOB numbers


* Native compilation

Given a compiler that produces VM code, it's a relatively short extra step
to produce native code instead.  Sassy can help with this.


* Debugging and stack layout

A hassle with VM development has been handling different
representations of debugging information (i.e. in the C eval and in
the VM), and compiling backtraces that may include segments of both
kinds.  With a Scheme eval this becomes much easier; and if we design
the stack layout carefully, the same debug info representation can
work for native code too.


* Elisp

Agreed that we need to implement dynwind directly in the VM, so that
dynamic binding can be efficient.

Need to review and merge Mark Weaver's patches.

Andy is going to merge Daniel's work soon; at that time, most of the
old lang/elisp stuff will be removed (to avoid confusion!).


* Really delete environments.[ch]

There's no plan now to use this code, so we will delete it to avoid
confusion.


* Benchmarking

We'd like to have an easy way of storing off performance results over
time; e.g. `make benchmark' running the benchmarks, and then also
storing the results somewhere persistent.




Re: make check fails if no en_US.iso88591 locale

2009-09-08 Thread Mike Gran
> From: Neil Jerram 
> 
> make check fails for me in regexp.test:
> 
>   ...
>   Running regexp.test
>   guile: uncaught throw to unresolved: ()
> 
> because I don't have an en_US.iso88591 locale installed, and so
> 
>   (with-locale "en_US.iso88591" ...)
> 
> throws an 'unresolved exception.
> 

My bad.  Actually, I should have enclosed the 'with-locale' in the
context of a 'pass-if', which would have caught the exception.

> I can allow make check to complete by changing that line to
> 
>   (false-if-exception (with-locale "en_US.iso88591"
> 
> but I doubt that's the best fix.  Is the "en_US.iso88591" locale
> actually important for the enclosed tests?

It is important.  This is one of the problems with the whole Unicode
effort.  There is no Unicode-capable regex library.  The regexp.test
tries matching all bytes from 0 to 255, and it uses scm_to_locale_string
to prep the string for dispatch to the libc regex calls and
scm_from_locale_string to send them back.  

If the current locale is C or ASCII, bytes above 127 will cause errors.
If the current locale is UTF-8, bytes above 127 will be converted into
multibyte sequences that won't be matched by the regular expression
being tested.  To pass the test in regexp.test, we need to use the
encoding that maps each of the codepoints 0 to 255 to a single-byte
character, which is ISO-8859-1.

So until a better regex comes along, wrapping regex in an
8-bit-clean-friendly locale like Latin-1 is necessary to avoid encoding
errors when encoding arbitrary 8-bit data like the test does.

This problem is cropping up now, and didn't occur before, because the
old scm_to_locale_string was just a stub that passed 8-bit data through
unmodified.

This regex library actually can be used with arbitrary Unicode data,
but it takes extra care.  UTF-8 can be used as the locale, and then
regular expressions must be written keeping in mind that each non-ASCII
character is really a multibyte string.

> 
> Thanks,
>         Neil

Thanks,

Mike
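
The difference between the two encodings for the codepoints 0 to 255 can
be made concrete with the small sketch below.  `encode_latin1' and
`encode_utf8' are hypothetical helpers written for this illustration;
they do not go through the locale machinery or through
scm_to_locale_string.

```c
#include <assert.h>
#include <stddef.h>

/* Encode codepoint CP (0..255) as ISO-8859-1: always one byte.
   Return the number of bytes written to OUT.  */
size_t
encode_latin1 (unsigned cp, unsigned char *out)
{
  assert (cp <= 0xFF);
  out[0] = (unsigned char) cp;
  return 1;
}

/* Encode codepoint CP (0..255) as UTF-8: one byte up to 0x7F, two
   bytes above that.  OUT must have room for two bytes.  */
size_t
encode_utf8 (unsigned cp, unsigned char *out)
{
  assert (cp <= 0xFF);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));    /* lead byte */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F));  /* continuation */
  return 2;
}
```

Codepoint 0xE9, for instance, is the single byte 0xE9 in ISO-8859-1 but
the two bytes 0xC3 0xA9 in UTF-8, so a byte-oriented pattern written to
match one character above 127 sees two units in the UTF-8 case.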




Re: more compilation failures: -DSCM_DEBUG_TYPING_STRICTNESS=2

2009-09-08 Thread Ken Raeburn

On Sep 8, 2009, at 19:37, Neil Jerram wrote:

> Then, given that you (Ken) think that STRICTNESS 0 doesn't work
> either, I'd favour hardcoding the STRICTNESS 1 macros and then
> discarding the whole STRICTNESS concept.


That (0 not working) is only a guess, but I'll try it out to see.

I kind of like keeping STRICTNESS 2, because I'm a bit uncomfortable  
with the amount of casting going on in some places, and the degree to  
which it could (theoretically) be masking actual bugs.  But if opinion  
goes against me I'm willing to work on the changes you describe.


Ken




Re: [Guile-commits] GNU Guile branch, master, updated. release_1-9-2-164-g0d05ae7

2009-09-08 Thread Mike Gran
On Wed, 2009-09-09 at 01:00 +0200, Ludovic Courtès wrote:
> Hello!
> 
> "Michael Gran"  writes:
> 
> > http://git.savannah.gnu.org/cgit/guile.git/commit/?id=0d05ae7c4b1eddf6257f99f44eaf5cb7b11191be
> 
> [...]
> 
> > -  return scm_getc (input_port);
> > +  return scm_get_byte_or_eof (input_port);
> 
> This is actually an earlier change, but the prototype of scm_getc is now
> different from that in 1.8.  Presumably, this means that it’s not
> source-compatible with 1.8, e.g., on platforms where
> sizeof (int) < sizeof (scm_t_wchar), right?

The readline library can't handle UCS-4 codepoints, but, it is capable
of dealing with locale-encoded text.  So, it needs to have the raw bytes
of the locale-encoded characters, and scm_get_byte_or_eof returns the
raw bytes instead of doing the processing necessary to make codepoints.

> 
> > --- a/libguile/strings.h
> > +++ b/libguile/strings.h
> > @@ -111,7 +111,7 @@ SCM_API SCM scm_substring_shared (SCM str, SCM start, SCM end);
> >  SCM_API SCM scm_substring_copy (SCM str, SCM start, SCM end);
> >  SCM_API SCM scm_string_append (SCM args);
> >  
> > -SCM_INTERNAL SCM scm_i_from_stringn (const char *str, size_t len, 
> > +SCM_API SCM scm_i_from_stringn (const char *str, size_t len, 
> >   const char *encoding,
> >   scm_t_string_failed_conversion_handler 
> >   handler);
> > @@ -157,7 +157,7 @@ SCM_INTERNAL const scm_t_wchar *scm_i_string_wide_chars (SCM str);
> >  SCM_INTERNAL SCM scm_i_string_start_writing (SCM str);
> >  SCM_INTERNAL void scm_i_string_stop_writing (void);
> >  SCM_INTERNAL int scm_i_is_narrow_string (SCM str);
> > -SCM_INTERNAL scm_t_wchar scm_i_string_ref (SCM str, size_t x);
> > +SCM_API scm_t_wchar scm_i_string_ref (SCM str, size_t x);
> 
> Were these changes intended?

Well, one of the two of them was intended.  :)

> 
> > +  (with-locale "en_US.iso88591"
> > +(pass-if-exception "no args" exception:wrong-num-args
> > +  (regexp-quote))
> 
> Is the locale part of the API?  That is, should programs that use
> regexps explicitly ask for a locale with 8-bit encoding?

Basically yes. The libc regex is 8-bit, and it uses
scm_to/from_locale_string to convert regex's input and output.

Until libunistring comes with Unicode regex, I think this is the best we
can do.

Thanks,

Mike