UTF-8 and new ports

2008-02-14 Thread Mike Gran
Hi-

Suppose I'm creating a new Guile port type that is
going to use NCurses primitives for input (scm_getc)
and output (display).  NCurses can both receive input
and display output of wide characters, but these
functions operate on 32-bit wide Unicode codepoints,
aka UTF-32.

It seems that port types are inherently 8-bit, right? 
So to make this work, the ports will have to store and
transmit characters as UTF-8 encoded data.  The
'fill_input' function will have to convert UTF-32 to
UTF-8 and then cache them, passing them 1 byte at a
time as requested.  The 'write' function will receive
data 1 byte at a time and buffer it.  It will only
write the character when a complete UTF-32 codepoint
has been received.
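
As a rough sketch of the two conversions described above, the
byte-level logic might look like this in C.  Nothing here is Guile or
NCurses API: the names utf32_to_utf8, struct utf8_accum and
utf8_accum_push are made up for illustration, and hooking them into a
port's 'fill_input' and 'write' functions is not shown.  The decoder
also does not reject overlong encodings, which a real implementation
probably should.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Input side: encode one UTF-32 code point as UTF-8 into out, which
   must have room for 4 bytes.  Returns the number of bytes written,
   or 0 if cp is a surrogate or lies above U+10FFFF.  */
static size_t
utf32_to_utf8 (uint32_t cp, unsigned char *out)
{
  assert (out != NULL);
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not characters */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}

/* Output side: hypothetical state for collecting bytes until a whole
   character has arrived.  Initialize both fields to 0.  */
struct utf8_accum
{
  uint32_t cp;                  /* code point bits collected so far */
  int remaining;                /* continuation bytes still expected */
};

/* Feed one byte.  Returns 1 and stores the code point in *out when a
   character is complete, 0 if more bytes are needed, and -1 on a
   malformed byte (the state is reset in that case).  */
static int
utf8_accum_push (struct utf8_accum *st, unsigned char byte, uint32_t *out)
{
  if (st->remaining == 0)
    {
      if (byte < 0x80)
        {
          *out = byte;
          return 1;
        }
      if ((byte & 0xE0) == 0xC0)
        st->cp = byte & 0x1F, st->remaining = 1;
      else if ((byte & 0xF0) == 0xE0)
        st->cp = byte & 0x0F, st->remaining = 2;
      else if ((byte & 0xF8) == 0xF0)
        st->cp = byte & 0x07, st->remaining = 3;
      else
        return -1;              /* stray continuation or invalid lead */
      return 0;
    }
  if ((byte & 0xC0) != 0x80)
    {
      st->remaining = 0;        /* expected a continuation byte */
      return -1;
    }
  st->cp = (st->cp << 6) | (byte & 0x3F);
  if (--st->remaining > 0)
    return 0;
  *out = st->cp;
  return 1;
}
```

The idea would be for 'fill_input' to call utf32_to_utf8 on each wide
character read and append the bytes to the port's buffer, and for
'write' to push each byte through utf8_accum_push and only hand a
character to the display routine when it returns 1.
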

Sound right?

Has anyone already done this sort of thing?

--
Mike Gran




Re: UTF-8 and new ports

2008-02-14 Thread Stephen Compall
Mike Gran <[EMAIL PROTECTED]> writes:
> It seems that port types are inherently 8-bit, right? 
> So to make this work, the ports will have to store and
> transmit characters as UTF-8 encoded data.  The
> 'fill_input' function will have to convert UTF-32 to
> UTF-8 and then cache them, passing them 1 byte at a
> time as requested.  The 'write' function will receive
> data 1 byte at a time and buffer it.  It will only
> write the character when a complete UTF-32 codepoint
> has been received.

Alternatively, you could assume an 8-bit character set (either taken
from LC_CTYPE, or forced to Latin-1) and recode output to UTF-32.  On
input, characters outside the 8-bit character set could either be
ignored or replaced with nulls or something else convenient (maybe a
space?).  This would be reasonable, as Guile characters are 8-bit anyway.
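
For the forced Latin-1 case the recoding is nearly trivial, since
Latin-1 byte values coincide with the first 256 Unicode code points.
A minimal sketch in C, with made-up helper names (latin1_to_utf32,
utf32_to_latin1) and the substitute character left to the caller; a
character set taken from the locale would need a real conversion
instead:

```c
#include <stdint.h>

/* Output direction: a Latin-1 byte maps directly onto the code point
   with the same value (U+0000..U+00FF).  */
static uint32_t
latin1_to_utf32 (unsigned char byte)
{
  return byte;
}

/* Input direction: code points up to U+00FF pass through unchanged;
   anything else is replaced by the caller's substitute, for example
   a space or a null.  */
static unsigned char
utf32_to_latin1 (uint32_t cp, unsigned char substitute)
{
  return cp <= 0xFF ? (unsigned char) cp : substitute;
}
```

With this approach the port never sees multi-byte sequences, so no
buffering of partial characters is needed on either side.
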

-- 
But you know how reluctant paranormal phenomena are to reveal
themselves when skeptics are present. --Robert Sheaffer, SkI 9/2003