Unicode strings and encodings

Leopold Toetsch Thu, 10 Nov 2005 06:07:08 -0800

I'm currently going through the various string functions and making them usable for all the string encodings we have. It's not finished yet, but a lot already works.


We have:


charsets:   binary, ascii, iso-8859-1, unicode
encodings:  fixed_8, utf8, utf16, ucs2

utf16 is a bit special: it immediately falls back to ucs2 if there are no surrogates in the string.
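
The fallback rule can be illustrated outside Parrot with a short Python sketch. The function name pick_wide_encoding is hypothetical and not part of Parrot; it only shows the test involved, assuming that a string needs surrogates in UTF-16 exactly when it contains a code point above U+FFFF.

```python
def pick_wide_encoding(text: str) -> str:
    """Return "ucs2" if no code point needs a surrogate pair, else "utf16".

    Hypothetical helper for illustration only; this is not Parrot's code.
    """
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    # and can only be represented in UTF-16 as a surrogate pair.
    needs_surrogates = any(ord(ch) > 0xFFFF for ch in text)
    return "utf16" if needs_surrogates else "ucs2"
```

A string made only of BMP characters would be reported as ucs2 here, which is the fixed-width case.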


The default charset is ascii.
The default encoding for (binary,ascii,iso-8859-1) is fixed_8
The default encoding for unicode is utf8.
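
Written as a lookup table, the defaults above look roughly like this. It is a Python sketch for illustration only; the names DEFAULT_CHARSET, DEFAULT_ENCODING and default_encoding are made up here and do not exist in Parrot.

```python
# Defaults as listed above; the names are hypothetical, not Parrot's.
DEFAULT_CHARSET = "ascii"

DEFAULT_ENCODING = {
    "binary": "fixed_8",
    "ascii": "fixed_8",
    "iso-8859-1": "fixed_8",
    "unicode": "utf8",
}


def default_encoding(charset: str = DEFAULT_CHARSET) -> str:
    """Return the default encoding for a charset (KeyError if unknown)."""
    return DEFAULT_ENCODING[charset]
```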

String operations with unicode either return utf8 strings (e.g. concat of utf8 and ascii) or create utf16/ucs2 strings.

Therefore, before a unicode string is sent to some output, it needs conversion to the desired encoding, possibly utf8. There are two ways to achieve this:


   getstdout P0      # get output handle - any ParrotIO PMC will do
   push P0, "utf8"   # push utf8 output filter on layer stack
   # all output to P0 will now be utf8

or

   find_encoding  I0, "utf8"   # or any other valid encoding
   trans_encoding S1, I0       # S1 is now utf8
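
For readers who don't know the Parrot ops, the same two approaches have a rough analogue in Python's standard library. This is only a sketch of the idea, not Parrot code: io.TextIOWrapper stands in for the utf8 output layer, and str.encode stands in for transcoding the string itself.

```python
import io

text = "Gr\u00fc\u00dfe"  # "Grüße", contains non-ASCII characters

# Way 1: put an encoding layer on the output handle.
# Everything written through `out` reaches `raw` as utf8 bytes.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write(text)
out.flush()
via_layer = raw.getvalue()

# Way 2: transcode the string itself before handing it to any output.
via_transcode = text.encode("utf-8")
```

Both ways should produce the same bytes; the difference is whether the conversion belongs to the handle or to the individual string.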

I hope these semantics are sane so far.

leo
