Re: String rationale

Tom Hughes Sat, 27 Oct 2001 07:52:48 -0700

In message <[EMAIL PROTECTED]>
          Tom Hughes <[EMAIL PROTECTED]> wrote:


> Other than that it looked quite good and I'll probably start looking at
> bending the existing code into the new model over the weekend.

Attached is my first pass at this - it's not fully ready yet but
is something for people to cast an eye over before I spend lots of
time going down the wrong path ;-)

The encoding_lookup() and chartype_lookup() routines will obviously
need to load the relevant libraries on the fly when we have support
for that.
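
As a rough sketch of what an intermediate step towards on-the-fly loading
might look like, a loaded library could register its encoding in a table
that the lookup routine searches. Everything named sketch_* below is
hypothetical and not part of the patch, and SKETCH_ENCODING is a cut-down
stand-in for the ENCODING struct that keeps only the name and max_bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for ENCODING: just enough to show the lookup. */
typedef struct {
    const char *name;
    long max_bytes;
} SKETCH_ENCODING;

#define SKETCH_MAX_ENCODINGS 16

/* Registered encodings, in registration order. */
static const SKETCH_ENCODING *sketch_table[SKETCH_MAX_ENCODINGS];
static size_t sketch_count = 0;

/* Linear search by name; returns NULL when nothing is registered
 * under that name. */
static const SKETCH_ENCODING *
sketch_encoding_lookup(const char *name) {
    size_t i;

    for (i = 0; i < sketch_count; i++) {
        if (strcmp(sketch_table[i]->name, name) == 0) {
            return sketch_table[i];
        }
    }

    return NULL;
}

/* Hypothetical hook a loaded library would call once.
 * Returns 0 on success, -1 if the table is full or the name is taken. */
static int
sketch_encoding_register(const SKETCH_ENCODING *enc) {
    assert(enc != NULL && enc->name != NULL);

    if (sketch_count == SKETCH_MAX_ENCODINGS
        || sketch_encoding_lookup(enc->name) != NULL) {
        return -1;
    }

    sketch_table[sketch_count++] = enc;
    return 0;
}

static const SKETCH_ENCODING sketch_singlebyte = { "singlebyte", 1 };
static const SKETCH_ENCODING sketch_utf32 = { "utf32", 4 };
```

With something like this the built-in encodings would be registered at
startup and a miss in the lookup would be the point at which a library
gets loaded.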

The packfile stuff is just a hack to make it work for now. Presumably
we will have to modify the byte code format to record the string types
as names or something so we can look them up properly?
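
One possible shape for recording the types as names is a length-prefixed
name in front of each string constant. The layout below (one length byte
followed by the characters, no terminator) is purely an assumption for
illustration and is not the Parrot byte code format; the sketch_* helpers
are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write name as <length byte><characters>. Returns the number of bytes
 * written, or 0 if the name is too long or the buffer too small. */
static size_t
sketch_pack_name(unsigned char *out, size_t out_size, const char *name) {
    size_t len;

    assert(out != NULL && name != NULL);
    len = strlen(name);

    if (len > 255 || len + 1 > out_size) {
        return 0;
    }

    out[0] = (unsigned char)len;
    memcpy(out + 1, name, len);

    return len + 1;
}

/* Read a name written by sketch_pack_name into a NUL terminated buffer.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or the destination is too small. */
static size_t
sketch_unpack_name(const unsigned char *in, size_t in_size,
                   char *name, size_t name_size) {
    size_t len;

    assert(in != NULL && name != NULL);

    if (in_size < 1) {
        return 0;
    }

    len = in[0];

    if (len + 1 > in_size || len + 1 > name_size) {
        return 0;
    }

    memcpy(name, in + 1, len);
    name[len] = '\0';

    return len + 1;
}
```

The unpacked names could then be passed straight to encoding_lookup() and
chartype_lookup() when a string constant is read.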

String comparison is not language sensitive here - as before it just
compares based on character values.
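
To illustrate what comparing on character values means in the new model,
here is a minimal sketch that walks two buffers one character at a time
through decode and skip_forward style callbacks. It assumes both strings
share a character type, so that the decoded values are comparable.
SKETCH_CODEC and the sketch_* functions are hypothetical simplifications
(const pointers, long instead of INTVAL, no error handling) and are not
the code in string.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two of the ENCODING callbacks, simplified for the sketch. */
typedef struct {
    long (*decode)(const void *ptr);
    const void *(*skip_forward)(const void *ptr, long n);
} SKETCH_CODEC;

static long
sb_decode(const void *ptr) {
    return *(const unsigned char *)ptr;
}

static const void *
sb_skip(const void *ptr, long n) {
    return (const unsigned char *)ptr + n;
}

static long
u32_decode(const void *ptr) {
    return (long)*(const uint32_t *)ptr;
}

static const void *
u32_skip(const void *ptr, long n) {
    return (const uint32_t *)ptr + n;
}

static const SKETCH_CODEC sketch_singlebyte = { sb_decode, sb_skip };
static const SKETCH_CODEC sketch_utf32 = { u32_decode, u32_skip };

/* Compare by character value. Lengths are counted in characters.
 * Returns a negative value, zero or a positive value, like strcmp. */
static int
sketch_compare(const void *a, long alen, const SKETCH_CODEC *aenc,
               const void *b, long blen, const SKETCH_CODEC *benc) {
    assert(aenc != NULL && benc != NULL);

    while (alen > 0 && blen > 0) {
        long ca = aenc->decode(a);
        long cb = benc->decode(b);

        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }

        a = aenc->skip_forward(a, 1);
        b = benc->skip_forward(b, 1);
        alen--;
        blen--;
    }

    /* One string ran out: equal only if both did; otherwise the
     * shorter string sorts first. */
    if (alen == blen) {
        return 0;
    }

    return alen > 0 ? 1 : -1;
}
```

Because the loop only ever sees decoded values, the two strings do not
need to use the same encoding for the comparison to work.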

Other than that I think it's aiming in the right direction and it does
pass all the tests... Please correct me if I'm wrong.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/

# This is a patch for parrot to update it to parrot-ns
# 
# To apply this patch:
# STEP 1: Chdir to the source directory.
# STEP 2: Run the 'applypatch' program with this patch file as input.
#
# If you do not have 'applypatch', it is part of the 'makepatch' package
# that you can fetch from the Comprehensive Perl Archive Network:
# http://www.perl.com/CPAN/authors/Johan_Vromans/makepatch-x.y.tar.gz
# In the above URL, 'x' should be 2 or higher.
#
# To apply this patch without the use of 'applypatch':
# STEP 1: Chdir to the source directory.
# If you have a decent Bourne-type shell:
# STEP 2: Run the shell with this file as input.
# If you don't have such a shell, you may need to manually create/delete
# the files/directories as shown below.
# STEP 3: Run the 'patch' program with this file as input.
#
# These are the commands needed to create/delete files/directories:
#
mkdir 'chartypes'
chmod 0755 'chartypes'
mkdir 'encodings'
chmod 0755 'encodings'
rm -f 'transcode.c'
rm -f 'strutf8.c'
rm -f 'strutf32.c'
rm -f 'strutf16.c'
rm -f 'strnative.c'
rm -f 'include/parrot/transcode.h'
rm -f 'include/parrot/strutf8.h'
rm -f 'include/parrot/strutf32.h'
rm -f 'include/parrot/strutf16.h'
rm -f 'include/parrot/strnative.h'
touch 'chartype.c'
chmod 0644 'chartype.c'
touch 'chartypes/unicode.c'
chmod 0644 'chartypes/unicode.c'
touch 'chartypes/usascii.c'
chmod 0644 'chartypes/usascii.c'
touch 'encoding.c'
chmod 0644 'encoding.c'
touch 'encodings/singlebyte.c'
chmod 0644 'encodings/singlebyte.c'
touch 'encodings/utf16.c'
chmod 0644 'encodings/utf16.c'
touch 'encodings/utf32.c'
chmod 0644 'encodings/utf32.c'
touch 'encodings/utf8.c'
chmod 0644 'encodings/utf8.c'
touch 'include/parrot/chartype.h'
chmod 0644 'include/parrot/chartype.h'
touch 'include/parrot/encoding.h'
chmod 0644 'include/parrot/encoding.h'
#
# This command terminates the shell and need not be executed manually.
exit
#
#### End of Preamble ####

#### Patch data follows ####
diff -c 'parrot/MANIFEST' 'parrot-ns/MANIFEST'
Index: ./MANIFEST
*** ./MANIFEST  Wed Oct 24 22:16:51 2001
--- ./MANIFEST  Sat Oct 27 14:59:43 2001
***************
*** 1,5 ****
--- 1,8 ----
  assemble.pl
  ChangeLog
+ chartype.c
+ chartypes/unicode.c
+ chartypes/usascii.c
  classes/genclass.pl
  classes/intclass.c
  config_h.in
***************
*** 14,19 ****
--- 17,27 ----
  docs/parrotbyte.pod
  docs/strings.pod
  docs/vtables.pod
+ encoding.c
+ encodings/singlebyte.c
+ encodings/utf8.c
+ encodings/utf16.c
+ encodings/utf32.c
  examples/assembly/bsr.pasm
  examples/assembly/call.pasm
  examples/assembly/euclid.pasm
***************
*** 29,34 ****
--- 37,44 ----
  global_setup.c
  hints/mswin32.pl
  hints/vms.pl
+ include/parrot/chartype.h
+ include/parrot/encoding.h
  include/parrot/events.h
  include/parrot/exceptions.h
  include/parrot/global_setup.h
***************
*** 45,55 ****
  include/parrot/runops_cores.h
  include/parrot/stacks.h
  include/parrot/string.h
- include/parrot/strnative.h
- include/parrot/strutf16.h
- include/parrot/strutf32.h
- include/parrot/strutf8.h
- include/parrot/transcode.h
  include/parrot/trace.h
  include/parrot/unicode.h
  interpreter.c
--- 55,60 ----
***************
*** 107,116 ****
  runops_cores.c
  stacks.c
  string.c
- strnative.c
- strutf16.c
- strutf32.c
- strutf8.c
  test_c.in
  test_main.c
  Test/More.pm
--- 112,117 ----
***************
*** 128,134 ****
  t/op/time.t
  t/op/trans.t
  trace.c
- transcode.c
  Types_pm.in
  vtable_h.pl
  vtable.tbl
--- 129,134 ----
diff -c 'parrot/Makefile.in' 'parrot-ns/Makefile.in'
Index: ./Makefile.in
*** ./Makefile.in       Wed Oct 24 19:23:47 2001
--- ./Makefile.in       Sat Oct 27 15:02:45 2001
***************
*** 11,19 ****
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) strnative$(O) \
! strutf8$(O) strutf16$(O) strutf32$(O) transcode$(O) runops_cores$(O) \
! trace$(O) vtable_ops$(O) classes/intclass$(O) resources$(O)
  
  #DO NOT ADD C COMPILER FLAGS HERE
  #Add them in Configure.pl--look for the
--- 11,20 ----
  $(INC)/pmc.h $(INC)/resources.h
  
  O_FILES = global_setup$(O) interpreter$(O) parrot$(O) register$(O) \
! core_ops$(O) memory$(O) packfile$(O) stacks$(O) string$(O) encoding$(O) \
! chartype$(O) runops_cores$(O) trace$(O) vtable_ops$(O) classes/intclass$(O) \
! encodings/singlebyte$(O) encodings/utf8$(O) encodings/utf16$(O) \
! encodings/utf32$(O) chartypes/unicode$(O) chartypes/usascii$(O) resources$(O)
  
  #DO NOT ADD C COMPILER FLAGS HERE
  #Add them in Configure.pl--look for the
diff -c /dev/null 'parrot-ns/chartype.c'
Index: ./chartype.c
*** ./chartype.c        Thu Jan  1 01:00:00 1970
--- ./chartype.c        Sat Oct 27 15:04:03 2001
***************
*** 0 ****
--- 1,39 ----
+ /* chartype.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the string character type subsystem
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ 
+ extern const CHARTYPE usascii_chartype;
+ extern const CHARTYPE unicode_chartype;
+ 
+ const CHARTYPE *
+ chartype_lookup(const char *name) {
+     if (strcmp(name, "usascii") == 0) {
+         return &usascii_chartype;
+     }
+     else if (strcmp(name, "unicode") == 0) {
+         return &unicode_chartype;
+     }
+     else {
+         return NULL;
+     }
+ }
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/chartypes/unicode.c'
Index: ./chartypes/unicode.c
*** ./chartypes/unicode.c       Thu Jan  1 01:00:00 1970
--- ./chartypes/unicode.c       Sat Oct 27 15:02:16 2001
***************
*** 0 ****
--- 1,28 ----
+ /* unicode.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the Unicode character type routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ 
+ const CHARTYPE unicode_chartype = {
+     "unicode",
+     "utf32"
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/chartypes/usascii.c'
Index: ./chartypes/usascii.c
*** ./chartypes/usascii.c       Thu Jan  1 01:00:00 1970
--- ./chartypes/usascii.c       Sat Oct 27 15:01:41 2001
***************
*** 0 ****
--- 1,28 ----
+ /* usascii.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the US-ASCII character type routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ 
+ const CHARTYPE usascii_chartype = {
+     "usascii",
+     "singlebyte"
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/encoding.c'
Index: ./encoding.c
*** ./encoding.c        Thu Jan  1 01:00:00 1970
--- ./encoding.c        Sat Oct 27 15:04:16 2001
***************
*** 0 ****
--- 1,47 ----
+ /* encoding.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the string encoding subsystem
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ 
+ extern const ENCODING singlebyte_encoding;
+ extern const ENCODING utf8_encoding;
+ extern const ENCODING utf16_encoding;
+ extern const ENCODING utf32_encoding;
+ 
+ const ENCODING *
+ encoding_lookup(const char *name) {
+     if (strcmp(name, "singlebyte") == 0) {
+         return &singlebyte_encoding;
+     }
+     else if (strcmp(name, "utf8") == 0) {
+         return &utf8_encoding;
+     }
+     else if (strcmp(name, "utf16") == 0) {
+         return &utf16_encoding;
+     }
+     else if (strcmp(name, "utf32") == 0) {
+         return &utf32_encoding;
+     }
+     else {
+         return NULL;
+     }
+ }
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/encodings/singlebyte.c'
Index: ./encodings/singlebyte.c
*** ./encodings/singlebyte.c    Thu Jan  1 01:00:00 1970
--- ./encodings/singlebyte.c    Sat Oct 27 15:40:40 2001
***************
*** 0 ****
--- 1,75 ----
+ /* singlebyte.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the single byte encoding routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ 
+ typedef unsigned char byte_t;
+ 
+ static INTVAL
+ singlebyte_characters (const void *ptr, INTVAL bytes) {
+     return bytes;
+ }
+ 
+ static INTVAL
+ singlebyte_decode (const void *ptr) {
+     const byte_t *bptr = ptr;
+ 
+     return *bptr;
+ }
+ 
+ static void *
+ singlebyte_encode (void *ptr, INTVAL c) {
+     byte_t *bptr = ptr;
+ 
+     if (c < 0 || c > 255) {
+         INTERNAL_EXCEPTION(INVALID_CHARACTER,
+                            "Invalid character for single byte encoding\n");
+     }
+ 
+     *bptr = c;
+ 
+     return bptr + 1;
+ }
+ 
+ static void *
+ singlebyte_skip_forward (void *ptr, INTVAL n) {
+     byte_t *bptr = ptr;
+ 
+     return bptr + n;
+ }
+ 
+ static void *
+ singlebyte_skip_backward (void *ptr, INTVAL n) {
+     byte_t *bptr = ptr;
+ 
+     return bptr - n;
+ }
+ 
+ const ENCODING singlebyte_encoding = {
+     "singlebyte",
+     1,
+     singlebyte_characters,
+     singlebyte_decode,
+     singlebyte_encode,
+     singlebyte_skip_forward,
+     singlebyte_skip_backward
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/encodings/utf16.c'
Index: ./encodings/utf16.c
*** ./encodings/utf16.c Thu Jan  1 01:00:00 1970
--- ./encodings/utf16.c Sat Oct 27 15:50:34 2001
***************
*** 0 ****
--- 1,143 ----
+ /* utf16.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the UTF-16 encoding routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ #include "parrot/unicode.h"
+ 
+ #if 0
+ typedef unsigned short utf16_t;
+ #endif
+ 
+ static INTVAL
+ utf16_characters (const void *ptr, INTVAL bytes) {
+     const utf16_t *u16ptr = ptr;
+     const utf16_t *u16end = u16ptr + bytes / sizeof(utf16_t);
+     INTVAL characters = 0;
+ 
+     while (u16ptr < u16end) {
+         u16ptr += UTF16SKIP(u16ptr);
+         characters++;
+     }
+ 
+     if (u16ptr > u16end) {
+         INTERNAL_EXCEPTION(MALFORMED_UTF16, "Unaligned end in UTF-16 string\n");
+     }
+ 
+     return characters;
+ }
+ 
+ static INTVAL
+ utf16_decode (const void *ptr) {
+     const utf16_t *u16ptr = ptr;
+     INTVAL c = *u16ptr++;
+ 
+     if (UNICODE_IS_HIGH_SURROGATE(c)) {
+         utf16_t low = *u16ptr++;
+ 
+         if (!UNICODE_IS_LOW_SURROGATE(low)) {
+             INTERNAL_EXCEPTION(MALFORMED_UTF16, "Malformed UTF-16 surrogate\n");
+         }
+ 
+         c = UNICODE_DECODE_SURROGATE(c, low);
+     }
+     else if (UNICODE_IS_LOW_SURROGATE(c)) {
+         INTERNAL_EXCEPTION(MALFORMED_UTF16, "Malformed UTF-16 surrogate\n");
+     }
+ 
+     return c;
+ }
+ 
+ static void *
+ utf16_encode (void *ptr, INTVAL c) {
+     utf16_t *u16ptr = ptr;
+ 
+     if (c < 0 || c > 0x10FFFF || UNICODE_IS_SURROGATE(c)) {
+         INTERNAL_EXCEPTION(INVALID_CHARACTER,
+                            "Invalid character for UTF-16 encoding\n");
+     }
+ 
+     if (c < 0x10000u) {
+         *u16ptr++ = c;
+     }
+     else {
+         *u16ptr++ = UNICODE_HIGH_SURROGATE(c);
+         *u16ptr++ = UNICODE_LOW_SURROGATE(c);
+     }
+ 
+     return u16ptr;
+ }
+ 
+ static void *
+ utf16_skip_forward (void *ptr, INTVAL n) {
+     utf16_t *u16ptr = ptr;
+ 
+     while (n-- > 0) {
+         if (UNICODE_IS_HIGH_SURROGATE(*u16ptr)) {
+             u16ptr++;
+ 
+             if (!UNICODE_IS_LOW_SURROGATE(*u16ptr)) {
+                 INTERNAL_EXCEPTION(MALFORMED_UTF16,
+                                    "Malformed UTF-16 surrogate\n");
+             }
+         }
+         else if (UNICODE_IS_LOW_SURROGATE(*u16ptr)) {
+             INTERNAL_EXCEPTION(MALFORMED_UTF16, "Malformed UTF-16 surrogate\n");
+         }
+ 
+         u16ptr++;
+     }
+ 
+     return u16ptr;
+ }
+ 
+ static void *
+ utf16_skip_backward (void *ptr, INTVAL n) {
+     utf16_t *u16ptr = ptr;
+ 
+     while (n-- > 0) {
+         u16ptr--;
+ 
+         if (UNICODE_IS_LOW_SURROGATE(*u16ptr)) {
+             u16ptr--;
+ 
+             if (!UNICODE_IS_HIGH_SURROGATE(*u16ptr)) {
+                 INTERNAL_EXCEPTION(MALFORMED_UTF16,
+                                    "Malformed UTF-16 surrogate\n");
+             }
+         }
+         else if (UNICODE_IS_HIGH_SURROGATE(*u16ptr)) {
+             INTERNAL_EXCEPTION(MALFORMED_UTF16, "Malformed UTF-16 surrogate\n");
+         }
+     }
+ 
+     return u16ptr;
+ }
+ 
+ const ENCODING utf16_encoding = {
+     "utf16",
+     UTF16_MAXLEN,
+     utf16_characters,
+     utf16_decode,
+     utf16_encode,
+     utf16_skip_forward,
+     utf16_skip_backward
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/encodings/utf32.c'
Index: ./encodings/utf32.c
*** ./encodings/utf32.c Thu Jan  1 01:00:00 1970
--- ./encodings/utf32.c Sat Oct 27 15:51:28 2001
***************
*** 0 ****
--- 1,78 ----
+ /* utf32.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the UTF-32 encoding routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ #include "parrot/unicode.h"
+ 
+ #if 0
+ typedef unsigned long utf32_t;
+ #endif
+ 
+ static INTVAL
+ utf32_characters (const void *ptr, INTVAL bytes) {
+     return bytes / sizeof(utf32_t);
+ }
+ 
+ static INTVAL
+ utf32_decode (const void *ptr) {
+     const utf32_t *u32ptr = ptr;
+ 
+     return *u32ptr;
+ }
+ 
+ static void *
+ utf32_encode (void *ptr, INTVAL c) {
+     utf32_t *u32ptr = ptr;
+ 
+     if (c < 0 || c > 0x10FFFF || UNICODE_IS_SURROGATE(c)) {
+         INTERNAL_EXCEPTION(INVALID_CHARACTER,
+                            "Invalid character for UTF-32 encoding\n");
+     }
+ 
+     *u32ptr = c;
+ 
+     return u32ptr + 1;
+ }
+ 
+ static void *
+ utf32_skip_forward (void *ptr, INTVAL n) {
+     utf32_t *u32ptr = ptr;
+ 
+     return u32ptr + n;
+ }
+ 
+ static void *
+ utf32_skip_backward (void *ptr, INTVAL n) {
+     utf32_t *u32ptr = ptr;
+ 
+     return u32ptr - n;
+ }
+ 
+ const ENCODING utf32_encoding = {
+     "utf32",
+     4,
+     utf32_characters,
+     utf32_decode,
+     utf32_encode,
+     utf32_skip_forward,
+     utf32_skip_backward
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/encodings/utf8.c'
Index: ./encodings/utf8.c
*** ./encodings/utf8.c  Thu Jan  1 01:00:00 1970
--- ./encodings/utf8.c  Sat Oct 27 15:46:08 2001
***************
*** 0 ****
--- 1,140 ----
+ /* utf8.c
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This defines the UTF-8 encoding routines.
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #include "parrot/parrot.h"
+ #include "parrot/unicode.h"
+ 
+ const char Parrot_utf8skip[256] = {
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* ascii */
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* bogus */
+ 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, /* bogus */
+ 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, /* scripts */
+ 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,6,6, /* cjk etc. */
+ };
+ 
+ #if 0
+ typedef unsigned char utf8_t;
+ #endif
+ 
+ static INTVAL
+ utf8_characters (const void *ptr, INTVAL bytes) {
+     const utf8_t *u8ptr = ptr;
+     const utf8_t *u8end = u8ptr + bytes;
+     INTVAL characters = 0;
+ 
+     while (u8ptr < u8end) {
+         u8ptr += UTF8SKIP(u8ptr);
+         characters++;
+     }
+ 
+     if (u8ptr > u8end) {
+         INTERNAL_EXCEPTION(MALFORMED_UTF8, "Unaligned end in UTF-8 string\n");
+     }
+ 
+     return characters;
+ }
+ 
+ static INTVAL
+ utf8_decode (const void *ptr) {
+     const utf8_t *u8ptr = ptr;
+     INTVAL c = *u8ptr;
+ 
+     if (UTF8_IS_START(c)) {
+         INTVAL len = UTF8SKIP(u8ptr);
+         INTVAL count;
+ 
+         c &= UTF8_START_MASK(len);
+         for (count = 1; count < len; count++) {
+             u8ptr++;
+             if (!UTF8_IS_CONTINUATION(*u8ptr)) {
+                 INTERNAL_EXCEPTION(MALFORMED_UTF8,
+                                    "Malformed UTF-8 string\n");
+             }
+             c = UTF8_ACCUMULATE(c, *u8ptr);
+         }
+ 
+         if (UNICODE_IS_SURROGATE(c)) {
+             INTERNAL_EXCEPTION(MALFORMED_UTF8, "Surrogate in UTF-8 string\n");
+         }
+     }
+     else if (!UNICODE_IS_INVARIANT(c)) {
+         INTERNAL_EXCEPTION(MALFORMED_UTF8, "Malformed UTF-8 string\n");
+     }
+ 
+     return c;
+ }
+ 
+ static void *
+ utf8_encode (void *ptr, INTVAL c) {
+     utf8_t *u8ptr = ptr;
+     INTVAL len = UNISKIP(c);
+     utf8_t *u8end = u8ptr + len - 1;
+ 
+     if (c < 0 || c > 0x10FFFF || UNICODE_IS_SURROGATE(c)) {
+         INTERNAL_EXCEPTION(INVALID_CHARACTER,
+                            "Invalid character for UTF-8 encoding\n");
+     }
+ 
+     while (u8end > u8ptr) {
+         *u8end-- = (c & UTF8_CONTINUATION_MASK) | UTF8_CONTINUATION_MARK;
+         c >>= UTF8_ACCUMULATION_SHIFT;
+     }
+     *u8end = (c & UTF8_START_MASK(len)) | UTF8_START_MARK(len);
+ 
+     return u8ptr + len;
+ }
+ 
+ static void *
+ utf8_skip_forward (void *ptr, INTVAL n) {
+     utf8_t *u8ptr = ptr;
+ 
+     while (n-- > 0) {
+         u8ptr += UTF8SKIP(u8ptr);
+     }
+ 
+     return u8ptr;
+ }
+ 
+ static void *
+ utf8_skip_backward (void *ptr, INTVAL n) {
+     utf8_t *u8ptr = ptr;
+ 
+     while (n-- > 0) {
+         u8ptr--;
+         while (UTF8_IS_CONTINUATION(*u8ptr)) u8ptr--;
+     }
+ 
+     return u8ptr;
+ }
+ 
+ const ENCODING utf8_encoding = {
+     "utf8",
+     UTF8_MAXLEN,
+     utf8_characters,
+     utf8_decode,
+     utf8_encode,
+     utf8_skip_forward,
+     utf8_skip_backward
+ };
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c 'parrot/global_setup.c' 'parrot-ns/global_setup.c'
Index: ./global_setup.c
Prereq:  1.6 
*** ./global_setup.c    Mon Oct  8 08:49:10 2001
--- ./global_setup.c    Sat Oct 27 14:44:18 2001
***************
*** 16,30 ****
  
  void
  init_world() {
!     string_init(); /* Set up the string subsystem */ 
!     transcode_init(); /* Set up the transcoding subsystem */
  }
  
  /*
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil 
   * End:
   *
   * vim: expandtab shiftwidth=4:
--- 16,29 ----
  
  void
  init_world() {
!     string_init(); /* Set up the string subsystem */
  }
  
  /*
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil
   * End:
   *
   * vim: expandtab shiftwidth=4:
diff -c /dev/null 'parrot-ns/include/parrot/chartype.h'
Index: ./include/parrot/chartype.h
*** ./include/parrot/chartype.h Thu Jan  1 01:00:00 1970
--- ./include/parrot/chartype.h Sat Oct 27 14:50:39 2001
***************
*** 0 ****
--- 1,34 ----
+ /* chartype.h
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This is the api header for the string character type subsystem
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #if !defined(PARROT_CHARTYPE_H_GUARD)
+ #define PARROT_CHARTYPE_H_GUARD
+ 
+ typedef struct {
+     const char *name;
+     const char *default_encoding;
+ } CHARTYPE;
+ 
+ const CHARTYPE *
+ chartype_lookup(const char *name);
+ 
+ #endif
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c /dev/null 'parrot-ns/include/parrot/encoding.h'
Index: ./include/parrot/encoding.h
*** ./include/parrot/encoding.h Thu Jan  1 01:00:00 1970
--- ./include/parrot/encoding.h Sat Oct 27 15:40:55 2001
***************
*** 0 ****
--- 1,39 ----
+ /* encoding.h
+  *  Copyright: (When this is determined...it will go here)
+  *  CVS Info
+  *     $Id$
+  *  Overview:
+  *     This is the api header for the string encoding subsystem
+  *  Data Structure and Algorithms:
+  *  History:
+  *  Notes:
+  *  References:
+  */
+ 
+ #if !defined(PARROT_ENCODING_H_GUARD)
+ #define PARROT_ENCODING_H_GUARD
+ 
+ typedef struct {
+     const char *name;
+     INTVAL max_bytes;
+     INTVAL (*characters)(const void *ptr, INTVAL bytes);
+     INTVAL (*decode)(const void *ptr);
+     void *(*encode)(void *ptr, INTVAL c);
+     void *(*skip_forward)(void *ptr, INTVAL n);
+     void *(*skip_backward)(void *ptr, INTVAL n);
+ } ENCODING;
+ 
+ const ENCODING *
+ encoding_lookup(const char *name);
+ 
+ #endif
+ 
+ /*
+  * Local variables:
+  * c-indentation-style: bsd
+  * c-basic-offset: 4
+  * indent-tabs-mode: nil
+  * End:
+  *
+  * vim: expandtab shiftwidth=4:
+ */
diff -c 'parrot/include/parrot/exceptions.h' 'parrot-ns/include/parrot/exceptions.h'
Index: ./include/parrot/exceptions.h
Prereq:  1.3 
*** ./include/parrot/exceptions.h       Mon Oct  8 08:49:03 2001
--- ./include/parrot/exceptions.h       Sat Oct 27 12:33:21 2001
***************
*** 20,25 ****
--- 20,26 ----
  #define MALFORMED_UTF8 1
  #define MALFORMED_UTF16 1
  #define MALFORMED_UTF32 1
+ #define INVALID_CHARACTER 1
  
  #endif
  
***************
*** 27,33 ****
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil 
   * End:
   *
   * vim: expandtab shiftwidth=4:
--- 28,34 ----
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil
   * End:
   *
   * vim: expandtab shiftwidth=4:
diff -c 'parrot/include/parrot/parrot.h' 'parrot-ns/include/parrot/parrot.h'
Index: ./include/parrot/parrot.h
Prereq:  1.8 
*** ./include/parrot/parrot.h   Mon Oct 22 22:59:42 2001
--- ./include/parrot/parrot.h   Sat Oct 27 14:51:44 2001
***************
*** 3,9 ****
   *  CVS Info
   *     $Id: parrot.h,v 1.8 2001/10/22 21:43:25 dan Exp $
   *  Overview:
!  *     General header file includes for the parrot interpreter    
   *  Data Structure and Algorithms:
   *  History:
   *  Notes:
--- 3,9 ----
   *  CVS Info
   *     $Id: parrot.h,v 1.8 2001/10/22 21:43:25 dan Exp $
   *  Overview:
!  *     General header file includes for the parrot interpreter
   *  Data Structure and Algorithms:
   *  History:
   *  Notes:
***************
*** 14,20 ****
  #define PARROT_PARROT_H_GUARD
  
  #if defined(INSIDE_GLOBAL_SETUP)
! #define VAR_SCOPE 
  #else
  #define VAR_SCOPE extern
  #endif
--- 14,20 ----
  #define PARROT_PARROT_H_GUARD
  
  #if defined(INSIDE_GLOBAL_SETUP)
! #define VAR_SCOPE
  #else
  #define VAR_SCOPE extern
  #endif
***************
*** 54,61 ****
  
  #include "parrot/global_setup.h"
  #include "parrot/interpreter.h"
  #include "parrot/string.h"
- #include "parrot/transcode.h"
  #include "parrot/vtable.h"
  #include "parrot/register.h"
  #include "parrot/exceptions.h"
--- 54,62 ----
  
  #include "parrot/global_setup.h"
  #include "parrot/interpreter.h"
+ #include "parrot/encoding.h"
+ #include "parrot/chartype.h"
  #include "parrot/string.h"
  #include "parrot/vtable.h"
  #include "parrot/register.h"
  #include "parrot/exceptions.h"
***************
*** 73,79 ****
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil 
   * End:
   *
   * vim: expandtab shiftwidth=4:
--- 74,80 ----
   * Local variables:
   * c-indentation-style: bsd
   * c-basic-offset: 4
!  * indent-tabs-mode: nil
   * End:
   *
   * vim: expandtab shiftwidth=4:
diff -c 'parrot/include/parrot/string.h' 'parrot-ns/include/parrot/string.h'
Index: ./include/parrot/string.h
Prereq:  1.8 
*** ./include/parrot/string.h   Tue Oct 23 00:34:48 2001
--- ./include/parrot/string.h   Sat Oct 27 15:59:35 2001
***************
*** 15,62 ****
  
  #include "parrot/parrot.h"
  
! typedef struct parrot_string STRING;
! typedef struct string_vtable STRING_VTABLE;
! 
! typedef enum {
!     enc_native,
!     enc_utf8,
!     enc_utf16,
!     enc_utf32,
!     enc_foreign,
!     enc_max
! } encoding_t;
! 
! 
! /* String vtable functions */
! 
! typedef INTVAL (*string_to_iv_t)(STRING *);
! typedef STRING* (*string_iv_to_string_t)(STRING *, INTVAL);
! typedef STRING* (*two_strings_iv_to_string_t)(struct Parrot_Interp *, STRING *, STRING *, INTVAL);
! typedef STRING* (*substr_t)(STRING*, INTVAL, INTVAL, STRING*);
! typedef INTVAL (*iv_to_iv_t)(INTVAL);
! typedef INTVAL (*two_strings_to_iv_t)(STRING*, STRING*);
! 
! struct string_vtable {
!     encoding_t which;                   /* What sort of encoding is this? */
!     string_to_iv_t compute_strlen;      /* How long is a piece of string? */
!     iv_to_iv_t max_bytes;               /* I have n characters - how many bytes should I allocate? */
!     two_strings_iv_to_string_t concat;  /* Append string b to the end of string a */
!     string_iv_to_string_t chopn;        /* Remove n characters from the end of a string */
!     substr_t substr;                    /* Substring operation */
!     two_strings_to_iv_t compare;        /* Compare operation */
! };
! 
! struct parrot_string {
      void *bufstart;
      INTVAL buflen;
      INTVAL bufused;
      INTVAL flags;
      INTVAL strlen;
!     STRING_VTABLE* encoding;
!     INTVAL type;
      INTVAL lanugage;
! };
  
  
  /* Declarations of accessors */
--- 15,30 ----
  
  #include "parrot/parrot.h"
  
! typedef struct {
      void *bufstart;
      INTVAL buflen;
      INTVAL bufused;
      INTVAL flags;
      INTVAL strlen;
!     const ENCODING *encoding;
!     const CHARTYPE *type;
      INTVAL lanugage;
! } STRING;
  
  
  /* Declarations of accessors */
***************
*** 82,99 ****
  void
  string_destroy(STRING* s);
  STRING*
! string_make(struct Parrot_Interp *interpreter, void *buffer, INTVAL buflen, INTVAL encoding, INTVAL flags, INTVAL type);
  STRING*
  string_copy(struct Parrot_Interp *interpreter, STRING *i);
  void
  string_init(void);
  
- VAR_SCOPE STRING_VTABLE Parrot_string_vtable[enc_max];
- 
- #include "parrot/strnative.h"
- #include "parrot/strutf8.h"
- #include "parrot/strutf16.h"
- #include "parrot/strutf32.h"
  #endif
  
  /*
--- 50,63 ----
  void
  string_destroy(STRING* s);
  STRING*
! string_make(struct Parrot_Interp *interpreter, void *buffer, INTVAL buflen, const ENCODING *encoding, INTVAL flags, const CHARTYPE *type);
  STRING*
  string_copy(struct Parrot_Interp *interpreter, STRING *i);
+ STRING*
+ string_transcode(struct Parrot_Interp *interpreter, STRING *src, const ENCODING *encoding, const CHARTYPE *type, STRING *dest);
  void
  string_init(void);
  
  #endif
  
  /*
diff -c 'parrot/packfile.c' 'parrot-ns/packfile.c'
Index: ./packfile.c
Prereq:  1.14 
*** ./packfile.c        Mon Oct 22 22:59:42 2001
--- ./packfile.c        Sat Oct 27 15:59:06 2001
***************
*** 183,189 ****
  
  ***************************************/
  
! void 
  PackFile_set_magic(struct PackFile * self, opcode_t magic) {
      self->magic = magic;
  }
--- 183,189 ----
  
  ***************************************/
  
! void
  PackFile_set_magic(struct PackFile * self, opcode_t magic) {
      self->magic = magic;
  }
***************
*** 330,336 ****
  #if TRACE_PACKFILE
      printf("PackFile_unpack(): Unpacking %ld bytes for fixup table...\n", segment_size);
  #endif
!     
      if (segment_size % sizeof(opcode_t)) {
          fprintf(stderr, "PackFile_unpack: Illegal fixup table segment size %d (must be multiple of %d)!\n",
              segment_size, sizeof(opcode_t));
--- 330,336 ----
  #if TRACE_PACKFILE
      printf("PackFile_unpack(): Unpacking %ld bytes for fixup table...\n", segment_size);
  #endif
! 
      if (segment_size % sizeof(opcode_t)) {
          fprintf(stderr, "PackFile_unpack: Illegal fixup table segment size %d (must be multiple of %d)!\n",
              segment_size, sizeof(opcode_t));
***************
*** 355,367 ****
  #if TRACE_PACKFILE
      printf("PackFile_unpack(): Unpacking %ld bytes for constant table...\n", segment_size);
  #endif
!     
      if (segment_size % sizeof(opcode_t)) {
          fprintf(stderr, "PackFile_unpack: Illegal constant table segment size %d (must be multiple of %d)!\n",
              segment_size, sizeof(opcode_t));
          return 0;
      }
!     
      if (!PackFile_ConstTable_unpack(interpreter, self->const_table, cursor, segment_size)) {
          fprintf(stderr, "PackFile_unpack: Error reading constant table segment!\n");
          return 0;
--- 355,367 ----
  #if TRACE_PACKFILE
      printf("PackFile_unpack(): Unpacking %ld bytes for constant table...\n", segment_size);
  #endif
! 
      if (segment_size % sizeof(opcode_t)) {
          fprintf(stderr, "PackFile_unpack: Illegal constant table segment size %d (must be multiple of %d)!\n",
              segment_size, sizeof(opcode_t));
          return 0;
      }
! 
      if (!PackFile_ConstTable_unpack(interpreter, self->const_table, cursor, segment_size)) {
          fprintf(stderr, "PackFile_unpack: Error reading constant table segment!\n");
          return 0;
***************
*** 391,397 ****
              self->byte_code_size = 0;
              return 0;
          }
!      
          mem_sys_memcopy(self->byte_code, cursor, self->byte_code_size);
      }
  
--- 391,397 ----
              self->byte_code_size = 0;
              return 0;
          }
! 
          mem_sys_memcopy(self->byte_code, cursor, self->byte_code_size);
      }
  
***************
*** 486,492 ****
      op_ptr = (opcode_t *)cursor;
      *op_ptr = const_table_size;
      cursor += sizeof(opcode_t);
!     
      PackFile_ConstTable_pack(self->const_table, cursor);
      cursor += const_table_size;
  
--- 486,492 ----
      op_ptr = (opcode_t *)cursor;
      *op_ptr = const_table_size;
      cursor += sizeof(opcode_t);
! 
      PackFile_ConstTable_pack(self->const_table, cursor);
      cursor += const_table_size;
  
***************
*** 932,938 ****
  #if TRACE_PACKFILE
      printf("PackFile_ConstTable_unpack(): Unpacking %ld constants...\n", self->const_count);
  #endif
!     
      if (self->const_count == 0) {
          return 1;
      }
--- 932,938 ----
  #if TRACE_PACKFILE
      printf("PackFile_ConstTable_unpack(): Unpacking %ld constants...\n", self->const_count);
  #endif
! 
      if (self->const_count == 0) {
          return 1;
      }
***************
*** 956,962 ****
  
          cursor += PackFile_Constant_pack_size(self->constants[i]);
      }
!     
      return 1;
  }
  
--- 956,962 ----
  
          cursor += PackFile_Constant_pack_size(self->constants[i]);
      }
! 
      return 1;
  }
  
***************
*** 1027,1033 ****
  
          cursor += PackFile_Constant_pack_size(self->constants[i]);
      }
!     
      return;
  }
  
--- 1027,1033 ----
  
          cursor += PackFile_Constant_pack_size(self->constants[i]);
      }
! 
      return;
  }
  
***************
*** 1055,1061 ****
          printf("    # %d:\n", i);
          PackFile_Constant_dump(self->constants[i]);
      }
!     
      return;
  }
  
--- 1055,1061 ----
          printf("    # %d:\n", i);
          PackFile_Constant_dump(self->constants[i]);
      }
! 
      return;
  }
  
***************
*** 1483,1489 ****
  #endif
  
      self->type   = PFC_STRING;
!     self->string = string_make(interpreter, cursor, size, encoding, flags, type);
  
      return 1;
  }
--- 1483,1497 ----
  #endif
  
      self->type   = PFC_STRING;
!     if (encoding == 0) {
!         self->string = string_make(interpreter, cursor, size, NULL, flags, NULL); /* fixme */
!     }
!     else if (encoding == 3) {
!         self->string = string_make(interpreter, cursor, size, encoding_lookup("utf32"), flags, chartype_lookup("unicode")); /* fixme */
!     }
!     else {
!         return 0;
!     }
  
      return 1;
  }
***************
*** 1514,1528 ****
          case PFC_NONE:
              packed_size = 0;
              break;
!  
          case PFC_INTEGER:
              packed_size = sizeof(opcode_t);
              break;
!  
          case PFC_NUMBER:
              packed_size = sizeof(FLOATVAL);
              break;
!  
          case PFC_STRING:
              padded_size = self->string->bufused;
  
--- 1522,1536 ----
          case PFC_NONE:
              packed_size = 0;
              break;
! 
          case PFC_INTEGER:
              packed_size = sizeof(opcode_t);
              break;
! 
          case PFC_NUMBER:
              packed_size = sizeof(FLOATVAL);
              break;
! 
          case PFC_STRING:
              padded_size = self->string->bufused;
  
***************
*** 1629,1639 ****
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
!             *op_ptr = self->string->encoding->which;
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
!             *op_ptr = self->string->type;
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
--- 1637,1654 ----
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
!             if (strcmp(self->string->type->name, "usascii") == 0 &&
!                 strcmp(self->string->encoding->name, "singlebyte") == 0 ) {
!                 *op_ptr = 0; /* fixme */
!             }
!             else if (strcmp(self->string->type->name, "unicode") == 0 &&
!                      strcmp(self->string->encoding->name, "utf32") == 0 ) {
!                 *op_ptr = 3; /* fixme */
!             }
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
!             *op_ptr = 0; /* fixme */
              cursor += sizeof(opcode_t);
  
              op_ptr  = (opcode_t *)cursor;
***************
*** 1695,1709 ****
          case PFC_STRING:
              printf("    [ 'PFC_STRING', {\n");
              printf("        FLAGS    => 0x%04x,\n", self->string->flags);
!             printf("        ENCODING => %ld,\n", 
!                     (long) self->string->encoding->which);
!             printf("        TYPE     => %ld,\n",  
!                     (long) self->string->type);
!             printf("        SIZE     => %ld,\n",  
                      (long) self->string->bufused);
              /* TODO: Won't do anything reasonable for most encodings */
!             printf("        DATA     => '%.*s'\n",  
!                     self->string->bufused, (char *) self->string->bufstart); 
              printf("    } ],\n");
              break;
  
--- 1710,1724 ----
          case PFC_STRING:
              printf("    [ 'PFC_STRING', {\n");
              printf("        FLAGS    => 0x%04x,\n", self->string->flags);
!             printf("        ENCODING => %s,\n",
!                     self->string->encoding->name);
!             printf("        TYPE     => %s,\n",
!                     self->string->type->name);
!             printf("        SIZE     => %ld,\n",
                      (long) self->string->bufused);
              /* TODO: Won't do anything reasonable for most encodings */
!             printf("        DATA     => '%.*s'\n",
!                     self->string->bufused, (char *) self->string->bufstart);
              printf("    } ],\n");
              break;
  
***************
*** 1729,1735 ****
  * Local variables:
  * c-indentation-style: bsd
  * c-basic-offset: 4
! * indent-tabs-mode: nil 
  * End:
  *
  * vim: expandtab shiftwidth=4:
--- 1744,1750 ----
  * Local variables:
  * c-indentation-style: bsd
  * c-basic-offset: 4
! * indent-tabs-mode: nil
  * End:
  *
  * vim: expandtab shiftwidth=4:
diff -c 'parrot/string.c' 'parrot-ns/string.c'
Index: ./string.c
Prereq:  1.15 
*** ./string.c  Tue Oct 23 00:34:47 2001
--- ./string.c  Sat Oct 27 16:15:03 2001
***************
*** 12,17 ****
--- 12,20 ----
  
  #include "parrot/parrot.h"
  
+ static const CHARTYPE *string_native_type;
+ static const CHARTYPE *string_unicode_type;
+ 
  /* Basic string stuff - creation, enlargement, destruction, etc. */
  
  /*=for api string string_init
***************
*** 19,28 ****
   */
  void
  string_init(void) {
!     Parrot_string_vtable[enc_native] = string_native_vtable();
!     Parrot_string_vtable[enc_utf8] = string_utf8_vtable();
!     Parrot_string_vtable[enc_utf16] = string_utf16_vtable();
!     Parrot_string_vtable[enc_utf32] = string_utf32_vtable();
  }
  
  /*=for api string string_make
--- 22,29 ----
   */
  void
  string_init(void) {
!     string_native_type = chartype_lookup("usascii");
!     string_unicode_type = chartype_lookup("unicode");
  }
  
  /*=for api string string_make
***************
*** 30,44 ****
   * and compute its string length
   */
  STRING *
! string_make(struct Parrot_Interp *interpreter, void *buffer, INTVAL buflen, INTVAL encoding, INTVAL flags, INTVAL type) {
      STRING *s = new_string_header(interpreter);
      s->bufstart = mem_sys_allocate(buflen);
      mem_sys_memcopy(s->bufstart, buffer, buflen);
!     s->encoding = &(Parrot_string_vtable[encoding]);
      s->buflen = s->bufused = buflen;
      s->flags = flags;
      string_compute_strlen(s);
      s->type = type;
      return s;
  }
  
--- 31,55 ----
   * and compute its string length
   */
  STRING *
! string_make(struct Parrot_Interp *interpreter, void *buffer, INTVAL buflen, const ENCODING *encoding, INTVAL flags, const CHARTYPE *type) {
      STRING *s = new_string_header(interpreter);
+ 
+     if (!type) {
+         type = string_native_type;
+     }
+ 
+     if (!encoding) {
+         encoding = encoding_lookup(type->default_encoding);
+     }
+ 
      s->bufstart = mem_sys_allocate(buflen);
      mem_sys_memcopy(s->bufstart, buffer, buflen);
!     s->encoding = encoding;
      s->buflen = s->bufused = buflen;
      s->flags = flags;
      string_compute_strlen(s);
      s->type = type;
+ 
      return s;
  }
  
***************
*** 77,95 ****
   */
  STRING*
  string_copy(struct Parrot_Interp *interpreter, STRING *s) {
!     return string_make(interpreter, s->bufstart, s->buflen, s->encoding->which, s->flags, s->type);
  }
  
! /* vtable despatch functions */
  
! #define ENC_VTABLE(x) x->encoding
  
  /*=for api string string_compute_strlen
   * get the string length of the string
   */
  INTVAL
  string_compute_strlen(STRING* s) {
!     return (s->strlen = (ENC_VTABLE(s)->compute_strlen)(s));
  }
  
  /*=for api string string_max_bytes
--- 88,113 ----
   */
  STRING*
  string_copy(struct Parrot_Interp *interpreter, STRING *s) {
!     return string_make(interpreter, s->bufstart, s->buflen, s->encoding, s->flags, s->type);
  }
  
! STRING*
! string_transcode(struct Parrot_Interp *interpreter, STRING *src, const ENCODING *encoding, const CHARTYPE *type, STRING *dest) {
!     if (!dest) {
!         dest = string_make(interpreter, NULL, 0, encoding, 0, type);
!     }
!     return dest;
! }
  
! /* vtable despatch functions */
  
  /*=for api string string_compute_strlen
   * get the string length of the string
   */
  INTVAL
  string_compute_strlen(STRING* s) {
!     s->strlen = s->encoding->characters(s->bufstart, s->bufused);
!     return s->strlen;
  }
  
  /*=for api string string_max_bytes
***************
*** 97,103 ****
   */
  INTVAL
  string_max_bytes(STRING* s, INTVAL iv) {
!     return (ENC_VTABLE(s)->max_bytes)(iv);
  }
  
  /*=for api string string_concat
--- 115,121 ----
   */
  INTVAL
  string_max_bytes(STRING* s, INTVAL iv) {
!     return iv * s->encoding->max_bytes;
  }
  
  /*=for api string string_concat
***************
*** 105,111 ****
   */
  STRING*
  string_concat(struct Parrot_Interp *interpreter, STRING* a, STRING* b, INTVAL flags) {
!     return (ENC_VTABLE(a)->concat)(interpreter, a, b, flags);
  }
  
  /*=for api string string_substr
--- 123,136 ----
   */
  STRING*
  string_concat(struct Parrot_Interp *interpreter, STRING* a, STRING* b, INTVAL flags) {
!     if (a->type != b->type || a->encoding != b->encoding) {
!         b = string_transcode(interpreter, b, a->encoding, a->type, NULL);
!     }
!     string_grow(a, a->strlen + b->strlen);
!     mem_sys_memcopy((void*)((ptrcast_t)a->bufstart + a->bufused), b->bufstart, b->bufused);
!     a->strlen = a->strlen + b->strlen;
!     a->bufused = a->bufused + b->bufused;
!     return a;
  }
  
  /*=for api string string_substr
***************
*** 115,120 ****
--- 140,147 ----
  STRING*
  string_substr(struct Parrot_Interp *interpreter, STRING* src, INTVAL offset, INTVAL length, STRING** d) {
      STRING *dest;
+     char *substart;
+     char *subend;
      if (offset < 0) {
          offset = src->strlen + offset;
      }
***************
*** 129,140 ****
          length = src->strlen - offset;
      }
      if (!d || !*d) {
!         dest = string_make(interpreter, NULL, 0, src->encoding->which, 0, 0);
      }
      else {
          dest = *d;
      }
!     return (ENC_VTABLE(src)->substr)(src, offset, length, dest);
  }
  
  /*=for api string string_chopn
--- 156,173 ----
          length = src->strlen - offset;
      }
      if (!d || !*d) {
!         dest = string_make(interpreter, NULL, 0, src->encoding, 0, src->type);
      }
      else {
          dest = *d;
      }
!     substart = src->encoding->skip_forward(src->bufstart, offset);
!     subend = src->encoding->skip_forward(substart, length);
!     string_grow(dest, length);
!     mem_sys_memcopy(dest->bufstart, substart, subend - substart);
!     dest->bufused = subend - substart;
!     dest->strlen = length;
!     return dest;
  }
  
  /*=for api string string_chopn
***************
*** 142,154 ****
   */
  STRING*
  string_chopn(STRING* s, INTVAL n) {
      if (n > s->strlen) {
          n = s->strlen;
      }
      if (n < 0) {
          n = 0;
      }
!     return (ENC_VTABLE(s)->chopn)(s, n);
  }
  
  /*=for api string string_compare
--- 175,192 ----
   */
  STRING*
  string_chopn(STRING* s, INTVAL n) {
+     char *bufstart = s->bufstart;
+     char *bufend = bufstart + s->bufused;
      if (n > s->strlen) {
          n = s->strlen;
      }
      if (n < 0) {
          n = 0;
      }
!     bufend = s->encoding->skip_backward(bufend, n);
!     s->bufused = bufend - bufstart;
!     s->strlen = s->strlen - n;
!     return s;
  }
  
  /*=for api string string_compare
***************
*** 156,171 ****
   */
  INTVAL
  string_compare(struct Parrot_Interp *interpreter, STRING* s1, STRING* s2) {
!     if (s1->encoding != s2->encoding) {
!         if (s1->encoding->which != enc_utf32) {
!             s1 = Parrot_transcode_table[s1->encoding->which][enc_utf32](interpreter, s1, NULL);
!         }
!         if (s2->encoding->which != enc_utf32) {
!             s2 = Parrot_transcode_table[s2->encoding->which][enc_utf32](interpreter, s2, NULL);
!         }
      }
  
!     return (ENC_VTABLE(s1)->compare)(s1, s2);
  }
  
  /*
--- 194,229 ----
   */
  INTVAL
  string_compare(struct Parrot_Interp *interpreter, STRING* s1, STRING* s2) {
!     char *s1start;
!     char *s1end;
!     char *s2start;
!     char *s2end;
!     INTVAL cmp = 0;
! 
!     if (s1->type != s2->type || s1->encoding != s2->encoding) {
!         s1 = string_transcode(interpreter, s1, NULL, string_unicode_type, NULL);
!         s2 = string_transcode(interpreter, s2, NULL, string_unicode_type, NULL);
      }
  
!     s1start = s1->bufstart;
!     s1end = s1start + s1->bufused;
!     s2start = s2->bufstart;
!     s2end = s2start + s2->bufused;
! 
!     while (cmp == 0 && s1start < s1end && s2start < s2end) {
!         INTVAL c1 = s1->encoding->decode(s1start);
!         INTVAL c2 = s2->encoding->decode(s2start);
! 
!         cmp = c1 - c2;
! 
!         s1start = s1->encoding->skip_forward(s1start, 1);
!         s2start = s2->encoding->skip_forward(s2start, 1);
!     }
! 
!     if (cmp == 0 && s1start < s1end) cmp = 1;
!     if (cmp == 0 && s2start < s2end) cmp = -1;
! 
!     return cmp;
  }
  
  /*
#### End of Patch data ####

#### ApplyPatch data follows ####
# Data version        : 1.0
# Date generated      : Sat Oct 27 16:16:36 2001
# Generated by        : makepatch 2.00_05
# Recurse directories : Yes
# r 'transcode.c' 9177 0
# r 'strutf8.c' 5655 0
# r 'strutf32.c' 3310 0
# r 'strutf16.c' 4751 0
# r 'strnative.c' 3401 0
# r 'include/parrot/transcode.h' 740 0
# r 'include/parrot/strutf8.h' 549 0
# r 'include/parrot/strutf32.h' 555 0
# r 'include/parrot/strutf16.h' 555 0
# r 'include/parrot/strnative.h' 560 0
# p 'MANIFEST' 2693 1004191183 0100644
# p 'Makefile.in' 3479 1004191365 0100644
# c 'chartype.c' 0 1004191443 0100644
# c 'chartypes/unicode.c' 0 1004191336 0100644
# c 'chartypes/usascii.c' 0 1004191301 0100644
# c 'encoding.c' 0 1004191456 0100644
# c 'encodings/singlebyte.c' 0 1004193640 0100644
# c 'encodings/utf16.c' 0 1004194234 0100644
# c 'encodings/utf32.c' 0 1004194288 0100644
# c 'encodings/utf8.c' 0 1004193968 0100644
# p 'global_setup.c' 722 1004190258 0100644
# c 'include/parrot/chartype.h' 0 1004190639 0100644
# c 'include/parrot/encoding.h' 0 1004193655 0100644
# p 'include/parrot/exceptions.h' 721 1004182401 0100644
# p 'include/parrot/parrot.h' 1551 1004190704 0100644
# p 'include/parrot/string.h' 2835 1004194775 0100644
# p 'packfile.c' 36570 1004194746 0100644
# p 'string.c' 4706 1004195703 0100644
# C 'chartypes' 0 1004191390 040755
# C 'encodings' 0 1004194299 040755
#### End of ApplyPatch data ####

#### End of Patch kit [created: Sat Oct 27 16:16:36 2001] ####
#### Patch checksum: 1730 44890 18503 ####
#### Checksum: 1792 46862 55736 ####
