>    struct perl_string {
>      void *string_buffer;
>      UV length;
>      UV allocated;
>      UV flags;
>    };
> 
> The low three bits of the flags field are reserved for the type of the
> string. The various types are:
> 
> =over 4
> 
> =item BINARY (0)
> 
> =item ASCII (1)
> 
> =item EBCDIC (2)
> 
> =item UTF_8 (3)
> 
> =item UTF_32 (4)
> 
> =item NATIVE_1 (5) through NATIVE_3 (7)
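
To make the quoted layout concrete, here is a minimal sketch of how the type tag could be read from and written to the low three bits of flags. The enum names, the STR_TYPE_MASK macro, the two helper functions, and the use of unsigned long as a stand-in for UV are all assumptions for illustration; none of them come from the proposal itself.

```c
/* Stand-in for Perl's UV (assumption: an unsigned integer type). */
typedef unsigned long UV;

struct perl_string {
    void *string_buffer;
    UV length;
    UV allocated;
    UV flags;
};

/* Type tags as listed in the quoted text; the STR_ prefix is hypothetical. */
enum perl_string_type {
    STR_BINARY   = 0,
    STR_ASCII    = 1,
    STR_EBCDIC   = 2,
    STR_UTF_8    = 3,
    STR_UTF_32   = 4,
    STR_NATIVE_1 = 5,
    STR_NATIVE_2 = 6,
    STR_NATIVE_3 = 7
};

/* Mask covering the low three bits of flags. */
#define STR_TYPE_MASK ((UV)0x7)

/* Read the type tag without touching the other flag bits. */
static enum perl_string_type string_type(const struct perl_string *s)
{
    return (enum perl_string_type)(s->flags & STR_TYPE_MASK);
}

/* Replace the type tag while preserving every other flag bit. */
static void set_string_type(struct perl_string *s, enum perl_string_type t)
{
    s->flags = (s->flags & ~STR_TYPE_MASK) | ((UV)t & STR_TYPE_MASK);
}
```

Keeping the tag behind a mask leaves the remaining bits of flags free for other uses.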

Some thoughts about string encoding. Because of Unicode normalization
and canonical equivalence, some characters that take one code point
in one encoding may take two or more code points in another encoding,
mainly vowels with diacritics. As a result, substr() may give
different results for the same text depending on the string's current
encoding.

Here is an example: "résumé" takes 6 characters in Latin-1, but
could take 8 code points in Unicode if each "é" is stored as "e" followed
by a combining acute accent. All Perl functions that deal directly with
character position and length will be sensitive to the encoding.
I wonder how we should handle this case.
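
The difference can be checked with a small C sketch. The function utf8_code_points and the three string constants are hypothetical names for illustration, and the counter assumes its input is valid UTF-8.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* "résumé" in Latin-1: one byte per character, 0xE9 is e-acute. */
static const char resume_latin1[] = "r\xE9sum\xE9";

/* "résumé" in UTF-8, precomposed: U+00E9 encodes as C3 A9. */
static const char resume_nfc[] = "r\xC3\xA9sum\xC3\xA9";

/* "résumé" in UTF-8, decomposed: 'e' followed by U+0301 (CC 81). */
static const char resume_nfd[] = "re\xCC\x81sume\xCC\x81";
```

The Latin-1 string is 6 bytes long, utf8_code_points(resume_nfc) gives 6, and utf8_code_points(resume_nfd) gives 8, so a position-based function such as substr() would see different offsets for what a reader considers the same word.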

Hong

