Author: simon
Date: Sat Jan 19 21:55:18 2008
New Revision: 25030

Added:
   trunk/docs/pdds/draft/pdd28_character_sets.pod   (contents, props changed)

Changes in other areas also in this revision:
Modified:
   trunk/MANIFEST
   trunk/MANIFEST.SKIP

Log:
[docs] A start on the charsets PDD. Will write more over the coming week.


Added: trunk/docs/pdds/draft/pdd28_character_sets.pod
==============================================================================
--- (empty file)
+++ trunk/docs/pdds/draft/pdd28_character_sets.pod      Sat Jan 19 21:55:18 2008
@@ -0,0 +1,98 @@
+# Copyright (C) 2001-2005, The Perl Foundation.
+# $Id$
+
+=head1 NAME
+
+docs/pdds/draft/pdd28_character_sets.pod - Strings and character sets
+
+=head1 ABSTRACT
+
+This PDD describes the conventions expected for users of Parrot strings,
+including but not limited to support for multiple character sets,
+encodings and languages.
+
+=head1 VERSION
+
+$Revision$
+
+=head1 DESCRIPTION
+
+Here is a summary of the design decisions described in this PDD.
+
+=over 3
+
+=item *
+
+Parrot supports multiple string formats, and so users of Parrot strings
+must be aware at all times of string encoding issues and how these
+relate to the string interface.
+
+=item *
+
+The native Parrot string format is an array of 32-bit Unicode codepoints
+in B<grapheme normalization form> (NFG).
+
+=item *
+
+NFG is defined as a normalization which allocates at most one codepoint
+to each visible character.
+
+=item *
+
+An interface is defined for interacting with Parrot strings and converting
+between character sets and encodings.
+
+=back
+
+=head2 Encoding awareness
+
+Parrot was designed from the outset to support multiple string formats.
+Unlike other such projects, we don't standardize on Unicode internally.
+In the majority of use cases it is still far more efficient to work
+directly with whatever input data the user sends us, and in most of
+those cases that data is something like ASCII, or at least some kind of
+byte-based rather than character-based encoding.
+
+So internally, consumers of Parrot strings have to be aware that several
+string encodings are in use inside Parrot. (Producers of Parrot strings
+can do whatever is most efficient for them.) The implications of this
+for the internal API will be detailed in the implementation section
+below, but to put it in simple terms: if you find yourself writing
+C<*s++> or any other C string idiom, you need to stop and think about
+whether that's what you really mean. Not everything is byte-based any
+more.
+
+However, we're going to try to make it as easy for C<*s++>-minded people
+as possible, and part of that is the declaration of a Parrot native
+string format. You don't have to use it, but if you do, all your dreams
+will come true.
+
+=head2 Native string format
+
+Dealing with variable-width encodings is not fun; for instance, you need
+to work out each character's byte length every time you traverse a
+string. To make programming a lot easier, we define the Parrot native
+string format to be an array of unsigned 32-bit Unicode codepoints.
+
+=head2 Grapheme normalization form
+
+=head1 IMPLEMENTATION
+
+=head2 Changes required to current string implementation
+
+=head2 String access API
+
+=head2 Normalization form
+
+=head2 String encoding API
+
+=head1 REFERENCES
+
+List of references.
+
+=cut
+
+__END__
+Local Variables:
+  fill-column:78
+End:
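
Illustration only, not part of the committed file: the "Native string format" section of the patch says that traversing a variable-width encoding costs extra computation, while an array of 32-bit codepoints does not. The C sketch below shows that contrast, using UTF-8 as the assumed variable-width encoding. The function names (utf8_seq_len, utf8_offset_of, fixed_codepoint_at) are hypothetical and do not come from the Parrot source; the draft PDD has not yet specified its string access API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Parrot: byte length of the UTF-8
 * sequence that starts with lead byte b. Invalid lead bytes count as
 * length 1 so that the caller always makes progress. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Variable-width case: locating codepoint number idx means walking the
 * string from the start. Returns the byte offset of that codepoint, or
 * len if idx is past the end of the string. */
static size_t utf8_offset_of(const unsigned char *s, size_t len, size_t idx)
{
    size_t off = 0;
    assert(s != NULL);
    while (idx > 0 && off < len) {
        off += utf8_seq_len(s[off]);
        idx--;
    }
    return off < len ? off : len;
}

/* Fixed-width case: codepoint number idx is a plain array access.
 * The caller is responsible for keeping idx in bounds. */
static uint32_t fixed_codepoint_at(const uint32_t *s, size_t idx)
{
    assert(s != NULL);
    return s[idx];
}
```

With the fixed-width layout, C<*s++>-style pointer arithmetic steps one codepoint at a time, which is presumably the convenience the patch is after. The UTF-8 walk costs time proportional to the index for every lookup.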
