Author: allison Date: Tue Apr 22 10:56:50 2008 New Revision: 27124 Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod
Log: [pdd] Add interface specification to Strings PDD. Modified: trunk/docs/pdds/draft/pdd28_character_sets.pod ============================================================================== --- trunk/docs/pdds/draft/pdd28_character_sets.pod (original) +++ trunk/docs/pdds/draft/pdd28_character_sets.pod Tue Apr 22 10:56:50 2008 @@ -250,28 +250,98 @@ =head2 String API -Strings have the following structure: +Strings in the Parrot core should use the Parrot C<STRING> structure. Parrot +developers generally shouldn't deal with C<char *> or other string-like types +outside of this abstraction. It's also best not to access members of the +C<STRING> structure directly. The interpretation of the data inside the +structure is determined by the data's encoding. Parrot's strings are +encoding-aware so your functions don't need to be. + +Parrot's internal strings (C<STRING>s) have the following structure: struct parrot_string_t { UnionVal cache; Parrot_UInt flags; UINTVAL bufused; - UINTVAL hashval; UINTVAL strlen; - char *strstart; + UINTVAL hashval; const struct _encoding *encoding; const struct _charset *charset; const struct _normalization *normalization; }; -Deprecation note: the enum C<parrot_string_representation_t> will be removed. +The fields are: + +=over 4 + +=item cache + +A structure that holds the buffer for the string data and the size of the +buffer in bytes. + +{{NOTE: this is currently called "cache" for compatibility with PMC structures. +As we move toward eliminating the cache from PMCs, we will flatten out this +union value in the string structure to two members: a string buffer and the +size of the buffer used.}} + +=item flags + +Binary flags used for garbage collection, copy-on-write tracking, and other +metadata. + +=item bufused + +The amount of the buffer currently in use, in bytes. + +=item strlen + +The length of the string, in bytes. 
{{NOTE, not in characters, as characters +may be variably sized.}} + +=item hashval + +A cache of the hash value of the string, for rapid lookups when the string is +used as a hash key. + +=item encoding + +How the data is encoded (e.g. fixed 8-bit characters, UTF-8, or UTF-32). Note +that this specifies encoding only -- it's valid to encode EBCDIC characters +with the UTF-8 algorithm. Silly, but valid. + +The encoding structure specifies the encoding (by index number and by name, +for ease of lookup), the maximum number of bytes that a single character will +occupy in that encoding, as well as functions for manipulating strings with +that encoding. + +=item charset + +What sort of string data is in the buffer, for example ASCII, EBCDIC, or +Unicode. + +The charset structure specifies the character set (by index number and by +name) and provides functions for transcoding to and from that character set. + +=item normalization + +What normalization form the string data is in, one of the 4 Unicode +normalization forms or NFG. This structure stores information about the current +normalization form, function pointers for composition and decomposition for the +current normalization form, and a pointer to the grapheme table for NFG. + +=back + + +{{DEPRECATION NOTE: the enum C<parrot_string_representation_t> will be removed +from the parrot string structure. It's been commented out for years.}} -The current string functions will on the whole be maintained, with some -modifications for the addition of the NFG string format. +{{DEPRECATION NOTE: the C<char *> pointer C<strstart> will be removed. 
It +complicates the entire string subsystem for a tiny optimization on substring +operations, and offset math is messy with encodings that aren't byte-based.}} =head3 Conversions between normalization form, encoding, and charset -Conversion will be done with a function called C<string_grapheme_copy>: +Conversion will be done with a function called C<Parrot_string_grapheme_copy>: INTVAL string_grapheme_copy(STRING *src, STRING *dst) @@ -285,8 +355,446 @@ characters in non-NFG strings). This conversion effectively uses an intermediate NFG representation. +=head2 String Subsystem + +The following functions are used internally to initialise and terminate the +string allocation and garbage collection subsystem. + +=head3 Parrot_string_system_init (was string_init) + +Initialize Parrot's string subsystem. + +=head3 Parrot_string_system_end (was string_deinit) + +Terminate Parrot's string subsystem (clean up). + +=head2 String Interface Functions + +The current string functions will be maintained, with some modifications for +the addition of the NFG string format. Many string functions that are part of +Parrot's external API will be renamed for the standard "Parrot_*" naming +conventions. + +=head3 Parrot_string_set (was string_set) + +Set one string to a copy of the value of another string. + +=head3 Parrot_string_new_COW (was Parrot_make_COW_reference) + +Create a new copy-on-write string. Create a new string header, clone the +struct members of the original string, and point to the same string buffer as +the original string. + +=head3 Parrot_string_reuse_COW (was Parrot_reuse_COW_reference) + +Create a new copy-on-write string. Clone the struct members of the original +string into a passed-in string header, and point the reused string header to +the same string buffer as the original string. 
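As a rough illustration of the copy-on-write idea behind C<Parrot_string_new_COW>, here is a minimal sketch in plain C. It is not Parrot's implementation: the C<sketch_string> struct, its C<is_cow> field, and the C<sketch_*> functions are hypothetical names invented for this example, and the sketch ignores buffer ownership and garbage collection entirely. It also shows the unsharing step a writer would need before modifying a shared buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string header; NOT Parrot's STRING layout. */
typedef struct sketch_string {
    char   *buf;      /* string buffer, possibly shared with other headers */
    size_t  bufused;  /* bytes of the buffer in use */
    int     is_cow;   /* nonzero while the buffer may be shared */
} sketch_string;

/* Build a header that owns a private copy of len bytes from src. */
static sketch_string *sketch_new(const char *src, size_t len)
{
    sketch_string *s = malloc(sizeof *s);
    assert(s != NULL);
    s->buf = malloc(len ? len : 1);
    assert(s->buf != NULL);
    memcpy(s->buf, src, len);
    s->bufused = len;
    s->is_cow  = 0;
    return s;
}

/* New header, cloned struct members, same buffer as the original. */
static sketch_string *sketch_new_cow(sketch_string *orig)
{
    sketch_string *copy = malloc(sizeof *copy);
    assert(copy != NULL);
    *copy = *orig;        /* clones every member, including the buf pointer */
    orig->is_cow = 1;     /* both headers now refer to a shared buffer */
    copy->is_cow = 1;
    return copy;
}

/* Before writing: give s a private buffer and clear its COW flag.
 * The old buffer is left alone because other headers may still use it
 * (a real system would rely on GC or reference counts to reclaim it). */
static void sketch_write_cow(sketch_string *s)
{
    if (s->is_cow) {
        char *priv = malloc(s->bufused ? s->bufused : 1);
        assert(priv != NULL);
        memcpy(priv, s->buf, s->bufused);
        s->buf    = priv;
        s->is_cow = 0;
    }
}
```

In this sketch, creating the COW copy costs one header allocation and no buffer copy; the buffer is duplicated only when C<sketch_write_cow> is called on a header that is about to be modified.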
+ +=head3 Parrot_string_write_COW (was Parrot_unmake_COW) + +If the specified Parrot string is copy-on-write, copy the string value to a new +string buffer and clear the copy-on-write flag. + +=head3 Parrot_string_concat (was string_concat) + +Concatenate two strings. Takes three arguments: two strings, and one integer +value of flags. If both string arguments are null, returns a new string created +according to the integer flags. + +=head3 Parrot_string_append (was string_append) + +Append one string to another and return the result. In the default case, the +return value is the same as the first string argument (the argument is modified +by the operation). If the first argument is COW or read-only, then the return +value is a new string. + +=head3 Parrot_string_from_cstring (was string_from_cstring) + +Create a Parrot string from a C string (a C<char *>). Takes two arguments, a C +string, and an integer length of the string (number of characters). If the +integer length isn't passed, the function will calculate the length. + +{{NOTE: the integer length isn't really necessary, and is under consideration +for deprecation.}} + +=head3 Parrot_constant_string_new (was const_string) + +Creates and returns a new Parrot constant string. Takes one C string (a C<char +*>) as an argument, the value of the constant string. The length of the C +string is calculated internally. + +=head3 Parrot_string_new + +Return a new string with the default encoding and character set. Accepts one +argument, a C string (C<char *>) to initialize the value of the string. + +=head3 Parrot_string_new_noinit (was string_make_empty) + +Returns a new empty string with the default encoding and character set. + +=head3 Parrot_string_new_init (was string_make_direct) + +Returns a new string of the requested encoding, character set, and +normalization form, initializing the string value to the value passed in. 
Takes +5 arguments, a C string (C<char *>), an integer length of the string argument +in bytes, and struct pointers for encoding, character set, and normalization +form structs. If the C string (C<char *>) value is not passed, returns an empty +string. If the encoding, character set, or normalization form are passed as +null values, default values are used. + +{{NOTE: the crippled version of this function, C<string_make>, used to accept a +string name for the character set. This behavior is no longer supported, but +C<Parrot_find_encoding> and C<Parrot_find_charset> can be called to look up the +encoding or character set structs.}} + +=head3 Parrot_string_resize (was string_grow) + +Resize the string buffer of the given string, adding the number of bytes passed +in the integer argument. If the argument is negative, remove the given number +of bytes. Throws an exception if shrinking the string buffer size will truncate +the string (if C<strlen> will be longer than C<buflen>). + +=head3 Parrot_string_length (was string_compute_strlen) + +Returns the number of characters in the string. Combining characters are each +counted separately. Variable-width encodings may look ahead. + +=head3 Parrot_string_grapheme_length + +Returns the number of graphemes in the string. Groups of combining characters +count as a single grapheme. + +=head3 Parrot_string_byte_length (was string_length) + +Returns the number of bytes in the string. The character width of +variable-width encodings is ignored. Combining characters are not treated any +differently from other characters. This is equivalent to directly accessing the +C<strlen> member of the C<STRING> struct. + +=head3 Parrot_string_index (was string_index) + +Returns the character at the specified index (the Nth character from the start +of the string). Combining characters are counted separately. Variable-width +encodings will look ahead to capture full character values. 
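To make the distinction between byte length, character length, and grapheme length concrete, the following sketch counts all three for a UTF-8 buffer. It is an illustration only, not Parrot code: the C<sketch_*> names are hypothetical, the input is assumed to be well-formed UTF-8, and "grapheme" is approximated very crudely by attaching code points in the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding character. Real grapheme segmentation is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one code point from well-formed UTF-8 at s.
 * Stores the code point in *cp and returns the bytes consumed (1 to 4). */
static size_t sketch_utf8_decode(const unsigned char *s, unsigned long *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3FUL);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
            | (s[2] & 0x3FUL);
        return 3;
    }
    *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
        | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3FUL);
    return 4;
}

/* Crude assumption: only U+0300..U+036F count as combining characters. */
static int sketch_is_combining(unsigned long cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Character length: every code point counts, combining or not. */
static size_t sketch_char_length(const char *s, size_t nbytes)
{
    size_t i = 0, n = 0;
    unsigned long cp;
    while (i < nbytes) {
        i += sketch_utf8_decode((const unsigned char *)s + i, &cp);
        n++;
    }
    return n;
}

/* Grapheme length: a combining character joins the preceding grapheme. */
static size_t sketch_grapheme_length(const char *s, size_t nbytes)
{
    size_t i = 0, n = 0;
    unsigned long cp;
    while (i < nbytes) {
        i += sketch_utf8_decode((const unsigned char *)s + i, &cp);
        if (n == 0 || !sketch_is_combining(cp))
            n++;
    }
    return n;
}
```

For the six-byte buffer C<"cafe\xCC\x81"> (the letters "cafe" followed by U+0301 COMBINING ACUTE ACCENT), the byte length is 6, C<sketch_char_length> returns 5, and C<sketch_grapheme_length> returns 4.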
+ +=head3 Parrot_string_grapheme_index + +Returns the grapheme at the specified index (the Nth grapheme from the start of +the string). Groups of combining characters count as a single grapheme, so this +function may return multiple characters. + +=head3 Parrot_string_find_substr (was string_str_index) + +Search for a given substring within a string. If it's found, return an integer +index to where the substring was found (the Nth character from the start of the +string). Combining characters are counted separately. Variable-width encodings +will look ahead to capture full character values. Returns -1 if the substring is +not found. + +=head3 Parrot_string_copy (was string_copy) + +Make a COW copy of a string (a new string header pointing to the same string +buffer). + +=head3 Parrot_string_grapheme_copy (new) + +Accepts two string arguments: a destination and a source. Iterates through the +source string one grapheme at a time and appends each to the destination string. + +This function can be used to convert a string of one format to another format. + +=head3 Parrot_string_repeat (was string_repeat) + +Return a string containing the passed string argument, repeated the number of +times in the integer argument. + +=head3 Parrot_string_substr (was string_substr) + +Return a substring starting at an integer offset with an integer length. The +offset and length specify characters. Combining characters are counted +separately. Variable-width encodings will look ahead to capture full character +values. + +=head3 Parrot_string_grapheme_substr + +Return a substring starting at an integer offset with an integer length. The +offset and length specify graphemes. Groups of combining characters count as a +single grapheme. + +=head3 Parrot_string_replace (was string_replace) + +Replaces a substring within the first string argument with the second string +argument. An integer offset and length, in characters, specify where the +removed substring starts and how long it is. 
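The character-indexed search described for C<Parrot_string_find_substr> differs from a plain byte search only in how the result is reported. The sketch below shows one way that could look for NUL-terminated, well-formed UTF-8 input: find the byte offset with the standard C<strstr>, then convert it to a character index by counting non-continuation bytes. The C<sketch_*> functions are hypothetical and are not part of Parrot's API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in the first nbytes of well-formed
 * UTF-8. Every byte that is not a continuation byte (10xxxxxx) starts a
 * new character. */
static long sketch_chars_in(const char *s, size_t nbytes)
{
    long n = 0;
    assert(s != NULL);
    for (size_t i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Return the character index of the first occurrence of needle in
 * haystack, or -1 if it does not occur. Both arguments are assumed to be
 * NUL-terminated, well-formed UTF-8, so a byte-level match always starts
 * on a character boundary. */
static long sketch_find_substr(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    if (hit == NULL)
        return -1;
    return sketch_chars_in(haystack, (size_t)(hit - haystack));
}
```

With the haystack C<"na\xC3\xAFve caf\xC3\xA9"> ("naive" with a diaeresis, a space, and "cafe" with an acute accent), searching for C<"caf"> finds the match at byte offset 7 but reports character index 6, because the two-byte sequence for U+00EF counts as one character.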
+ +=head3 Parrot_string_grapheme_replace + +Replaces a substring within the first string argument with the second string +argument. An integer offset and length, in graphemes, specify where the removed +substring starts and how long it is. + +=head3 Parrot_string_chopn (was string_chopn) + +Chop the requested number of characters off the end of a string without +modifying the original string. + +=head3 Parrot_string_chopn_inplace (was string_chopn_inplace) + +Chop the requested number of characters off the end of a string, modifying the +original string. + +=head3 Parrot_string_grapheme_chopn + +Chop the requested number of graphemes off the end of a string without +modifying the original string. + + + +=head2 Internal String Functions + +=head3 string_max_bytes + +Calculate the number of bytes needed to contain a given number of characters in +a particular encoding. It multiplies the maximum possible width of a character +in the encoding by the number of characters requested. + +{{NOTE: pretty primitive and not very useful. May be deprecated.}} + +=head2 Deprecated String Functions + +The following string functions are slated to be deprecated. + +=head3 string_primary_encoding_for_representation + +Not useful; it only ever returned ASCII. + +=head3 string_rep_compatible + +Only useful on a very narrow set of string encodings/character sets. + +=head3 string_make + +This was a crippled version of a string initializer, now replaced with the full +version C<Parrot_string_new_init>. + +=head3 string_capacity + +This was used to calculate the size of the buffer after the C<strstart> +pointer. Deprecated with C<strstart>. + +=head3 string_ord + +Replaced by C<Parrot_string_index>. + +=head3 string_chr + +This is handled just fine by C<Parrot_string_new>; we don't need a special +version for a single character. + +=head3 make_writable + +An archaic function that uses a method of describing strings that hasn't been +allowed for years. 
+ =head2 String PMC API +The String PMC provides a high-level object interface to the string +functionality. It contains a standard Parrot string, holding the string data. + +=head3 Vtable Functions + +The String PMC implements the following vtable functions. + +=over 4 + +=item init + +Initialize a new String PMC. + +=item new_from_string + +Create a new String PMC from a Parrot string argument. + +=item clone + +Clone a String PMC. + +=item mark + +Mark the string value of the String PMC as live. + + +=item get_integer + +Return the integer representation of the string. + +=item get_number + +Return the floating-point representation of the string. + +=item get_bignum + +Return the big number representation of the string. + +=item get_string + +Return the string value of the String PMC. + +=item get_bool + +Return the boolean value of the string. + +=item set_integer_native + +Set the string to an integer value, transforming the integer to its string +equivalent. + +=item set_bool + +Set the string to a boolean (integer) value, transforming the boolean to its +string equivalent. + +=item set_number_native + +Set the string to a floating-point value, transforming the number to its string +equivalent. + +=item set_string_native + +Set the String PMC's stored string value to be the string argument. If the +passed in string is a constant, store a copy. + +=item assign_string_native + +Set the String PMC's stored string value to a copy of the string argument. + +=item set_string_same + +Set the String PMC's stored string value to be the same as another String PMC's +stored string value. {{NOTE: uses direct access into the storage of the two +PMCs, very ugly.}} + +=item set_pmc + +Set the String PMC's stored string value to be the same as another PMC's string +value, as returned by that PMC's C<get_string> vtable function. + +=item *bitwise* + +All the bitwise string vtable functions, for AND, OR, XOR, and NOT, both +inplace and standard return. 
+ +=item is_equal + +Compares the string values of two PMCs and returns true if they match exactly. + +=item is_equal_num + +Compares the numeric values of two PMCs (first transforming any strings to +numbers) and returns true if they match exactly. + +=item is_equal_string + +Compares the string values of two PMCs and returns true if they match exactly. +{{NOTE: the documentation for the PMC says that it returns FALSE if they match. +This is not the desired behavior.}} + +=item is_same + +Compares two PMCs and returns true if they are the same PMC class and contain +the same string (not an equivalent string value, but aliases to the same +low-level string). + +=item cmp + +Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length +strings, and -1 if the passed in string argument is shorter. + +=item cmp_num + +Compares the numeric values of two PMCs (first changing those values to +numbers) and returns 1 if SELF is smaller, 0 if they are equal, and -1 if the +passed in string argument is smaller. + +=item cmp_string + +Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length +strings, and -1 if the passed in string argument is shorter. + +=item substr + +Extract a substring of a given length starting from a given offset (in +graphemes) and store the result in the string argument. + +=item substr_str + +Extract a substring of a given length starting from a given offset (in +graphemes) and return the string. + +=item exists_keyed + +Return true if the Nth grapheme in the string exists. Negative numbers count +from the end. + +=item get_string_keyed + +Return the Nth grapheme in the string. Negative numbers count from the end. + +=item set_string_keyed + +Insert a string at the Nth grapheme position in the string. {{NOTE: this is +different than the current implementation.}} + +=item get_integer_keyed + +Returns the integer value of the Nth C<char> in the string. 
{{DEPRECATE}} + +=item set_integer_keyed + +Replace the C<char> at the Nth character position in the string with the +C<char> that corresponds to the passed integer value key. {{DEPRECATE}} + +=back + +=head3 Methods + +The String PMC provides the following methods. + +=over 4 + +=item replace + +Replace every occurrence of one string with another. + +=item to_int + +Return the integer equivalent of a string. + +=item lower + +Change the string to all lowercase. + +=item trans + +Translate an ASCII string with entries from a translation table. + +{{NOTE: likely to be deprecated.}} + +=item reverse + +Reverse a string, one grapheme at a time. {{NOTE: Currently only works for ASCII +strings, because it reverses one C<char> at a time.}} + + +=item is_integer + +Checks if the string is just an integer. {{NOTE: Currently only works for ASCII +strings, fix or deprecate.}} + +=back + + =head1 REFERENCES http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps - Plan 9's Runes are