So, assuming (hah!) that I correctly understood everything in the draft, it seems that I have several general concerns:
1. Why are lists of bits or bytes that are being interpreted as "unencoded" (which seem to fit your definition of "string" in WTHI: A String, http://www.sidhe.org/~dan/blog/archives/000255.html) given the same type as lists of codepoints that have a definite encoding attached? They may be similar, but would some confusion be avoided (or be worth avoiding) by recognizing what seems to me to be a fundamental difference in the way the two are interpreted?
2. Do "offset" parameters measure the offset in codepoints or in bytes, or does it vary by function? If it varies by function, I think we need a better notation for that. However, if it's in bytes, who has to deal with the problems posed by variable-length encodings? However, if "strings" represent text as an array (or list of bytes), then it makes some sense for the offset to be in bytes and not codepoints.
3. Are we storing Unicode codepoints in 32-bit-wide fields, and if so, is UINTVAL guaranteed to be 32 bits wide on all supported platforms?
Incidentally, as I understand the standard, Unicode was designed not to care about signed/unsigned semantics. That's why U+FFFF is always an illegal value.
4. When we're replacing codepoints in a STRING, which I understand to be a collection of bytes, what happens when the codepoint that we're replacing is shorter than the codepoint being put in its place (in a variable-length encoding)? (See my note and the sketch under set_codepoint below.) Are there any other issues with inserting codepoints that we need to be aware of, like issuing warnings when illegal codepoints are inserted, or problems with neighboring codepoints that need to be checked (I'm thinking of Hangul here)?
5. Do we need to have any hooks for transport-layer encodings like compression algorithms?
6. Can the meanings of the functions be conveyed more clearly by their names, particularly the length functions, or are these already standardized names for this functionality?
7. What responsibilities, if any, do these functions have for ensuring that the input they're given and the output they produce are valid?
8. How will errors be handled? By exception? If so, what exceptions? What atomicity guarantees can we give? That is, if we're transforming a string "in place" and an exception is thrown, what happens to the string? (One possible approach is sketched just below.)
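For concern 8, one way I could imagine giving an atomicity guarantee (purely a sketch; the buffer fields and free_string below are made up, not anything from the draft) is to convert into a scratch copy and only swap its buffer in once the conversion has succeeded:

    /* Sketch only: convert into a temporary and swap on success, so the
       original string is untouched if copy_to_encoding throws. */
    void to_encoding(STRING *s)
    {
        STRING *tmp = copy_to_encoding(s);  /* may throw; s is still intact */

        void *old_buf = s->bufstart;        /* field names are invented     */
        s->bufstart   = tmp->bufstart;
        s->buflen     = tmp->buflen;
        tmp->bufstart = old_buf;

        free_string(tmp);                   /* hypothetical destructor      */
    }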
Hope this gives you some ideas. All in all, I think it will be quite workable.
Michael ("Ashsong" on irc)
The annotated functions:
void to_encoding(STRING *);
Make the string the new encoding, in place
If this fails, does it die by throwing an exception? If so, which exception? Is this function atomic?
STRING *copy_to_encoding(STRING *);
Make a copy of the string, in the new encoding.
Likewise. Also, what semantics does the return value have if this fails?
UINTVAL get_codepoint(STRING *, offset);
Return the codepoint at offset.
Is "offset" the number of codepoints to skip or the number of bytes to skip?
(Strings are thought of as the bits that represent text, yet implementations (I think) shouldn't be responsible for dealing with variable-length encodings.)
Also, what are the failure semantics?
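To make the offset question concrete: if the offset is in codepoints, a variable-length encoding such as UTF-8 forces a scan from the start of the buffer (or from some cached position), so every lookup is linear in the offset. A rough sketch of that scan, with size_t standing in for UINTVAL and the buffer/length arguments invented for the example:

    /* Sketch: find the byte index of the Nth codepoint in a UTF-8 buffer
       by skipping continuation bytes (the ones of the form 10xxxxxx). */
    static size_t codepoint_to_byte_offset(const unsigned char *buf,
                                           size_t buflen, size_t cp_offset)
    {
        size_t byte = 0;
        while (cp_offset > 0 && byte < buflen) {
            byte++;
            while (byte < buflen && (buf[byte] & 0xC0) == 0x80)
                byte++;                /* skip continuation bytes */
            cp_offset--;
        }
        return byte;   /* error handling (offset past the end) omitted */
    }

If the offset is in bytes, none of this is needed, but then the caller can hand us an offset that lands in the middle of a character.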
void set_codepoint(STRING, offset, UINTVAL codepoint);
Set the codepoint at offset to codepoint
Since strings are lists of bits representing text: if you insert a codepoint into a UTF-8 encoded string (or any other variable-length encoding) and the codepoint you insert is longer than the one that's there, what happens?
Also, is UINTVAL guaranteed to be 32 bits wide? (Since that's what Unicode codepoints are supposed to be stored in, as I understand it.)
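To put the "longer codepoint" problem concretely (again just a sketch; the buffer, lengths, and size_t-for-UINTVAL are all invented for the example): replacing a one-byte character with a two-byte one in a UTF-8 string means the tail of the buffer has to move, and the buffer may need to grow first:

    #include <string.h>

    /* Sketch: splice new_len replacement bytes over the old_len bytes at
       byte_off.  Growing the buffer when new_len > old_len is left out. */
    static size_t replace_bytes(unsigned char *buf, size_t used,
                                size_t byte_off, size_t old_len,
                                const unsigned char *repl, size_t new_len)
    {
        if (new_len != old_len)
            memmove(buf + byte_off + new_len, buf + byte_off + old_len,
                    used - (byte_off + old_len));
        memcpy(buf + byte_off, repl, new_len);
        return used + new_len - old_len;   /* the new used length */
    }

So a single set_codepoint can turn into a memmove of everything after the offset, which seems worth spelling out in the design.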
UINTVAL get_byte(STRING *, offset)
Get the byte at offset
Same concerns about the meaning of "offset".
void set_byte(STRING *, offset, UINTVAL byte);
Set the byte at offset to byte
Same concerns about "offset".
Why is the type of parameter "byte" a UINTVAL?
STRING *get_codepoints(STRING, offset, count);
Get count codepoints starting at offset, returned as a STRING of no
charset. (If called through the charset code the returned string may be
put into a charset if that's a valid thing)
Same concerns about "offset".
STRING *get_bytes(STRING, offset, count)
Get count bytes starting at offset, returned as a binary STRING.
What is the design rationale for giving a set of bytes, from which the encoding (if I understand correctly) is deliberately being stripped, the same type as the encoded string?
Say I take a string in a foreign encoding and immediately call get_bytes on it. Should I, as a programmer, be allowed to (implicitly) convert the resulting chunk of bytes back into a string in the current encoding by calling other string functions on it, given that it probably can't be interpreted correctly without an explicit transformation call first?
Perhaps what I'm asking is whether all string functions are going to check for bad input, or whether it will be assumed that the functions used to create strings always produce clean output.
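As an illustration of what "checking for bad input" might mean for one concrete encoding (a sketch only; nothing like this is in the draft), a UTF-8 check has to reject stray continuation bytes, bad lead bytes and truncated sequences. A thorough one would also reject overlong forms and surrogates, which I've left out here:

    #include <stddef.h>

    /* Sketch: structural check of a UTF-8 buffer; each lead byte must be
       followed by the right number of continuation bytes. */
    static int utf8_structurally_valid(const unsigned char *p, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            int follow;
            if      (p[i] < 0x80)            follow = 0;
            else if ((p[i] & 0xE0) == 0xC0)  follow = 1;
            else if ((p[i] & 0xF0) == 0xE0)  follow = 2;
            else if ((p[i] & 0xF8) == 0xF0)  follow = 3;
            else                             return 0;   /* bad lead byte */
            if (i + follow >= len)           return 0;   /* truncated     */
            for (i++; follow-- > 0; i++)
                if ((p[i] & 0xC0) != 0x80)   return 0;   /* bad tail byte */
        }
        return 1;
    }

Whether every string function runs something like this, or only the constructors do, seems like it needs to be stated explicitly.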
void set_codepoints(STRING, offset, count, STRING codepointstring);
Set count codepoints, at offset offset to the contents of the codepoint
string.
Same concerns about "offset". Is it assumed that both strings are in the same encoding?
Same concerns about "offset."
void set_bytes(STRING, offset, count, STRING binarystring);
Set count bytes, at offset offset, to the contents of binary string
Why is this function different from set_codepoints in name but not in signature?
Why aren't strings that are interpreted as raw bits a different type from lists of codepoints in a specified encoding?
void become_encoding(STRING *);
Assume the string is the new encoding and make it so. Validate first and throw an exception if this assumption is incorrect.
I don't think I understand what this function is supposed to do.
Is it supposed to force any given string into the encoding specified by the functions in the global table?
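My best guess at the intended difference (this is just how I'm reading the draft, not something it says) is that to_encoding rewrites the bytes into the target encoding, while become_encoding leaves the bytes alone and merely validates and re-tags them. Roughly:

    /* My reading, not confirmed by the draft: */

    /* s is tagged Latin-1 and holds Latin-1 bytes. */
    to_encoding(s);       /* transcodes the bytes into the new encoding */

    /* s holds bytes that are already valid in the new encoding but is
       currently tagged as something else (e.g. raw binary). */
    become_encoding(s);   /* keeps the bytes, validates them, re-tags on
                             success, throws on failure */

If that reading is right, the descriptions could say so explicitly.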
----------
Maybe these are standard names in Parrot, but would names like "byte_length" or "num_codepoints" more clearly indicate that what is being returned is in fact a size or length and not the actual bytes or codepoints? Is there a standard convention for naming functions that return the length of things?
UINTVAL codepoints(STRING *);
Return the size in codepoints
UINTVAL bytes(STRING *);
Return the size in bytes
Also, the same concerns about type conflation apply here: lists of bits independent of any encoding versus lists of codepoints with a specified encoding.
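A tiny example of why I'd like the names to make "length" explicit: for a three-codepoint string stored as UTF-8, the two functions return different numbers, and it's easy to reach for the wrong one.

    /* Hypothetical values, assuming s holds "héh" encoded as UTF-8: */
    UINTVAL n_cp = codepoints(s);   /* 3: 'h', 'é', 'h'                */
    UINTVAL n_b  = bytes(s);        /* 4: 'é' takes two bytes in UTF-8 */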
--------------