After my discussion, I've included an annotated copy of the functions, where I've added my comments after each function.
So, assuming (hah!) that I correctly understood everything in the draft, it seems that I have several general concerns:
1. Why are lists of bits or bytes that are being interpreted as "unencoded", which seem to fit your definition of "string" in "WTHI: A String" (http://www.sidhe.org/~dan/blog/archives/000255.html), given the same type as lists of codepoints that have a definite encoding attached? They may be similar, but would some confusion be avoided (or be worth avoiding) by recognizing what seems to me to be a fundamental difference in the way the two are interpreted?
I'm not sure I understand the question (though it is early, and I've had insufficient coffee) but it might help if I explain that there's no such thing as unencoded lists of bytes, at least as far as parrot's concerned. The closest we get at the encoding level is "8-bit fixed length" for binary data. With that, each code point is a single byte.
2. Do "offset" parameters measure the offset in codepoints or in bytes, or does it vary by function? If it varies by function, I think we need a better notation for that. However, if it's in bytes, who has to deal with the problems posed by variable-length encodings? However, if "strings" represent text as an array (or list of bytes), then it makes some sense for the offset to be in bytes and not codepoints.
It varies by function. That needs to be better noted. If you're looking for a code point, the offset's in code points; if you're looking for a byte, the offset's in bytes. (And when we get to the charset end of things, offsets will be in graphemes (or Freds; I don't remember what we finally decided to name the things).)
3. Are we storing Unicode codepoints in 32-bit-wide fields, and if so, is UINTVAL guaranteed to be 32 bits wide on all supported platforms?
Incidentally, as I understand the standard, Unicode was designed not to care about signed/unsigned semantics. That's why U+FFFF is always an illegal value.
UINTVAL is always at least 32 bits for Parrot. It may be bigger. We'll prefer to store Unicode codepoints in UTF-32 format, I expect, as it's got O(1) access time for code points. (Not characters, of course, but that's a separate problem)
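To put the O(1) vs. O(n) difference in concrete terms, here's a rough sketch with made-up helper names (nothing out of the real API):

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: one 32-bit cell per code point, so access is a plain index. */
    uint32_t
    utf32_codepoint_at(const uint32_t *buf, size_t offset)
    {
        return buf[offset];                          /* O(1) */
    }

    /* UTF-8: code points take 1-4 bytes, so we have to walk the buffer.
     * (Full decoding is left out; the point here is just the scan cost.) */
    uint32_t
    utf8_codepoint_at(const unsigned char *buf, size_t offset)
    {
        size_t i = 0;

        while (offset--) {                           /* O(offset) */
            unsigned char lead = buf[i];
            i += (lead < 0x80) ? 1
               : (lead < 0xE0) ? 2
               : (lead < 0xF0) ? 3
               :                 4;
        }
        return buf[i];   /* just the lead byte; a real version would decode */
    }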
4. When we're replacing codepoints in a STRING, which I understand to be a collection of bytes, what happens when the codepoint that we're replacing is shorter than the codepoint that is being put in its place (in a variable-length encoding)? Are there any other issues with inserting codepoints that we need to be aware of, like issuing warnings when illegal codepoints are inserted, or problems with neighboring codepoints that need to be checked (I'm thinking of Hangul here)?
There are several problems here.
When dealing with variable-length encodings, removal of codepoints in the middle may make the string shrink, and adding them may make it grow. The encoding layer is responsible for managing the underlying byte buffer to maintain consistency.
The encoding layer is responsible for complaining if you do byte-level manipulation and generate invalid codepoints (because you messed up a UTF-8 buffer, for example).
The encoding layer is *not* responsible for maintaining the integrity of graphemes. That's the responsibility of the charset layer, and one of the reasons the charset layer shadows the encoding functions. (So when you manipulate code points you'll generally do it through the charset layer, which'll enforce correctness if it wants)
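As a rough picture of how the layering hangs together, each encoding carries a table of the functions discussed below, and each charset wraps one of those tables with its own grapheme-aware entries. The struct and field names here are illustrative (not actual parrot headers), and treating the draft's untyped offsets as UINTVALs is an assumption:

    typedef struct parrot_string STRING;   /* opaque for this sketch             */
    typedef unsigned long        UINTVAL;  /* "at least 32 bits", as noted above */

    /* One vtable of the per-encoding functions from this draft. */
    typedef struct _encoding {
        const char *name;
        void     (*to_encoding)(STRING *);
        STRING * (*copy_to_encoding)(STRING *);
        UINTVAL  (*get_codepoint)(STRING *, UINTVAL offset);
        void     (*set_codepoint)(STRING *, UINTVAL offset, UINTVAL codepoint);
        UINTVAL  (*get_byte)(STRING *, UINTVAL offset);
        void     (*set_byte)(STRING *, UINTVAL offset, UINTVAL byte);
        void     (*become_encoding)(STRING *);
        UINTVAL  (*codepoints)(STRING *);
        UINTVAL  (*bytes)(STRING *);
    } ENCODING;

    /* The charset layer shadows the code point entries, so manipulation done
     * through it can refuse changes that would split a grapheme. */
    typedef struct _charset {
        const char *name;
        ENCODING   *encoding;    /* the underlying byte/codepoint machinery */
        UINTVAL   (*get_codepoint)(STRING *, UINTVAL offset);
        void      (*set_codepoint)(STRING *, UINTVAL offset, UINTVAL codepoint);
        /* ...and so on for the rest of the shadowed functions */
    } CHARSET;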
5. Do we need to have any hooks for transport-layer encodings like compression algorithms?
Ow. You just made my brain hurt.
I honestly hadn't considered layering encodings. This would be an amazingly cool thing. I think, though, that we won't do it, no matter how cool it would be to have, say, a zip or gzip encoding on top of a UTF-32 encoding. For that, the transport stream would have to do the compression or decompression.
Still, it *would* be cool to have, say, a GZip-UTF32 layer that looked like UTF-32 and did compression on the underlying buffer... (That would give a very different view of code points vs bytes there, too.)
6. Can the meanings of the functions be more clearly conveyed, particularly with respect to the length functions, or are these already standardized names for this functionality?
Sure. The naming's completely up in the air, and I'm happy for alternatives.
7. What, if any responsibilities do these functions have for insuring that the input that they're dealing with and the output that they produce are valid?
The assumption can be made that input data is already valid, so the functions are only required to make the end result of the function valid, assuming that the encoding actually cares. (Which it may not, but that strikes me as unlikely)
8. How will errors be handled? By exception? If so, what exceptions? What atomicity guarantees can we give? I.e., If we're transforming a string "in-place" and an exception is thrown, what happens to the string?
Exceptions will be thrown for errors. I think we'll not guarantee atomicity in the face of an exception, but I'm open for argument on that one.
I didn't detail the exceptions that get thrown. Let's put that off for a bit and treat it separately, once we nail down the function semantics and names.
The annotated functions:
void to_encoding(STRING *);
Make the string the new encoding, in place
If this fails, does it die by throwing an exception? If so, what exception? Is this an atomic function?
It does throw an exception, and I'd not planned on it being particularly atomic, but we could define it as such.
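If we did decide it should be atomic, the obvious (if not the cheapest) way is to convert into a copy and swap only on success. A sketch, where string_swap_body is a hypothetical helper that exchanges the two strings' buffers and encoding markers:

    typedef struct parrot_string STRING;           /* opaque for this sketch  */

    STRING *copy_to_encoding(STRING *);            /* from the draft below    */
    void    string_swap_body(STRING *, STRING *);  /* hypothetical helper     */

    /* If the conversion throws, the original string hasn't been touched, so
     * the caller sees either the old contents or the fully converted ones. */
    void
    to_encoding_atomic(STRING *s)
    {
        STRING *converted = copy_to_encoding(s);   /* throws on failure       */
        string_swap_body(s, converted);            /* only reached on success */
    }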
STRING *copy_to_encoding(STRING *);
Make a copy of the string, in the new encoding.
Likewise. Also, what semantics does the return value have if this fails?
It throws an exception, so won't return.
UINTVAL get_codepoint(STRING *, offset);
Return the codepoint at offset.
Is "offset" the number of codepoints to skip or the number of bytes to skip?
(Since strings are thought of as being the bits that represent text, yet implementations (I think) shouldn't be responsible for dealing with variable-length encodings).
Offset is in codepoints.
Also, what are the failure semantics?
Failure. Hrm. Good question. Probably an exception, though I can see returning an empty string.
void set_codepoint(STRING *, offset, UINTVAL codepoint);
Set the codepoint at offset to codepoint
Since strings are lists of bits representing text, if you insert a codepoint in a UTF-8 encoded string (or anything else that's a variable-length encoding) and the codepoint that you insert is longer than the one that's there, what happens?
The old one is completely removed and the new one inserted. If they're of different lengths then the proper thing happens. (Which is that all the bytes of the old one are chopped out and all the bytes of the new one inserted with appropriate shuffling of the rest of the buffer)
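At the raw-buffer level that's just a memmove. A sketch, assuming the buffer already has room for any growth (reallocation is the encoding layer's problem and is left out):

    #include <stddef.h>
    #include <string.h>

    /* Replace the 'oldlen' bytes of the outgoing code point at byte position
     * 'pos' with the 'newlen' bytes of the incoming one.  'buf' holds '*used'
     * bytes and is assumed big enough to absorb any growth. */
    static void
    splice_codepoint_bytes(unsigned char *buf, size_t *used, size_t pos,
                           size_t oldlen,
                           const unsigned char *newbytes, size_t newlen)
    {
        size_t tail = *used - (pos + oldlen);  /* bytes after the old code point */

        /* shuffle the rest of the buffer so the new bytes fit exactly... */
        memmove(buf + pos + newlen, buf + pos + oldlen, tail);
        /* ...then drop the new code point into the gap */
        memcpy(buf + pos, newbytes, newlen);

        *used = *used - oldlen + newlen;       /* the string grows or shrinks */
    }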
Also, is UINTVAL guaranteed to be 32 bits wide? (Since that's what Unicode codepoints are supposed to be stored in, as I understand it.)
Yes. UINTVAL is at least 32 bits.
UINTVAL get_byte(STRING *, offset);
Get the byte at offset
Same concerns about the meaning of "offset".
Byte offset. Needs more clarity.
void set_byte(STRING *, offset, UINTVAL byte);
Set the byte at offset to byte
Same concerns about "offset". Why is the type of parameter "byte" a UINTVAL?
Because we don't have a byte type. :) If the byte's out of range we should probably pitch an exception.
STRING *get_codepoints(STRING *, offset, count);
Get count codepoints starting at offset, returned as a STRING of no charset. (If called through the charset code the returned string may be put into a charset if that's a valid thing)
Same concerns about "offset".
Same answer. (That is, I need to be more explicit :)
STRING *get_bytes(STRING *, offset, count);
Get count bytes starting at offset, returned as a binary STRING.
What is the design rationale for giving a set of bytes, from which the encoding (if I understand correctly) has deliberately been stripped, the same type as the encoded string?
This is for very low-level things, since we don't officially allow you to cheat and peek straight at the buffer. For example, assume that I am sending the buffer to an external source in fixed-size chunks. The low-level code will want to, say, snag the first 512 bytes and send them (if I've a 512-byte packet size), then the next 512, and so on, with no regard to anything higher-level--I want *exactly* what's in the buffer.
Say I take a string in a foreign encoding and immediately call get_bytes on it. Should I, as a programmer, be allowed to (implicitly) convert the resulting chunk of bytes back into a string in the current encoding by calling other string functions on it? (Since it probably can't be correctly interpreted without an explicit transformation call first.)
Perhaps what I'm asking is if all string functions are going to be doing checking for bad input or if it will be assumed that the functions used to create strings produce clean output all the time.
Generally this function won't be used by higher-level code. We'll expose it via ops for folks writing low-level driver-type code, or charset code, in bytecode, but nobody'll use it day to day, and they ought not.
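For the curious, the driver-type use would look something like this sketch. send_packet is a made-up transport call, and treating the draft's untyped offsets and counts as UINTVALs is an assumption:

    typedef struct parrot_string STRING;    /* opaque for this sketch      */
    typedef unsigned long        UINTVAL;

    UINTVAL  bytes(STRING *);               /* from this draft             */
    STRING  *get_bytes(STRING *, UINTVAL offset, UINTVAL count);
    void     send_packet(STRING *chunk);    /* hypothetical transport call */

    /* Push a string's underlying bytes out in fixed 512-byte packets,
     * paying no attention to code points or graphemes at all. */
    void
    send_string_raw(STRING *s)
    {
        UINTVAL total = bytes(s);
        UINTVAL off;

        for (off = 0; off < total; off += 512) {
            UINTVAL count = (total - off < 512) ? total - off : 512;
            send_packet(get_bytes(s, off, count));   /* binary STRING chunk */
        }
    }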
void set_codepoints(STRING *, offset, count, STRING *codepointstring);
Set count codepoints, at offset offset, to the contents of the codepoint string.
Same concerns about "offset". Is it assumed that both strings are in the same encoding?
Nope, they don't have to be the same encoding. If they're not then the function will have to convert, or iterate over the string.
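The iterate-over-the-string option, sketched in terms of the draft's own functions (dispatching through each string's encoding table is glossed over, and the UINTVAL parameter types are an assumption):

    typedef struct parrot_string STRING;   /* opaque for this sketch */
    typedef unsigned long        UINTVAL;

    UINTVAL get_codepoint(STRING *, UINTVAL offset);   /* from this draft */
    void    set_codepoint(STRING *, UINTVAL offset, UINTVAL codepoint);

    /* Copy 'count' code points from 'src' into 'dest' starting at 'offset',
     * one at a time: decode from src's encoding, re-encode into dest's.
     * A real implementation might instead convert 'src' wholesale first. */
    void
    set_codepoints_slow(STRING *dest, UINTVAL offset, UINTVAL count, STRING *src)
    {
        UINTVAL i;

        for (i = 0; i < count; i++)
            set_codepoint(dest, offset + i, get_codepoint(src, i));
    }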
Same concerns about "offset."void set_bytes(STRING, offset, count, STRING binarystring);
Set count bytes, at offset offset, to the contents of binary string
Why is this function different from set_codepoints in name but not in signature?
Because the semantics of codepoint and binary access are potentially different, so I wanted to distinguish the two operations.
Why aren't strings that are being interpreted as bits a different type from lists of codepoints in a specified encoding?
I'm afraid I've had insufficient coffee to figure out what you mean here.
void become_encoding(STRING *);
Assume the string is the new encoding and make it so. Validate first and throw an exception if this assumption is incorrect.
I don't think that I understand what this function is supposed to do.
Is this supposed to force any given string into the encoding that's specified by the functions in the global table?
This is mainly for use in converting 'encodingless' strings (really binary strings) to real encodings.
If, for example, I read in a file from disk or off a network socket or something, the lowest levels of parrot will see that file as a series of bytes. Now, if we know that the bytes are really valid UTF-8, we'd read them into the string (which will be marked as "8-bit fixed width" encoding) and call the UTF-8's become_encoding function. That'll then scan the buffer, validate that it really is valid UTF-8, and change the string's encoding in place.
It's mainly an in-place 'from binary' function. (Whereas to_encoding is a "change to" function.)
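Roughly, that flow looks like this; both helpers are made-up stand-ins for whatever the IO and encoding-lookup layers end up providing:

    typedef struct parrot_string STRING;   /* opaque for this sketch */

    STRING *read_file_as_binary_string(const char *path);  /* hypothetical IO   */
    void    utf8_become_encoding(STRING *);                /* UTF-8 table entry */

    /* Read raw bytes in as an "8-bit fixed width" binary string, then flip
     * it to UTF-8 in place.  become_encoding validates the buffer, throws if
     * it isn't really UTF-8, and otherwise just changes the encoding marker. */
    STRING *
    slurp_utf8_file(const char *path)
    {
        STRING *s = read_file_as_binary_string(path);
        utf8_become_encoding(s);
        return s;
    }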
----------
Maybe these are standard names in parrot, but would a name like "byte_length" or "num_codepoints" more clearly indicate that what is being returned is in fact a size or length and not the actual bytes or codepoints? Is there a standard convention for finding the length of things?
UINTVAL codepoints(STRING *);
Return the size in codepoints
UINTVAL bytes(STRING *);
Return the size in bytes
There's no standard. Making one's fine, and depending on the plural ending is really a bad idea for non-English speakers. D'oh! byte_length and codepoint_length are just fine.
Also, same concerns about type conflation between lists of bits, independent of encoding, and lists of codepoints with a specified encoding.
Hrm. Bit lists. I suppose we could have a true binary encoding. That'd be interesting...
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk