Steve Gunnell wrote:
Hi People,

Back in December I asked a question about utf8 I/O. Leo responded
pointing me at the encoding filters. I then published a possible
implementation of PIO_utf8_read with a request for comments.

Since that time I have been thinking about the testing and
implementation of I/O filters. Actually I started thinking about how to
create a suitable test set for the fragment I had written, which fed back
various shortcomings of the implementation and led to wider thinking
about the entire process. I am primarily thinking about file I/O, but I
can see no reason why this scheme cannot apply to any other form of I/O.

1) The immediate result returned by the lowest level of a read operation
is an undifferentiated type which I am going to call "bytestream". This
type makes no assumptions about the internal encoding of its contents.

Yep.

2) Trans-encoding from a bytestream string to a named charset and
encoding involves:
2.a) confirming that the bytestream converts to an integral number of
characters in the target encoding. The trans-encoding function should
return any trailing character fragments as a bytestream string.

Or warn or throw an exception.
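A minimal sketch of point 2.a, with invented names (this is not Parrot's actual API): scan back from the end of the byte buffer and report how many trailing bytes form an incomplete UTF-8 sequence, so the caller can hold them back as a bytestream fragment for the next read.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the layer_api: given `len` bytes,
 * return how many trailing bytes are an incomplete UTF-8 sequence
 * (0..3).  Those bytes would be kept as a "bytestream" fragment. */
static size_t
utf8_trailing_fragment(const unsigned char *buf, size_t len)
{
    size_t i = len;
    /* Scan back over up to 3 continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        --i;
    if (i == 0)
        return 0;               /* only continuation bytes: give up */

    unsigned char lead = buf[i - 1];
    size_t need;                /* total bytes the lead byte announces */
    if      ((lead & 0x80) == 0x00) need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;              /* stray continuation byte: let the
                                   decoder report the error instead */

    size_t have = len - (i - 1);
    return (have < need) ? have : 0;   /* fragment length, or 0 */
}
```

Whether the trans-encoding function returns the fragment, warns, or throws would then be a policy decision made one level up.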

3) I feel that it would be preferable if the read opcode specified N
characters to be read rather than N bytes.

Yep. If the user pushed a UTF8 input filter, it's pretty clear that they want to deal with chars and not bytes.

... However, to implement this the
PIO_*_read call would have to pass down the maximum byte size of a
character as well as the character count to the fundamental operation.

A utf8 input filter would read bytes one by one from the underlying 'buf' layer and convert N chars on the fly. A fixed-width encoding filter can just multiply N by bytes_per_char. I don't see a problem here.
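To illustrate the on-the-fly conversion (names invented, not Parrot's layer API): a utf8 filter driven by a character count just walks lead bytes until it has consumed N characters or the buffer runs out, while a fixed-width filter reduces to N * bytes_per_char.

```c
#include <stddef.h>

/* Illustrative sketch only: count how many bytes the first `nchars`
 * UTF-8 characters in `buf` occupy, the way a utf8 filter sitting on
 * a buf layer would.  Stops early at an incomplete trailing sequence
 * or when the buffer is exhausted. */
static size_t
utf8_bytes_for_chars(const unsigned char *buf, size_t len, size_t nchars)
{
    size_t i = 0;
    while (nchars > 0 && i < len) {
        unsigned char lead = buf[i];
        size_t need = 1;        /* ASCII / single byte by default */
        if      ((lead & 0xE0) == 0xC0) need = 2;
        else if ((lead & 0xF0) == 0xE0) need = 3;
        else if ((lead & 0xF8) == 0xF0) need = 4;
        if (i + need > len)
            break;              /* incomplete trailing sequence */
        i += need;
        --nchars;
    }
    return i;
}
```

So a character-count read needs no up-front "max bytes per char" parameter for utf8; only fixed-width encodings can take the multiplication shortcut.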

4) PIO_*_peek needs to include a parameter to specify the maximum byte
length of one character in the target charset / encoding so that the
fundamental operation can guarantee returning enough bytes to return a
character after trans-encoding.

Or PIO_peek is disabled for e.g. utf8 filters and returns an error.

5) Seeking through an encoding filter could be highly problematic.
Filters such as "utf8" that have a variable bytes-per-character
ratio should politely refuse seeks.

Yep - same.
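"Politely refuse" can be as simple as a seek entry that always errors out (sketch with invented names, not the actual layer_api vtable), since byte offsets and character offsets don't line up for a variable-width encoding:

```c
/* Hypothetical seek entry for a utf8 layer: always refuse, so the
 * caller gets an error instead of a corrupt mid-character position. */
static int
utf8_layer_seek(void *io, long offset, int whence)
{
    (void)io; (void)offset; (void)whence;   /* deliberately unused */
    return -1;                              /* seek not supported */
}
```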


6) Use of escape codes also adds a non-deterministic level to character
counts.

That's an entirely different problem and doesn't have much in common with, e.g., a utf8 input filter.

7) The line buffered read function should be removed from the
fundamental operations and made into a filter layer similar to the "buf"
layer.

There is no line buffered read function in the layer_api. io_buf does exactly what you are proposing.

8) There would be advantages to having a PIO_*_get_encoding function in
the I/O interface to allow enquiries about the returned encoding from
lower levels.

I'm not sure about that.

Cheers,

Steve Gunnell

leo
