Hi People,

Back in December I asked a question about utf8 I/O. Leo responded,
pointing me to the encoding filters. I then published a possible
implementation of PIO_utf8_read with a request for comments.

Since that time I have been thinking about the testing and
implementation of I/O filters. Actually, I started by thinking about
how to create a suitable test set for the fragment I had written; that
exercise exposed various shortcomings of the implementation and led to
wider thinking about the entire process. I am primarily thinking about
file I/O, but I can see no reason why this scheme could not apply to
any other form of I/O.

1) The immediate result returned by the lowest level of a read operation
is an undifferentiated type which I am going to call "bytestream". This
type makes no assumptions about the internal encoding of its contents.
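
To make that concrete, here is a minimal C sketch of the idea. The
names are illustrative only and are not the real Parrot string
internals:

    #include <stddef.h>

    /* An undifferentiated byte buffer: no charset, no encoding label.
       Purely illustrative, not the actual Parrot STRING structure. */
    typedef struct bytestream {
        unsigned char *data;  /* raw bytes exactly as read from the source */
        size_t         len;   /* number of bytes currently held */
    } bytestream;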

2) Trans-encoding from a bytestream string to a named charset and
encoding involves:
2.a) confirming that the bytestream converts to an integral number of
characters in the target encoding. The trans-encoding function should
return any trailing character fragments as a bytestream string. 
2.b) labelling the (possibly) truncated string with the target charset
and encoding.
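
As a rough illustration of 2.a for a utf8 target: scan forward whole
characters and stop at the first incomplete sequence, so that the tail
can be handed back as an unconverted bytestream and prepended to the
next read. This is hand-rolled and illustrative only; the real check
would live behind the encoding's own API:

    #include <stddef.h>

    /* Length of the UTF-8 sequence announced by a lead byte,
       or 0 for a continuation/invalid lead byte. */
    static size_t utf8_seq_len(unsigned char b)
    {
        if (b < 0x80)           return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 0;
    }

    /* Byte length of the longest prefix of buf that holds only
       complete UTF-8 characters.  The remaining (len - prefix) bytes
       are the trailing fragment to keep as a bytestream.  (An invalid
       lead byte also stops the scan here; a real implementation would
       report it as an encoding error rather than a fragment.) */
    static size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            size_t n = utf8_seq_len(buf[i]);
            if (n == 0 || i + n > len)
                break;
            i += n;
        }
        return i;
    }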

3) I feel that it would be preferable if the read opcode specified N
characters to be read rather than N bytes. However, to implement this
the PIO_*_read call would have to pass down the maximum byte size of a
character as well as the character count to the fundamental operation.
If N stays as bytes then the implementation will return a
trans-encoding-dependent integral number of characters derived from no
more than N bytes of source data. It may also be desirable to limit
the returned string itself to no more than N bytes.
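
If the opcode were to count characters, the filter could turn that
into a worst-case byte budget for the fundamental operation along
these lines (the descriptor type and names are hypothetical, not the
existing PIO signatures):

    #include <stddef.h>

    /* Hypothetical per-encoding descriptor, for illustration only. */
    typedef struct encoding_info {
        const char *name;
        size_t      max_bytes_per_char;  /* e.g. 1 for ascii, 4 for utf8 */
    } encoding_info;

    /* Worst-case number of source bytes the fundamental read must be
       allowed to fetch so that trans-encoding can yield n_chars
       complete characters. */
    static size_t read_byte_budget(const encoding_info *enc, size_t n_chars)
    {
        return n_chars * enc->max_bytes_per_char;
    }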

4) PIO_*_peek needs to include a parameter to specify the maximum byte
length of one character in the target charset / encoding, so that the
fundamental operation can guarantee supplying enough bytes to yield a
complete character after trans-encoding.
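
The guarantee I have in mind looks roughly like this: keep pulling raw
bytes from the layer below until at least one worst-case character's
worth is buffered. Every name here is made up for illustration; this
is not the current ParrotIO layer API:

    #include <stddef.h>

    /* Minimal stand-in for a buffering layer. */
    typedef struct peek_buf {
        unsigned char data[16];  /* assumes max_char_bytes <= 16 */
        size_t        avail;
    } peek_buf;

    /* Callback that pulls more raw bytes from the layer below. */
    typedef size_t (*fill_fn)(void *below, unsigned char *dst, size_t want);

    /* Ensure at least max_char_bytes raw bytes are buffered (or the
       source is exhausted) before the filter decodes the peeked
       character from the front of the buffer. */
    static size_t peek_ensure(peek_buf *pb, void *below, fill_fn fill,
                              size_t max_char_bytes)
    {
        while (pb->avail < max_char_bytes) {
            size_t got = fill(below, pb->data + pb->avail,
                              max_char_bytes - pb->avail);
            if (got == 0)
                break;           /* EOF: a partial character is possible */
            pb->avail += got;
        }
        return pb->avail;
    }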

5) Seeking through an encoding filter could be highly problematic.
Filters such as "utf8" that have a variable byte-per-character ratio
should politely refuse seeks.
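
A polite refusal could be as simple as this sketch (ESPIPE is just
borrowed as the conventional "illegal seek" errno; not existing Parrot
code):

    #include <errno.h>

    /* A variable-width encoding layer cannot map a character offset to
       a byte offset without scanning, so its seek entry declines. */
    static int utf8_layer_seek(void *layer, long offset, int whence)
    {
        (void)layer; (void)offset; (void)whence;
        errno = ESPIPE;
        return -1;
    }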

6) Use of escape codes also makes character counts unpredictable. The
generation and normalisation of escape codes during trans-encoding is
very DWIM, but the documentation needs to explicitly set a policy on
this behaviour. In general the use of HTML-style entity codes is
preferable to C-style \nnn codes, as entities can be normalised into
any encoding that supports the character rather than requiring the
programmer to guess the original encoding.
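
For example, trans-encoding down to 7-bit ASCII might emit an entity
for anything it cannot represent, which a later pass into a richer
encoding can turn back into a plain character. A sketch, not existing
code:

    #include <stdio.h>

    /* Write the code point as ASCII if it fits, otherwise as an
       HTML-style numeric entity that survives further re-encoding. */
    static int put_ascii_or_entity(FILE *out, unsigned long codepoint)
    {
        if (codepoint < 0x80)
            return fputc((int)codepoint, out);
        return fprintf(out, "&#%lu;", codepoint);
    }

So the euro sign (U+20AC) would come out as "&#8364;" rather than a
byte sequence the reader has to decode by guessing the source
encoding.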

7) The line-buffered read function should be removed from the
fundamental operations and made into a filter layer similar to the
"buf" layer. There is no guarantee that the underlying data source
will conform to the current system's notion of a line ending, and a
filter layer would be able to compensate for that.
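
A 'by_line' filter could then carry the line-end convention itself,
along the lines of this sketch (illustrative only):

    #include <stddef.h>
    #include <string.h>

    /* Return the length of the first complete line in buf, marker
       included, or 0 if no full line (using the configured line-end
       marker eol) has been buffered yet. */
    static size_t line_filter_scan(const unsigned char *buf, size_t len,
                                   const unsigned char *eol, size_t eol_len)
    {
        size_t i;
        if (eol_len == 0 || len < eol_len)
            return 0;
        for (i = 0; i + eol_len <= len; i++) {
            if (memcmp(buf + i, eol, eol_len) == 0)
                return i + eol_len;
        }
        return 0;
    }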

8) There would be advantages to having a PIO_*_get_encoding function in
the I/O interface to allow enquiries about the returned encoding from
lower levels.
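
The shape I imagine is a simple enquiry that each layer either answers
for itself or delegates downwards. The struct and function here are
hypothetical, not the existing layer API:

    /* Each layer either knows what encoding it hands upward or
       forwards the question to the layer below it. */
    typedef struct io_layer io_layer;
    struct io_layer {
        const char *encoding;  /* NULL if this layer is encoding-neutral */
        io_layer   *down;
    };

    static const char *pio_get_encoding(const io_layer *layer)
    {
        while (layer) {
            if (layer->encoding)
                return layer->encoding;
            layer = layer->down;
        }
        return "bytestream";   /* nothing below claimed an encoding */
    }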

Okay some examples ...

$P0 = open "foo"
push $P0, 'ascii'
push $P0, 'by_line'
This would be a standard line-oriented read/write.

$P0 = open "foo"
push $P0, 'utf16'
push $P0, 'by_line'
push $P0, 'utf8'
This could be used to read a Windows Unicode (UTF-16) file while all
internal processing is done in utf8. 'by_line' would need initialising
with a non-default line-end marker.

$P0 = open "foo"
push $P0, 'ebcdic'
push $P0, 'ascii'
For mainframes.

$P0 = open "foo"
push $P0, 'encrypt_blowfish'
push $P0, 'adaptive_huffman'
push $P0, 'escaped_ascii'
push $P0, 'utf8'
You can figure it out .... 


Cheers,

Steve Gunnell

