Hi People,

Back in December I asked a question about utf8 I/O. Leo responded, pointing me at the encoding filters. I then published a possible implementation of PIO_utf8_read with a request for comments.
Since that time I have been thinking about the testing and implementation of I/O filters. Actually I started out trying to create a suitable test set for the fragment I had written; that exercise exposed various shortcomings in the implementation and led to wider thinking about the entire process. I am primarily thinking about file I/O, but I can see no reason why this scheme cannot apply to any other form of I/O.

1) The immediate result returned by the lowest level of a read operation is an undifferentiated type which I am going to call "bytestream". This type makes no assumptions about the internal encoding of its contents.

2) Trans-encoding from a bytestream string to a named charset and encoding involves:

2.a) confirming that the bytestream converts to an integral number of characters in the target encoding. The trans-encoding function should return any trailing character fragments as a bytestream string (see the first sketch in the P.S. below).

2.b) labelling the (possibly) truncated string with the target charset and encoding.

3) I feel that it would be preferable if the read opcode specified N characters to be read rather than N bytes. However, to implement this the PIO_*_read call would have to pass down the maximum byte size of a character as well as the character count to the fundamental operation (see the second sketch in the P.S.). If N stays as bytes, then the implementation will return a trans-encoding-dependent integral number of characters derived from no more than N bytes of source data. It may also be desirable to limit the returned string to no more than N bytes.

4) PIO_*_peek needs to include a parameter specifying the maximum byte length of one character in the target charset / encoding, so that the fundamental operation can guarantee returning enough bytes to yield a whole character after trans-encoding.

5) Seeking through an encoding filter could be highly problematic. Filters such as "utf8" that do not have a fixed bytes-per-character ratio should politely refuse seeks.

6) The use of escape codes also makes character counts unpredictable. The generation and normalisation of escape codes during trans-encoding is very DWIM, but the documentation needs to set an explicit policy on this behaviour. In general the use of HTML-style entity codes is preferable to C-style \nnn codes, as they can be normalised to any encoding that supports them rather than requiring the programmer to guess the original encoding.

7) The line-buffered read function should be removed from the fundamental operations and made into a filter layer similar to the "buf" layer. There is no guarantee that the underlying data source conforms to the line-end notions of the current system, and a filter layer could compensate for that (see the third sketch in the P.S.).

8) There would be advantages to having a PIO_*_get_encoding function in the I/O interface to allow enquiries about the encoding returned from lower levels.

Okay, some examples ...

    $P0 = open "foo"
    push $P0, 'ascii'
    push $P0, 'by_line'

This would be a standard line-oriented read/write.

    $P0 = open "foo"
    push $P0, 'utf16'
    push $P0, 'by_line'
    push $P0, 'utf8'

This could be used to read a Windows Unicode file while all internal processing is done using the utf8 encoding. 'by_line' would need initialisation with a non-default line-end marker.

    $P0 = open "foo"
    push $P0, 'ebcdic'
    push $P0, 'ascii'

For mainframes.

    $P0 = open "foo"
    push $P0, 'encrypt_blowfish'
    push $P0, 'adaptive_huffman'
    push $P0, 'escaped_ascii'
    push $P0, 'utf8'

You can figure it out ....

Cheers,

Steve Gunnell
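
P.S. To make a couple of the points above more concrete, here are some rough C sketches. None of this is the existing PIO API; every name, type and signature below is invented purely for illustration.

For 2.a, a utf8 trans-encoder has to decide where the last complete character ends so that it can hand the trailing fragment back as bytestream. Something along these lines:

#include <stddef.h>

/* How many bytes a UTF-8 sequence should have, judging from its lead
 * byte, or 0 if the byte cannot start a sequence. */
static size_t
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)            return 1;   /* ASCII     */
    if ((lead & 0xE0) == 0xC0)  return 2;   /* 110xxxxx  */
    if ((lead & 0xF0) == 0xE0)  return 3;   /* 1110xxxx  */
    if ((lead & 0xF8) == 0xF0)  return 4;   /* 11110xxx  */
    return 0;                               /* continuation / invalid */
}

/* Return the length of the longest prefix of buf that holds only
 * complete UTF-8 characters; anything beyond that length is the
 * trailing fragment that 2.a says should be handed back. */
size_t
utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return 0;

    /* A UTF-8 character is at most 4 bytes, so the lead byte of an
     * incomplete trailing character is within the last 4 bytes. */
    i = len;
    while (i > 0 && len - i < 4) {
        i--;
        size_t need = utf8_seq_len(buf[i]);
        if (need == 0)
            continue;           /* continuation byte, keep looking */
        if (len - i < need)
            return i;           /* last character is truncated     */
        return len;             /* last character is complete      */
    }
    return len;  /* no lead byte found; leave validation to the decoder */
}

The bytes past the returned length are the fragment that would be prepended to the next chunk of bytestream before the next trans-encode.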
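For 3 and 4, once the opcode asks for N characters, the filter can size its request to the layer below as N times the widest character in the encoding and then cut the result back to whole characters. The raw_read_fn type is just a stand-in for whatever the lower layer's read really looks like, and the helper utf8_seq_len() is the one from the sketch above:

#include <stddef.h>

/* Stand-in for the lower layer's raw read; not the real PIO signature. */
typedef size_t (*raw_read_fn)(void *layer, unsigned char *buf, size_t len);

/* Read up to n_chars whole utf8 characters.  The byte budget handed to
 * the layer below is n_chars * max_bytes_per_char (point 3); the result
 * is then walked forward and cut at the last complete character within
 * the first n_chars.  Returns the byte length kept; a real layer would
 * stash buf[kept..got) as pending input for the next call. */
size_t
read_n_chars(void *below, raw_read_fn raw_read, unsigned char *buf,
             size_t n_chars, size_t max_bytes_per_char)
{
    size_t want = n_chars * max_bytes_per_char;
    size_t got  = raw_read(below, buf, want);

    size_t i = 0, chars = 0;
    while (i < got && chars < n_chars) {
        size_t need = utf8_seq_len(buf[i]);   /* helper from the 2.a sketch */
        if (need == 0 || i + need > got)
            break;            /* bad lead byte or truncated character */
        i += need;
        chars++;
    }
    return i;
}

The point is only that the byte budget handed to the lower layer has to be derived from the character count times the maximum character width; the exact character count is then enforced while walking the trans-encoded result.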
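And for 7, a 'by_line' filter layer sitting above the buffering layer only needs to scan its buffered bytes for whatever line-end marker it was initialised with, which is why the utf16 example above notes that 'by_line' needs a non-default marker:

#include <stddef.h>
#include <string.h>

/* Scan buffered input for an arbitrary line-end marker ("\n", "\r\n",
 * a utf16 newline, whatever the layer was initialised with).  Returns
 * the number of bytes up to and including the marker, or 0 if no
 * complete line is buffered yet and more data is needed from below. */
size_t
find_line(const unsigned char *buf, size_t len,
          const unsigned char *eol, size_t eol_len)
{
    size_t i;

    if (eol_len == 0 || len < eol_len)
        return 0;

    for (i = 0; i + eol_len <= len; i++) {
        if (memcmp(buf + i, eol, eol_len) == 0)
            return i + eol_len;   /* complete line, marker included */
    }
    return 0;
}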