Hi,

Now that we've got ICU in, I thought it might be time to revisit the encodings implementation. I am thoroughly ignorant of Unicode/encoding issues, so please be patient with me. :)

From what I have asked people on IRC, and from what's in the list archives, my understanding of how parrot will work with various encodings is:

  i) After an IO operation, strings are kept in their original encoding, whatever it is.

  ii) Parrot will try to keep the string in that encoding for as long as possible.

  iii) When some operation that requires ICU is requested, parrot will feed the string to ICU, which will convert it to UTF-16 (its internal encoding) and then perform the desired operation.

Please correct me if this is wrong. Now, my questions are:

I. About iii): I can imagine at least three different options for what to do with the converted UTF-16 string:

  a) We can discard the UTF-16 version and recompute the conversion each time. (This is costly, isn't it?)

  b) We can replace the original string with the "upgraded" version, so strings will lazily become converted to UTF-16. (This makes sure that the conversion is only done once, but is conversion to UTF-16 always lossless?)

  c) We can store the UTF-16 version alongside the original one. (This doubles the memory usage, plus it may be hard to implement.)

Each approach has its pros and cons. Which one is the right one?

II. About ii): At exactly which point do we decide to feed the string to ICU, and what operations should we (as parrot developers) implement in our own layer?

For example, let's take a relatively simple operation, such as uppercasing a string, and let's assume that the string is in a friendly encoding, such as ISO-8859-1. Even with these assumptions, conversion to uppercase is not straightforward, since it's locale-dependent (or, to be more precise, it might be locale-dependent if the user so chooses). We could, of course, implement all locale-aware operations for each encoding and each locale, but how much work do we want to put into this?

So, exactly which string functionality do we want to implement ourselves on a per-encoding basis, and which are we going to forward to ICU?
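
To make the options in question I a bit more concrete, here is a rough Python sketch of (a), (b) and (c). This is not Parrot or ICU code: PString and all of its method names are made up for illustration, and Python's own codecs stand in for the ICU conversion. The two helper functions at the end only check ISO-8859-1, so they say nothing about whether other encodings survive the trip to UTF-16.

```python
# Hypothetical sketch: none of these names are Parrot or ICU APIs.
# Python's built-in codecs stand in for the conversion ICU would do.

class PString:
    """A string kept as raw bytes in its original encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding
        self.utf16 = None       # side copy, used only by option (c)
        self.conversions = 0    # how many times we transcoded

    def _to_utf16(self) -> bytes:
        self.conversions += 1
        return self.data.decode(self.encoding).encode("utf-16-le")

    def utf16_discard(self) -> bytes:
        # Option (a): convert on every call and keep nothing.
        return self._to_utf16()

    def utf16_upgrade(self) -> bytes:
        # Option (b): replace the original bytes with the UTF-16 form.
        if self.encoding != "utf-16-le":
            self.data = self._to_utf16()
            self.encoding = "utf-16-le"
        return self.data

    def utf16_cached(self) -> bytes:
        # Option (c): keep the UTF-16 form next to the original.
        if self.utf16 is None:
            self.utf16 = self._to_utf16()
        return self.utf16


def latin1_roundtrips() -> bool:
    """Check that every ISO-8859-1 byte survives a trip through UTF-16."""
    original = bytes(range(256))
    via_utf16 = original.decode("iso-8859-1").encode("utf-16-le")
    return via_utf16.decode("utf-16-le").encode("iso-8859-1") == original


def upper_stays_in_latin1(ch: str) -> bool:
    """Check whether the (locale-independent) uppercase of ch
    can still be stored as ISO-8859-1."""
    try:
        ch.upper().encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    s = PString("caf\u00e9".encode("iso-8859-1"), "iso-8859-1")
    s.utf16_cached()
    s.utf16_cached()
    print("conversions with option (c):", s.conversions)
    print("ISO-8859-1 round-trips:", latin1_roundtrips())
    print("upper of U+00FF fits in ISO-8859-1:", upper_stays_in_latin1("\u00ff"))
```

The last helper is relevant to question II: even before locales enter the picture, the uppercase of U+00FF (y with diaeresis) is U+0178, which ISO-8859-1 cannot represent, and the uppercase of U+00DF (sharp s) is the two-character string "SS". So a per-encoding uppercase routine would have to either change the string's encoding or give up on some characters.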
-angel