Abdelrazak Younes <[EMAIL PROTECTED]> writes:

| > UTF-8 is a multi-byte encoding. It's useful for output to file
| > because the data are stored as characters (bytes). So, much of a
| > UTF-8 encoded file will be human readable; only the multi-byte
| > sequences will not.
| >
| > Storing UTF-8 encoded data in a std::vector<char> (or uchar) is
| > eminently sensible because you're telling your users that the
| > container is just that; a container. You don't plan on using it for
| > anything other than storage and transport from one part of the code
| > to another. In particular, you certainly don't plan on using it to
| > perform string manipulations.
| >
| > UCS-4 encodes all characters in the known universe, just as UTF-8
| > does, but each and every character takes up 4 bytes. It's reasonable
| > to use a 32 bit unsigned int to store each character. The advantage
| > of UCS-4 is that all characters take up the same space, so
| > std::basic_string<boost::uint32_t>::length() is actually meaningful.
|
| exactly my point ;-)
| Switching from vector to basic_string for UCS-2 and UCS-4 will
| simplify the code.
But that does not mean that the unicode.[Ch] API should change.

-- Lgb
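
To make the length argument in the quoted text concrete, here is a minimal sketch, not the actual unicode.[Ch] code. The names utf8_to_ucs4, sample_utf8 and length_demo are hypothetical, and std::u32string (C++11) stands in for std::basic_string<boost::uint32_t>, since the standard library only guarantees char_traits for its own character types. The decoder is deliberately simplified: it rejects truncated sequences and bad continuation bytes, but does not check for overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not part of unicode.[Ch]):
// decode UTF-8 bytes into one 32-bit code unit per character.
std::u32string utf8_to_ucs4(const std::vector<char>& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// "né" in UTF-8: 'n' is one byte, 'é' (U+00E9) is two bytes (0xC3 0xA9).
std::vector<char> sample_utf8()
{
    return { 'n', static_cast<char>(0xC3), static_cast<char>(0xA9) };
}

// The byte container counts bytes; the UCS-4 string counts characters.
void length_demo()
{
    const std::vector<char> utf8 = sample_utf8();
    assert(utf8.size() == 3);
    assert(utf8_to_ucs4(utf8).length() == 2);
}
```

The vector is only storage and transport, so its size() says nothing about the number of characters; after decoding, length() does, which is the property the quoted text relies on.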