Angus Leeming wrote:
Abdelrazak Younes <[EMAIL PROTECTED]> writes:
Hum... I am not sure I follow everything, but let me summarize what I
understand from current code. The std::vectors I am talking about are:
* vector<char>: could be replaced by std::basic_string<char>
* vector<unsigned char>: that is ucs2 right? That could be replaced by
std::basic_string<unsigned char>
* vector<boost::uint32_t>: I guess that is ucs4 and that could be
replaced by std::basic_string<boost::uint32_t>
Internally we should just use one of those three types. The conversion
to this complicated utf8 encoding should happen on input/output only.
Handling a multi-byte encoding internally is just a recipe for a buggy
future IMHO.
So what do I not get right here?
Lots ;-)
UTF-8 is a multi-byte encoding. It's useful for output to file because the
data are stored as characters (bytes). So, much of a UTF-8 encoded file will
be human readable; only the multi-byte sequences will not.
Storing UTF-8 encoded data in a std::vector<char> (or uchar) is eminently
sensible because you're telling your users that the container is just that; a
container. You don't plan on using it for anything other than storage and
transport from one part of the code to another. In particular, you certainly
don't plan on using it to perform string manipulations.
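To illustrate the split between "storage" and "manipulation", here is a minimal sketch of decoding a UTF-8 byte container into UCS-4 at the input boundary. The function name utf8_to_ucs4 is hypothetical (it is not taken from the existing code), std::u32string (C++11) stands in for a basic_string of boost::uint32_t, and validation is deliberately minimal: overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte buffer into UCS-4 code points.
// The vector<char> is treated purely as storage; all string manipulation
// would then happen on the returned fixed-width string.
std::u32string utf8_to_ucs4(const std::vector<char>& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        // i < in.size() holds here, so the subtraction cannot underflow.
        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With the six bytes 'a', C3 A9, E2 82 AC as input, the result holds three code points: U+0061, U+00E9 and U+20AC.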
UCS-4 encodes all characters in the known universe, just as UTF-8 does, but
each and every character takes up 4 bytes. It's reasonable to use a 32 bit
unsigned int to store each character. The advantage of UCS-4 is that all
characters take up the same space, so
std::basic_string<boost::uint32_t>::length() is actually meaningful.
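A small sketch of why length() only becomes meaningful with a fixed-width encoding. It assumes C++11 and uses std::u32string as a stand-in for a basic_string of boost::uint32_t; the helper utf8_code_points is hypothetical and assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: count the characters (code points) in a UTF-8
// buffer by skipping continuation bytes, i.e. bytes of the form 10xxxxxx.
// Assumes the input is well-formed UTF-8.
std::size_t utf8_code_points(const std::vector<char>& bytes)
{
    std::size_t n = 0;
    for (char c : bytes) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// "naïve" as UTF-8: the ï (U+00EF) occupies two bytes, C3 AF.
std::vector<char> naive_utf8()
{
    return {'n', 'a', '\xC3', '\xAF', 'v', 'e'};
}

// The same word as UCS-4: one 32-bit unit per character.
std::u32string naive_ucs4()
{
    return U"na\u00EFve";
}
```

Here naive_utf8().size() is 6, while utf8_code_points(naive_utf8()) and naive_ucs4().length() are both 5. The UTF-8 container needs a scan over every byte to learn the character count; the UCS-4 string reports it directly.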
exactly my point ;-)
Switching from vector to basic_string for ucs2 and 4 will simplify the code.
Any clearer?
Man, that's exactly what I understood. My choice of terms was
misleading as Georg pointed out (multi-byte versus variable-byte).
Abdel.