Abdelrazak Younes <[EMAIL PROTECTED]> writes:
> Hum... I am not sure I follow everything, but let me summarize what I 
> understand from current code. The std::vectors I am talking about are:
> 
> * vector<char>: could be replaced by std::basic_string<char>
> * vector<unsigned char>: that is ucs2 right? That could be replaced by 
> std::basic_string<unsigned char>
> * vector<boost::uint32_t>: I guess that is ucs4 and that could be 
> replaced by std::basic_string<unsigned char>
> 
> Internally we should just use one of those three types. The conversion 
> to this complicated utf8 encoding should happen on input/output only. 
> Handling a multi-byte encoding internally is just a recipe for a buggy 
> future IMHO.
> 
> So what do I not get right here?

Lots ;-)

UTF-8 is a multi-byte encoding. It's useful for output to file because ASCII 
characters are stored unchanged, one byte each. So much of a UTF-8 encoded 
file will be human readable; only the multi-byte sequences will not.
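
To make the "multi-byte" part concrete, here is a minimal sketch of how one
code point becomes one to four bytes. The helper name append_utf8 is made up
for illustration and is not taken from any existing code; it assumes the
input is a valid Unicode scalar value and does no further checking.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: append the UTF-8 encoding of one code point to `out`.
// Assumes cp is a valid Unicode scalar value (surrogates are not rejected).
inline void append_utf8(std::uint32_t cp, std::vector<char>& out)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        // ASCII: stored unchanged in a single byte.
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

So 'A' (U+0041) stays the single byte 0x41, while 'é' (U+00E9) becomes the
two bytes 0xC3 0xA9.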

Storing UTF-8 encoded data in a std::vector<char> (or uchar) is eminently 
sensible because you're telling your users that the container is just that: a 
container. You don't plan on using it for anything other than storage and 
transport from one part of the code to another. In particular, you certainly 
don't plan on using it to perform string manipulations.
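
A small sketch of why such a container is storage only: size() counts bytes,
not characters. The function names below are invented for the example, and
the counting loop assumes the buffer holds well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: count characters (code points) in a UTF-8 buffer.
// Assumes well-formed input; no validation is attempted.
inline std::size_t count_code_points(const std::vector<char>& utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new character.
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// "né": 'n' is one byte, 'é' is the two-byte sequence C3 A9.
inline void utf8_length_demo()
{
    const std::vector<char> buf = {'n', '\xC3', '\xA9'};
    assert(buf.size() == 3);               // bytes in the container
    assert(count_code_points(buf) == 2);   // characters a reader would see
}
```

Anything indexed by position (erase, substr, reverse) has the same problem:
it can cut a multi-byte sequence in half.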

UCS-4 encodes all characters in the known universe, just as UTF-8 does, but 
each and every character takes up 4 bytes. It's reasonable to use a 32 bit 
unsigned int to store each character. The advantage of UCS-4 is that all 
characters take up the same space, so 
std::basic_string<boost::uint32_t>::length() is actually meaningful: it 
returns the number of characters, not the number of bytes.
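
As a rough illustration, the sketch below decodes a UTF-8 buffer into one
32-bit unit per character. It uses std::u32string (C++11) as a stand-in for
std::basic_string<boost::uint32_t>; that substitution, and the function name
utf8_to_ucs4, are assumptions made for the example. It expects well-formed
input and does not validate it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into one 32-bit unit per
// character. Malformed input is not detected, apart from truncation.
inline std::u32string utf8_to_ucs4(const std::vector<char>& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // The lead byte determines how many bytes the sequence occupies.
        const std::size_t len =
            lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        assert(i + len <= utf8.size());  // a truncated sequence is a bug here
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For the six bytes 'n', C3 A9, E2 82 AC ("né€"), the result has length() == 3,
and out[1] is U+00E9 whichever bytes encoded it.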

Any clearer?
Angus
