Angus Leeming wrote:
Abdelrazak Younes <[EMAIL PROTECTED]> writes:
Hum... I am not sure I follow everything, but let me summarize what I understand from the current code. The std::vectors I am talking about are:

* vector<char>: could be replaced by std::basic_string<char>
* vector<unsigned char>: that is ucs2 right? That could be replaced by std::basic_string<unsigned char>
* vector<boost::uint32_t>: I guess that is ucs4 and that could be replaced by std::basic_string<boost::uint32_t>

Internally we should just use one of those three types. The conversion to this complicated utf8 encoding should happen on input/output only. Handling a multi-byte encoding internally is just a recipe for a buggy future IMHO.

So what am I not getting right here?

Lots ;-)

UTF-8 is a multi-byte encoding. It's useful for output to file because the data are stored as characters (bytes). So, much of a UTF-8 encoded file will be human readable; only the multi-byte sequences will not.

Storing UTF-8 encoded data in a std::vector<char> (or uchar) is eminently sensible because you're telling your users that the container is just that; a container. You don't plan on using it for anything other than storage and transport from one part of the code to another. In particular, you certainly don't plan on using it to perform string manipulations.
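To illustrate the point about UTF-8 being a variable-length encoding held in a plain byte container, here is a minimal sketch. The helper name append_utf8 is hypothetical and is not taken from the LyX sources; it assumes the caller passes a valid code point and does not reject surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the LyX code base): append the UTF-8
// encoding of one code point to a byte container.
// Assumption: cp is a valid code point; surrogates are not rejected.
void append_utf8(std::vector<char>& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // anything larger is a caller error here
    if (cp < 0x80) {
        // 1 byte: plain ASCII, which is why such files stay readable
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

Appending 'A' (U+0041), 'é' (U+00E9) and '€' (U+20AC) to an empty vector leaves six bytes for three characters, so the container's size() says nothing about the number of characters it holds.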

UCS-4 encodes all characters in the known universe, just as UTF-8 does, but each and every character takes up 4 bytes. It's reasonable to use a 32 bit unsigned int to store each character. The advantage of UCS-4 is that all characters take up the same space, so std::basic_string<boost::uint32_t>::length() is actually meaningful.
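A minimal sketch of why length() becomes meaningful once the data is in a fixed-width form. It uses std::u32string (C++11) as a stand-in for std::basic_string<boost::uint32_t>, which is an assumption made here for portability. The function name utf8_to_ucs4 is hypothetical, and the decoder assumes well-formed input and does no validation beyond checking for truncation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into one 32-bit unit per character.
// Assumption: the input is well-formed UTF-8; invalid sequences are not detected.
std::u32string utf8_to_ucs4(const std::vector<char>& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes to read
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        ++i;
        assert(i + extra <= bytes.size());  // truncated input is a caller error
        for (; extra > 0; --extra, ++i) {
            // each continuation byte contributes its low 6 bits
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the six bytes that encode "aé€" in UTF-8, the returned string has length() 3, one element per character, whereas the byte container reports 6.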

exactly my point ;-)
Switching from vector to basic_string for ucs2 and 4 will simplify the code.


Any clearer?

Man, that's exactly what I understood. My choice of terms was misleading, as Georg pointed out (multi-byte versus variable-byte).

Abdel.
