Re: [patch] fix plain text output

Abdelrazak Younes Wed, 16 Aug 2006 10:20:21 -0700

Angus Leeming wrote:

Abdelrazak Younes <[EMAIL PROTECTED]> writes:
Hum... I am not sure I follow everything, but let me summarize what I understand from the current code. The std::vectors I am talking about are:
* vector<char>: could be replaced by std::basic_string<char>
* vector<unsigned char>: that is ucs2, right? That could be replaced by std::basic_string<unsigned char>
* vector<boost::uint32_t>: I guess that is ucs4, and that could be replaced by std::basic_string<boost::uint32_t>
Internally we should just use one of those three types. The conversion to this complicated utf8 encoding should happen on input/output only. Handling a multi-byte encoding internally is just a recipe for a buggy future IMHO.
So what am I not getting right here?
Lots ;-)
UTF-8 is a multi-byte encoding. It's useful for output to file because the data are stored as characters (bytes). So, much of a UTF-8 encoded file will be human readable; only the multi-byte sequences will not.
Storing UTF-8 encoded data in a std::vector<char> (or uchar) is eminently sensible because you're telling your users that the container is just that; a container. You don't plan on using it for anything other than storage and transport from one part of the code to another. In particular, you certainly don't plan on using it to perform string manipulations.
UCS-4 encodes all characters in the known universe, just as UTF-8 does, but each and every character takes up 4 bytes. It's reasonable to use a 32 bit unsigned int to store each character. The advantage of UCS-4 is that all characters take up the same space, so std::basic_string<boost::uint32_t>::length() is actually meaningful.
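
To make the length point concrete, here is a minimal sketch of decoding UTF-8 bytes into fixed-width UCS-4 units. It is an illustration only, not the conversion code from the patch: the helper name utf8_to_ucs4 is hypothetical, and std::u32string (C++11) stands in for std::basic_string<boost::uint32_t>. Validation is deliberately minimal (overlong forms and surrogates are not rejected).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte sequence into UCS-4 code points.
// Assumption: std::u32string stands in for basic_string<boost::uint32_t>.
// Only lead bytes, continuation bytes and truncation are checked.
std::u32string utf8_to_ucs4(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        // The sequence must not run past the end of the buffer.
        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, a std::string holding "h\xC3\xA9" "llo" has size() 6 (a byte count), while the decoded string has length() 5 (a character count).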


exactly my point ;-)
Switching from vector to basic_string for ucs2 and 4 will simplify the code.
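
A rough sketch of the kind of simplification meant here, under the same assumption that std::u32string stands in for the ucs4 basic_string. Both function names are hypothetical; they only contrast a substring search on a vector with the same search on a string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Vector-based storage: searching needs <algorithm> plus manual
// iterator-to-index bookkeeping. (Hypothetical helper for illustration.)
std::size_t find_in_vector(const std::vector<char32_t>& hay,
                           const std::vector<char32_t>& needle)
{
    if (needle.empty())
        return 0;  // match basic_string::find semantics for an empty needle
    auto it = std::search(hay.begin(), hay.end(),
                          needle.begin(), needle.end());
    return it == hay.end()
        ? std::u32string::npos
        : static_cast<std::size_t>(it - hay.begin());
}

// String-based storage: the same operation is a member function.
std::size_t find_in_string(const std::u32string& hay,
                           const std::u32string& needle)
{
    return hay.find(needle);
}
```

The same applies to substr(), append() and comparison operators, which basic_string provides directly and vector does not.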

Any clearer?

Man, that's exactly what I understood. My choice of terms was misleading, as Georg pointed out (multi-byte versus variable-byte).


Abdel.
