Thanks for the feedback. My changes are now in the "multibyte" branch at
https://github.com/ericfischer/coreutils/tree/multibyte
branched from the savannah coreutils repository.

I've moved my lib changes (all multibyte or wide versions of existing
single-byte functions) into a shared file in the src directory. They
could be moved upstream if they turn out to be useful outside the scope
of coreutils.

If I'm reading your web page and code correctly, it sounds like the main
things we disagree upon are:

 * Character widths. I treat any printing character as being of equal
   width (as it is on my display); you use wcswidth() to try to identify
   the characters' widths.

 * Handling of invalid encodings. I generally stop with an error; you
   wrap the foreign byte and pass it through to the output as an opaque
   object.

 * Case-insensitive comparison. I follow POSIX and map lower case to
   upper case equivalents where available; you use a case-insensitive
   collator.

 * Surrogate pairs. I trust wchar_t to be a sufficient character type;
   you have a special case for UTF-16 systems.

It is true that I should pay more attention to character widths in
expand, unexpand, fold, fmt, and pr. In particular I should make sure
that zero-width characters are treated as zero-width and that they stay
attached to the previous character so that combining accents will work.
I don't think any more character width awareness than that is portable
between displays.

If wrapping foreign bytes is a requirement, I could do that, although it
seems like unnecessary complexity when LC_ALL=C is available for binary
files and other implementations get away with reporting errors when
given files with invalid encodings.

I don't think there is a good solution to case folding. On systems like
glibc that have working collation, sort will already fold case whether
or not you ask for it. On systems like MacOS that have broken collation,
there is no collator to resort to when case mapping isn't sufficient.

I don't think there is a good solution to the surrogate pair problem
either. On systems where wide characters are only 16 bits, the wctype
functions will be wrong on characters beyond that limit, so there's only
so much the tools can do.

Have I missed or misrepresented anything important?

Thanks!

Eric
