Bo Borgerson wrote:
> Jim Meyering wrote:
>> Bo Borgerson <[EMAIL PROTECTED]> wrote:
>>> I may be misinterpreting your patch, but it seems to me that
>>> decrementing count for zero-width characters could potentially lead to
>>> confusion. Not all zero-width characters are combining characters, right?
>> It looks ok to me, since there's an unconditional increment
>>
>>   chars++;
>>
>> about 25 lines above, so the decrement would just undo that.
>
>
> Right, I guess my question is more about the semantics of `wc -m'.
> Should stand-alone zero-width characters such as the zero-width space be
> counted?
>
> The attached (UTF-8) file contains 3 characters according to HEAD, but
> only two with the patch.

Interesting, I thought of that myself, but assumed
iswspace(u"zero-width space") == 1.

Actually there are no chars where:
  wcwidth(char)==0 && iswspace(char)==1

In the first 65535 code points there are also 404 chars
which are not classed as combining in the Unicode database,
but are classed as zero width in the glibc locale data at least
(zero-width space being one of them, as you mentioned).
I determined this with the attached progs:
  ./zw | python unidata.py | grep " 0 " | wc -l

So I suggest that we don't merge my tweak as is.
What we could do is:
  1. Find a method to distinguish the above 404 characters at least.
  2. Define -m to mean "individual displayable characters"
     if this is what people usually want.
  3. Add a new option for this.

Pádraig.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdio_ext.h>
#include <stdlib.h>
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>
#include <string.h>
#include <locale.h>

int main(int argc, char** argv)
{
    /* This is a single threaded app, so mark as such for performance. */
    __fsetlocking(stdin, FSETLOCKING_BYCALLER);
    __fsetlocking(stdout, FSETLOCKING_BYCALLER);

    if (!setlocale(LC_CTYPE, "")) { //TODO: What about LC_COLLATE?
        fprintf(stderr, "Warning locale not supported by glibc, using 'C' locale\n");
    }

    /* Print, in hex, every code point up to 0xFFFF whose width is 0.
       Assumes wchar_t is wider than 16 bits (as on glibc);
       otherwise the loop condition would always hold. */
    wchar_t wc;
    for (wc = 0; wc <= 0xFFFF; wc++) {
        if (!wcwidth(wc)) {
            printf("%04X\n", (unsigned int) wc);
        }
    }
    return 0;
}
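
import sys
import unicodedata

# Read hex code points (one per line) and print, for each one:
#   <code point> <1 if its combining class is non-zero, else 0> <name>
for line in sys.stdin:
    code = line.strip()
    c = chr(int(code, 16))
    try:
        print(code, int(unicodedata.combining(c) != 0), unicodedata.name(c))
    except ValueError:
        # No name in the Unicode database for this code point.
        print()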
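
To make the question about `wc -m' semantics concrete, here is a minimal
sketch in Python of two possible counting rules. The function names and
the sample string are made up for illustration; this is not the coreutils
code, and it uses only the Unicode database, not glibc's wcwidth(). The
"combining" test is the same one unidata.py applies (a non-zero canonical
combining class).

```python
import unicodedata


def count_code_points(text):
    """Count every code point, zero width or not."""
    return len(text)


def count_non_combining(text):
    """Count only code points whose canonical combining class is 0."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Hypothetical sample: 'a', ZERO WIDTH SPACE, 'e', COMBINING ACUTE ACCENT
sample = "a\u200be\u0301"

for ch in sample:
    print("U+%04X" % ord(ch),
          unicodedata.category(ch),
          unicodedata.combining(ch),
          unicodedata.name(ch))

print(count_code_points(sample))    # 4
print(count_non_combining(sample))  # 3
```

The zero-width space has combining class 0, so a rule based on the
combining class still counts it. A rule based purely on width would
presumably skip it along with the accent, which is the behaviour being
questioned above.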