Hi, I'm not sure whether this is the right place to ask, but after searching the mailing lists at http://www.debian.org/MailingLists/subscribe I couldn't find a better one for my question, so I'm asking it here.
My question: can wc count multibyte characters, such as Big5 or UTF-8 Chinese? If not, maybe I can help modify the source so that it counts such words directly.

Env: kernel 2.6.27.8 / wc 6.10 / gcc version 4.3.2 / Debian lenny / LANG en_US.UTF-8

I have a file, named e.g. abc, which contains Chinese and English characters. It may display as below (I'm not sure whether it will show correctly on the mailing list):

this is a ๆไปถ vi ๆ็

Counting by hand, with each Chinese character counted as one word, gives 8 words, but the output of wc -w is 6. It seems that wc splits the input into tokens on white space, so Chinese characters written together without spaces are treated as a single word, and the total word count comes out as 6.

I checked the source, and it does not seem to check whether the input characters are multibyte (e.g. by using wchar_t). So basically I just want to check whether this has been done already.

Thanks for your help,
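To make the counting rule concrete, here is a minimal sketch in Python of the behaviour being asked about. It is not wc's actual code, only an illustration under stated assumptions: the function names are hypothetical, the sample string is a stand-in rather than the real contents of the file, and "Chinese character" is approximated by two Unicode ranges (the main CJK Unified Ideographs block and Extension A).

```python
def is_cjk_ideograph(ch):
    """Rough check (an assumption of this sketch): main CJK block or Extension A."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def count_whitespace_tokens(text):
    """Count tokens separated by white space, the behaviour described for wc -w."""
    return len(text.split())


def count_words_cjk_aware(text):
    """Count each ideograph as one word; other runs of non-space text count once."""
    count = 0
    in_word = False  # True while inside a run of non-CJK, non-space characters
    for ch in text:
        if is_cjk_ideograph(ch):
            count += 1       # every ideograph is a word on its own
            in_word = False  # and it ends any run that preceded it
        elif ch.isspace():
            in_word = False
        elif not in_word:
            count += 1       # first character of a new non-CJK word
            in_word = True
    return count


def count_words_in_bytes(data, encoding="utf-8"):
    """Decode the raw bytes first (e.g. "utf-8" or "big5"), then count."""
    return count_words_cjk_aware(data.decode(encoding))


# Stand-in sample with the same shape as the example line (hypothetical text).
sample = "this is a 文件 vi 文章"
print(count_whitespace_tokens(sample))                     # white-space tokens
print(count_words_cjk_aware(sample))                       # CJK-aware count
print(count_words_in_bytes(sample.encode("big5"), "big5"))  # same, from Big5 bytes
```

The key point is that the bytes are decoded into characters before any counting happens, so the same rule applies whether the file is Big5 or UTF-8. A C version would presumably do the equivalent with wide characters, but that is not shown here.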