coreutils wc count multi bytes question

Neo Anderson Fri, 06 Feb 2009 15:19:18 -0800

Hi

I'm not sure whether this is the right place to ask, but after searching the 
mailing lists at http://www.debian.org/MailingLists/subscribe I couldn't find 
a better one for my question, so I'm asking here.


My question: can wc count multibyte characters, such as Chinese text encoded 
in Big5 or UTF-8? If not, maybe I can help modify the source so that it counts 
such words directly.

Env: kernel 2.6.27.8 / wc 6.10 / gcc version 4.3.2 / Debian lenny / LANG 
en_US.UTF-8

I have a file, named e.g. abc, which contains both Chinese and English 
characters. Its content is shown below (I'm not sure whether it will display 
correctly on the mailing list):

this is a 文件 vi 打的

Counting by hand, with each Chinese character taken as one word, gives 8 
words, but wc -w prints 6. It seems wc splits the input into tokens on white 
space, so each run of adjacent Chinese characters is treated as a single word, 
which gives a total of 6.
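
To make the difference concrete, here is a minimal Python sketch of the two 
counting rules. It is only an illustration, not the wc source: the function 
names and the is_cjk helper are hypothetical, and the helper only covers the 
CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption.

```python
def whitespace_words(text):
    """Count tokens separated by white space (the rule wc -w seems to use)."""
    return len(text.split())


def is_cjk(ch):
    """Hypothetical helper: true only for U+4E00..U+9FFF (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_aware_words(text):
    """Count each CJK ideograph as one word; other runs count once per run."""
    count = 0
    in_word = False  # True while inside a run of non-space, non-CJK characters
    for ch in text:
        if is_cjk(ch):
            count += 1
            in_word = False
        elif ch.isspace():
            in_word = False
        elif not in_word:
            count += 1
            in_word = True
    return count


line = "this is a 文件 vi 打的"
print(whitespace_words(line))  # 6
print(cjk_aware_words(line))   # 8
```

The first function reproduces the 6 from wc -w, the second the 8 from the 
manual count.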

I checked the source, and it seems it does not test whether the input 
characters are multibyte (e.g. via wchar_t). So basically I just want to check 
whether this has already been done.
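
As a rough sketch of why multibyte handling matters, the Python snippet below 
compares the byte count and the character count of the sample line in UTF-8 
and in Big5. It says nothing about how wc itself is implemented; the helper 
name byte_and_char_counts is made up for this example.

```python
def byte_and_char_counts(text, encoding):
    """Return (bytes, characters) for text stored in the given encoding."""
    raw = text.encode(encoding)      # bytes as they would sit in the file
    decoded = raw.decode(encoding)   # characters after multibyte decoding
    return len(raw), len(decoded)


line = "this is a 文件 vi 打的"
for enc in ("utf-8", "big5"):
    n_bytes, n_chars = byte_and_char_counts(line, enc)
    print(enc, n_bytes, n_chars)
# utf-8 26 18
# big5 22 18
```

The character count is the same in both encodings, while the byte count 
depends on how many bytes each Chinese character takes (3 in UTF-8, 2 in 
Big5). This is the same distinction as between wc -c (bytes) and wc -m 
(characters).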

Thanks for your help,








