Re: UTF-8 support for wc(1)

Todd C. Miller Thu, 03 Dec 2015 10:49:39 -0800

On Sun, 29 Nov 2015 17:45:55 +0100, Ingo Schwarze wrote:

> our wc(1) utility currently violates POSIX in two ways:
> 
>  1. The -m option counts bytes instead of characters.
>     The patch given below fixes that.
> 
>  2. Word counting with -w only treats ASCII whitespace as word
>     boundaries and regards two words joined by non-ASCII whitespace
>     as one single word.
> 
> The second issue is not related to UTF-8, but a matter of full
> Unicode support.  It would not be hard to fix that by using
> mbtowc(3) and iswblank(3) instead of mblen(3).  However, i don't
> think we want to pollute our base system tools with functions
> requiring full Unicode support, not even to the extent available
> in our own C library.  So i consider iswblank(3) taboo for now.


I'm a little surprised by this.  It doesn't seem like it would be
any more complicated to use mbtowc(3) and iswblank(3) for the
multibyte case.
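
A minimal sketch of what the multibyte word-counting loop could look
like, for illustration only.  count_words() is a hypothetical helper,
not code from wc.c or from the patch under discussion.  It assumes the
caller has already called setlocale(LC_CTYPE, ""), and it uses
iswspace(3) rather than iswblank(3) so that newlines also end a word.
Invalid bytes are simply treated as non-space characters.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not the actual wc(1) code): count the words
 * in buf[0..len), decoding with mbtowc(3) in the current locale.
 */
size_t
count_words(const char *buf, size_t len)
{
	size_t words = 0;
	int inword = 0;

	/* Reset any shift state kept by mbtowc(3). */
	(void)mbtowc(NULL, NULL, 0);

	while (len > 0) {
		wchar_t wc;
		int n = mbtowc(&wc, buf, len);

		if (n == -1) {
			/* Invalid or incomplete sequence: skip one byte
			 * and count it as a non-space character. */
			(void)mbtowc(NULL, NULL, 0);
			wc = L'?';
			n = 1;
		} else if (n == 0) {
			/* Embedded NUL: one byte wide, not whitespace. */
			n = 1;
		}

		if (iswspace((wint_t)wc)) {
			inword = 0;
		} else if (!inword) {
			/* First character of a new word. */
			inword = 1;
			words++;
		}
		buf += n;
		len -= (size_t)n;
	}
	return words;
}
```

The single-byte case falls out of the same loop, since mbtowc(3)
returns 1 for every valid byte in a single-byte locale.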

If you want to revisit this later when we have better Unicode support
I suppose that is OK too.

 - todd
