Re: horrible utf-8 performace in wc

Pádraig Brady Thu, 08 May 2008 06:42:04 -0700

Bruno Haible wrote:
> As a consequence:
>   - The number of characters is the same as the number of wide characters.
>   - "wc -m" must output the number of characters.
>   - In a Unicode locale, <U00E9> is one character, and <U0065><U0301> is
>     two characters,


Fair enough.
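
To make the distinction concrete, here is a minimal sketch in Python (my choice for illustration; nothing in this thread uses it). It assumes that a Python str is a sequence of Unicode code points, which stands in for the "wide characters" discussed above. The helper names count_chars and count_bytes are hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # <U00E9>: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # <U0065><U0301>: 'e' followed by COMBINING ACUTE ACCENT


def count_chars(text: str) -> int:
    """Count code points (hypothetical helper, analogous to a per-character count)."""
    return len(text)


def count_bytes(text: str) -> int:
    """Count bytes of the UTF-8 encoding (hypothetical helper, analogous to a byte count)."""
    return len(text.encode("utf-8"))


print(count_chars(precomposed), count_bytes(precomposed))  # 1 2
print(count_chars(decomposed), count_bytes(decomposed))    # 2 3

# Both spellings render the same, but only compare equal after canonicalization.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So the two spellings differ in both character count and byte count until they are normalized to the same form.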

> If you want wc to count characters after canonicalization, then you can
> invent a new wc command-line option for it.

I guess one could possibly have --chars={unicode,glyph,grapheme,column},
with unicode being the default, matching how wc currently works.
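
As a rough sketch of how such modes might differ (Python again, purely illustrative): the function count_chars_mode and its mode names are hypothetical and only mirror the option values floated above. The "grapheme" mode here is a crude approximation that skips combining marks rather than implementing full Unicode grapheme cluster segmentation, the "column" mode ignores control characters and other edge cases, and "glyph" is left out because it depends on the font.

```python
import unicodedata


def count_chars_mode(text: str, mode: str = "unicode") -> int:
    """Count 'characters' under a few illustrative definitions (hypothetical)."""
    if mode == "unicode":
        # One per code point.
        return len(text)
    if mode == "grapheme":
        # Approximation: do not count combining marks (combining class != 0).
        return sum(1 for ch in text if unicodedata.combining(ch) == 0)
    if mode == "column":
        # Approximation: combining marks take 0 columns, East Asian
        # Wide/Fullwidth characters take 2, everything else takes 1.
        columns = 0
        for ch in text:
            if unicodedata.combining(ch) != 0:
                continue
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            columns += 2 if wide else 1
        return columns
    raise ValueError(f"unsupported mode: {mode}")


sample = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT
print(count_chars_mode(sample, "unicode"))   # 2
print(count_chars_mode(sample, "grapheme"))  # 1
print(count_chars_mode(sample, "column"))    # 1
```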

> But I would find it more useful
> to have a filter program that reads from standard input and writes the
> canonicalized output to standard output; that would be applicable in many
> more situations.

That would be _very_ useful, yes.
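
A minimal sketch of what such a filter could look like, in Python for illustration only. It assumes UTF-8 text and NFC as the canonical form; the function name filter_stream is hypothetical. Working line by line is safe here because a newline never combines with neighbouring characters under normalization.

```python
import io
import unicodedata


def filter_stream(src, dst, form: str = "NFC") -> None:
    """Copy text from src to dst, normalizing each line to the given form.

    src and dst are text streams; for a real filter they would be
    sys.stdin and sys.stdout (assumed to be opened as UTF-8).
    """
    for line in src:
        dst.write(unicodedata.normalize(form, line))


# Small in-memory demonstration instead of real standard streams.
source = io.StringIO("caf" + "e\u0301" + "\n")
sink = io.StringIO()
filter_stream(source, sink)
print(len(source.getvalue()), len(sink.getvalue()))  # 6 5
```

Wrapped in a script that calls filter_stream(sys.stdin, sys.stdout), the output could then be piped into "wc -m" to count characters after canonicalization.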

thanks for all the great info in this thread,
Pádraig.



_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
