Re: horrible utf-8 performace in wc

2008-06-06 Thread Eric Blake
Bruno Haible clisp.org> writes: > > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html > > But before these techniques can be used in practice in packages such as > coreutils, two problems would have to be solved satisfactorily: > > 1) "George Pollard makes the assumption that

Re: horrible utf-8 performace in wc

2008-06-06 Thread Bruno Haible
Pádraig Brady wrote: > There have been some interesting "counting UTF-8 strings" threads > over at reddit lately, all referenced from this article: > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html But before these techniques can be used in practice in packages such as coreutils

Re: horrible utf-8 performace in wc

2008-06-05 Thread Jim Meyering
Pádraig Brady <[EMAIL PROTECTED]> wrote: > There have been some interesting "counting UTF-8 strings" threads > over at reddit lately, all referenced from this article: > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html Thanks for the pointer! Interesting, indeed.

Re: horrible utf-8 performace in wc

2008-06-05 Thread Pádraig Brady
There have been some interesting "counting UTF-8 strings" threads over at reddit lately, all referenced from this article: http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html Pádraig.
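For readers without the article to hand, the baseline that "counting UTF-8 strings" discussions usually start from can be sketched as below. This is only an illustration, not code from the article or from coreutils: the function name utf8_count is hypothetical, the input is assumed to be valid UTF-8, and no locale is consulted. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), since each such byte starts one code point.

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a buffer that is ASSUMED
   to contain valid UTF-8.  Invalid input is not detected. */
size_t utf8_count(const char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char) buf[i];

        /* Continuation bytes look like 10xxxxxx; every other byte
           begins a new code point. */
        if ((b & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

With this sketch, a precomposed "é" (bytes C3 A9) counts as one, while "e" followed by a combining acute accent (bytes 65 CC 81) counts as two, because code points are counted and nothing is canonicalized.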

Re: horrible utf-8 performace in wc

2008-05-08 Thread Jim Meyering
Bruno Haible <[EMAIL PROTECTED]> wrote: > 2008-05-08 Bruno Haible <[EMAIL PROTECTED]> > > Speed up "wc -m" and "wc -w" in multibyte case. > * src/wc.c: Include mbchar.h. > (wc): New variable in_shift. Use it to avoid calling mbrtowc for most > ASCII characters. Thanks! I'
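The idea in the ChangeLog entry above (skip mbrtowc for most ASCII characters) might look roughly like the following. This is a sketch under stated assumptions, not the committed patch: the function name count_chars is hypothetical, mbsinit() stands in for the patch's in_shift variable, and the fast path assumes an ASCII-compatible encoding such as UTF-8, where a byte below 0x80 in the initial conversion state is always a complete character.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count characters in buf using the current
   locale's encoding, calling mbrtowc() only when needed. */
size_t count_chars(const char *buf, size_t len)
{
    mbstate_t state;
    size_t count = 0;
    size_t i = 0;

    memset(&state, 0, sizeof state);
    while (i < len) {
        unsigned char b = (unsigned char) buf[i];

        /* Fast path (assumption: ASCII-compatible encoding): an ASCII
           byte with no multibyte sequence pending is one character. */
        if (b < 0x80 && mbsinit(&state)) {
            count++;
            i++;
            continue;
        }

        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + i, len - i, &state);

        if (n == (size_t) -2)
            break;                  /* incomplete sequence at end of buffer */
        if (n == (size_t) -1) {
            /* Invalid byte: skip it without counting, reset the state. */
            memset(&state, 0, sizeof state);
            i++;
            continue;
        }
        if (n == 0)
            n = 1;                  /* a null wide character consumed one byte */
        count++;
        i += n;
    }
    return count;
}
```

A caller would first select a locale, for example with setlocale(LC_ALL, ""), and then pass each buffer it reads to count_chars(). A real implementation would also have to carry the conversion state across buffer boundaries rather than stopping at an incomplete sequence.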

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> $ time ./wc -m long_lines.txt > 13357046 long_lines.txt > real 0m1.860s It processes at the speed of 7 million characters per second. I would not call this a "horrible performance". > However wc calls mbrtowc() for each multibyte character. Yes. One could use mbstowcs (or mbsnrtowcs, but th

Re: horrible utf-8 performace in wc

2008-05-08 Thread Pádraig Brady
Bruno Haible wrote: > As a consequence: > - The number of characters is the same as the number of wide characters. > - "wc -m" must output the number of characters. > - In a Unicode locale, is one character, and is > two characters, Fair enough. > If you want wc to count characters af

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Bruno Haible wrote: > If you want wc to count characters after canonicalization, then you can > invent a new wc command-line option for it. But I would find it more useful > to have a filter program that reads from standard input and writes the > canonicalized output to standard output; that would

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus) > linepos += width; > if (iswspace (wide_char)) > goto mb_word_separator; > + else if (uc_combining_class (wide_ch

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> Is there a good library for combining-character canonicalization > available? That seems like something that would be useful to have in a > lot of text-processing tools. Also, for Unicode, something to shuffle > between the normalization forms might be helpful for comparisons. Such functionali

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
Pádraig Brady wrote: > mbstowcs doesn't canonicalize equivalent multibyte sequences, > and so therefore functions the same in this regard as our > processing of each wide character separately. > This could be considered a bug actually- i.e. should -m give > the number of wide chars, or the number o

Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Pádraig Brady wrote: > Bo Borgerson wrote: >> I poked around a little in gnulib and found a function for determining >> the combining class of a Unicode character. >> >> I think the attached patch does what you were intending to do, and it >> also counts all of the stand-alone zero-width characters

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Bo Borgerson wrote: > Pádraig Brady wrote: >> In the first 65535 code points there are also 404 chars which are >> not classed as combining in the unicode database, but are classed >> as zero width in the glibc locale data at least (zero-width space >> being one of them like you mentioned). I deter

Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Pádraig Brady wrote: > In the first 65535 code points there are also 404 chars which are > not classed as combining in the unicode database, but are classed > as zero width in the glibc locale data at least (zero-width space > being one of them like you mentioned). I determined this with the > atta

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Bo Borgerson wrote: > Jim Meyering wrote: >> Bo Borgerson <[EMAIL PROTECTED]> wrote: >>> I may be misinterpreting your patch, but it seems to me that >>> decrementing count for zero-width characters could potentially lead to >>> confusion. Not all zero-width characters are combining characters, ri

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Bo Borgerson wrote: > Pádraig Brady wrote: >> canonically équivalent >> canonically équivalent >> >> Pádraig. >> >> p.s. I Notice that gnome-terminal still doesn't handle >> combining characters correctly, and my mail client thunderbird >> is putting the accent on the q rather than the e, sigh. >

Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Jim Meyering wrote: > Bo Borgerson <[EMAIL PROTECTED]> wrote: >> I may be misinterpreting your patch, but it seems to me that >> decrementing count for zero-width characters could potentially lead to >> confusion. Not all zero-width characters are combining characters, right? > > It looks ok to m

Re: horrible utf-8 performace in wc

2008-05-07 Thread Jim Meyering
Bo Borgerson <[EMAIL PROTECTED]> wrote: > I may be misinterpreting your patch, but it seems to me that > decrementing count for zero-width characters could potentially lead to > confusion. Not all zero-width characters are combining characters, right? It looks ok to me, since there's an unconditi

Re: horrible utf-8 performace in wc

2008-05-07 Thread Jim Meyering
Pádraig Brady <[EMAIL PROTECTED]> wrote: > Jan Engelhardt wrote: >> >> https://bugzilla.novell.com/show_bug.cgi?id=381873 >> >> Forwarding this because it is a GNU issue, not specifically a Novell one. >> I reproduced this myself with the latest coreutils from git >> (BTW: You might want to repack

Re: horrible utf-8 performace in wc

2008-05-07 Thread Jan Engelhardt
On Wednesday 2008-05-07 13:11, Pádraig Brady wrote: > >Now that is a _lot_ of extra time. libiconv could probably be >made more efficient. I've never actually looked at it. >However wc calls mbrtowc() for each multibyte character. >It would probably be a lot more efficient to use mbstowcs() >to co

Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Pádraig Brady wrote: > canonically équivalent > canonically équivalent > > Pádraig. > > p.s. I Notice that gnome-terminal still doesn't handle > combining characters correctly, and my mail client thunderbird > is putting the accent on the q rather than the e, sigh. They both render correctly he

Re: horrible utf-8 performace in wc

2008-05-07 Thread Pádraig Brady
Jan Engelhardt wrote: > > https://bugzilla.novell.com/show_bug.cgi?id=381873 > > Forwarding this because it is a GNU issue, not specifically a Novell one. > I reproduced this myself with the latest coreutils from git > (BTW: You might want to repack that repo, "counting objects" during the > clon