Bruno Haible writes:
> > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
>
> But before these techniques can be used in practice in packages such as
> coreutils, two problems would have to be solved satisfactorily:
>
> 1) "George Pollard makes the assumption that
Pádraig Brady wrote:
> There have been some interesting "counting UTF-8 strings" threads
> over at reddit lately, all referenced from this article:
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
But before these techniques can be used in practice in packages such as
coreutils
Pádraig Brady <[EMAIL PROTECTED]> wrote:
> There have been some interesting "counting UTF-8 strings" threads
> over at reddit lately, all referenced from this article:
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
Thanks for the pointer!
Interesting, indeed.
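
For readers without the article at hand, here is a minimal sketch of the byte-wise baseline that "counting UTF-8 strings" generally starts from: count every byte that is not a continuation byte (10xxxxxx). The function name utf8_count is made up for illustration, and the code assumes the input is already valid UTF-8; it does no validation and knows nothing about locales. As far as I understand it, the speedups discussed in the article build on this idea by handling several bytes per step, which is not shown here.

```c
#include <stddef.h>

/* Count UTF-8 code points in the n bytes starting at s.
   Every code point has exactly one byte that is NOT a continuation
   byte (continuation bytes look like 10xxxxxx), so counting those
   bytes counts code points.
   Assumption: the input is valid UTF-8.  Nothing is validated.  */
size_t
utf8_count (const char *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char c = (unsigned char) s[i];
      if ((c & 0xC0) != 0x80)   /* lead byte or ASCII byte */
        count++;
    }
  return count;
}
```

Note that this counts code points, not user-perceived characters: a precomposed "é" (bytes C3 A9) counts as one, while "e" followed by a combining acute accent (bytes 65 CC 81) counts as two.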
__
There have been some interesting "counting UTF-8 strings" threads
over at reddit lately, all referenced from this article:
http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
Pádraig.
___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
Bruno Haible <[EMAIL PROTECTED]> wrote:
> 2008-05-08 Bruno Haible <[EMAIL PROTECTED]>
>
> Speed up "wc -m" and "wc -w" in multibyte case.
> * src/wc.c: Include mbchar.h.
> (wc): New variable in_shift. Use it to avoid calling mbrtowc for most
> ASCII characters.
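
To illustrate the idea in the ChangeLog entry above, here is a rough sketch of an ASCII fast path in front of mbrtowc. It is not the actual wc.c patch: the real change uses mbchar.h and an in_shift variable, whereas this sketch swaps in a plain "byte < 0x80" test plus mbsinit() to check for the initial shift state. The function name count_chars is made up, and the fast path assumes an ASCII-compatible encoding such as UTF-8, where a byte below 0x80 in the initial state always stands for itself.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in buf[0..n) according to the current locale.
   Returns (size_t) -1 on an invalid or incomplete multibyte sequence.
   Assumption: in the initial shift state, a byte below 0x80 is a
   complete character by itself (true for UTF-8).  */
size_t
count_chars (const char *buf, size_t n)
{
  mbstate_t state;
  size_t count = 0;
  size_t i = 0;

  memset (&state, 0, sizeof state);
  while (i < n)
    {
      unsigned char c = (unsigned char) buf[i];

      /* Fast path: skip the mbrtowc call for ASCII bytes.  */
      if (c < 0x80 && mbsinit (&state))
        {
          i++;
          count++;
          continue;
        }

      /* Slow path: let the C library decode one character.  */
      wchar_t wc;
      size_t len = mbrtowc (&wc, buf + i, n - i, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        return (size_t) -1;
      if (len == 0)
        len = 1;                /* a NUL character occupies one byte */
      i += len;
      count++;
    }
  return count;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that mbrtowc decodes according to the user's locale; without that call the program stays in the "C" locale.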
Thanks!
I'
> $ time ./wc -m long_lines.txt
> 13357046 long_lines.txt
> real 0m1.860s
It processes at the speed of 7 million characters per second. I would not call
this a "horrible performance".
> However wc calls mbrtowc() for each multibyte character.
Yes. One could use mbstowcs (or mbsnrtowcs, but th
Bruno Haible wrote:
> As a consequence:
> - The number of characters is the same as the number of wide characters.
> - "wc -m" must output the number of characters.
> - In a Unicode locale, is one character, and is
> two characters,
Fair enough.
> If you want wc to count characters af
Bruno Haible wrote:
> If you want wc to count characters after canonicalization, then you can
> invent a new wc command-line option for it. But I would find it more useful
> to have a filter program that reads from standard input and writes the
> canonicalized output to standard output; that would
> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
> linepos += width;
> if (iswspace (wide_char))
> goto mb_word_separator;
> + else if (uc_combining_class (wide_ch
> Is there a good library for combining-character canonicalization
> available? That seems like something that would be useful to have in a
> lot of text-processing tools. Also, for Unicode, something to shuffle
> between the normalization forms might be helpful for comparisons.
Such functionali
Pádraig Brady wrote:
> mbstowcs doesn't canonicalize equivalent multibyte sequences,
> and so therefore functions the same in this regard as our
> processing of each wide character separately.
> This could be considered a bug actually- i.e. should -m give
> the number of wide chars, or the number o
Pádraig Brady wrote:
> Bo Borgerson wrote:
>> I poked around a little in gnulib and found a function for determining
>> the combining class of a Unicode character.
>>
>> I think the attached patch does what you were intending to do, and it
>> also counts all of the stand-alone zero-width characters
Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I deter
Pádraig Brady wrote:
> In the first 65535 code points there are also 404 chars which are
> not classed as combining in the unicode database, but are classed
> as zero width in the glibc locale data at least (zero-width space
> being one of them like you mentioned). I determined this with the
> atta
Bo Borgerson wrote:
> Jim Meyering wrote:
>> Bo Borgerson <[EMAIL PROTECTED]> wrote:
>>> I may be misinterpreting your patch, but it seems to me that
>>> decrementing count for zero-width characters could potentially lead to
>>> confusion. Not all zero-width characters are combining characters, ri
Bo Borgerson wrote:
> Pádraig Brady wrote:
>> canonically équivalent
>> canonically équivalent
>>
>> Pádraig.
>>
>> p.s. I notice that gnome-terminal still doesn't handle
>> combining characters correctly, and my mail client thunderbird
>> is putting the accent on the q rather than the e, sigh.
>
Jim Meyering wrote:
> Bo Borgerson <[EMAIL PROTECTED]> wrote:
>> I may be misinterpreting your patch, but it seems to me that
>> decrementing count for zero-width characters could potentially lead to
>> confusion. Not all zero-width characters are combining characters, right?
>
> It looks ok to m
Bo Borgerson <[EMAIL PROTECTED]> wrote:
> I may be misinterpreting your patch, but it seems to me that
> decrementing count for zero-width characters could potentially lead to
> confusion. Not all zero-width characters are combining characters, right?
It looks ok to me, since there's an unconditi
Pádraig Brady <[EMAIL PROTECTED]> wrote:
> Jan Engelhardt wrote:
>>
>> https://bugzilla.novell.com/show_bug.cgi?id=381873
>>
>> Forwarding this because it is a GNU issue, not specifically a Novell one.
>> I reproduced this myself with the latest coreutils from git
>> (BTW: You might want to repack
On Wednesday 2008-05-07 13:11, Pádraig Brady wrote:
>
>Now that is a _lot_ of extra time. libiconv could probably be
>made more efficient. I've never actually looked at it.
>However wc calls mbrtowc() for each multibyte character.
>It would probably be a lot more efficient to use mbstowcs()
>to co
Pádraig Brady wrote:
> canonically équivalent
> canonically équivalent
>
> Pádraig.
>
> p.s. I notice that gnome-terminal still doesn't handle
> combining characters correctly, and my mail client thunderbird
> is putting the accent on the q rather than the e, sigh.
They both render correctly he
Jan Engelhardt wrote:
>
> https://bugzilla.novell.com/show_bug.cgi?id=381873
>
> Forwarding this because it is a GNU issue, not specifically a Novell one.
> I reproduced this myself with the latest coreutils from git
> (BTW: You might want to repack that repo, "counting objects" during the
> clon