Pádraig Brady <[EMAIL PROTECTED]> wrote:
> Jan Engelhardt wrote:
>>
>> https://bugzilla.novell.com/show_bug.cgi?id=381873
>>
>> Forwarding this because it is a GNU issue, not specifically a Novell one.
>> I reproduced this myself with the latest coreutils from git
>> (BTW: You might want to repack that repo, "counting objects" during the
>> clone was rather slow in the initial counting.)
>>
>> Could it be a libiconv problem?
>
> Accounting for multibyte characters is what's taking the time:
>
> ~/git/coreutils/src$ time ./wc -m long_lines.txt
> 13357046 long_lines.txt
> real 0m1.860s
>
> ~/git/coreutils/src$ time ./wc -c long_lines.txt
> 13538735 long_lines.txt
> real 0m0.002s
>
> Now that is a _lot_ of extra time. libiconv could probably be
> made more efficient. I've never actually looked at it.
> However wc calls mbrtowc() for each multibyte character.
> It would probably be a lot more efficient to use mbstowcs()
> to convert the whole read buffer.
>
> Note mbstowcs doesn't handle embedded NULs so one would
> need to find these first, and iterate over each substring,
> as I did in my version of uniq previously mentioned.
>
> Also mbstowcs doesn't canonicalize equivalent multibyte sequences,
> and therefore functions the same in this regard as our
> processing of each wide character separately.
> This could actually be considered a bug, i.e. should -m give
> the number of wide chars, or the number of multibyte chars?
> With the attached patch, `wc -m` gives 23 chars for both these lines.
>
> canonically équivalent
> canonically équivalent
>
> Pádraig.
>
> p.s. I notice that gnome-terminal still doesn't handle
> combining characters correctly, and my mail client thunderbird
> is putting the accent on the q rather than the e, sigh.
> diff --git a/src/wc.c b/src/wc.c
> index 61ab485..f7f7109 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -368,6 +368,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                      linepos += width;
>                    if (iswspace (wide_char))
>                      goto mb_word_separator;
> +                  else if (width == 0)
> +                    chars--; /* don't count combining chars */
>                    in_word = true;
>                  }
>                break;
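
To make the quoted idea easier to follow outside of wc.c, here is a minimal standalone sketch of per-character counting with mbrtowc(), with an option to leave out characters whose wcwidth() is 0, as the patch does for combining characters. It is not the coreutils code: the function name count_chars and its skip_zero_width parameter are hypothetical, and skipping invalid bytes without counting them is a choice made for this sketch.

```c
#define _XOPEN_SOURCE 700       /* for wcwidth() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of coreutils: count the characters in
   buf[0..len), decoding with mbrtowc() in the current LC_CTYPE locale.
   If skip_zero_width is true, characters for which wcwidth() returns 0
   (e.g. combining accents) are not counted, mirroring the idea of the
   quoted patch.  */
static size_t
count_chars (const char *buf, size_t len, bool skip_zero_width)
{
  assert (buf != NULL || len == 0);

  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t chars = 0;
  const char *p = buf;
  size_t left = len;

  while (left > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, p, left, &state);

      if (n == (size_t) -2)
        break;                  /* incomplete sequence at end of buffer */
      if (n == (size_t) -1)
        {
          /* Invalid byte (assumption of this sketch): skip it without
             counting it, and start again from the initial state.  */
          memset (&state, 0, sizeof state);
          p++;
          left--;
          continue;
        }
      if (n == 0)
        n = 1;                  /* an embedded NUL occupies one byte */

      p += n;
      left -= n;

      /* NUL is excluded from the width test so that it is still counted.  */
      if (skip_zero_width && wc != L'\0' && wcwidth (wc) == 0)
        continue;
      chars++;
    }
  return chars;
}
```

A caller would first select a multibyte locale, for example with setlocale (LC_ALL, ""), and then pass each read buffer to count_chars. In a UTF-8 locale where wcwidth() reports 0 for the combining acute accent, the two "canonically équivalent" lines above should then produce the same count when skip_zero_width is true. Like the loop in wc, this still decodes one character per call, so it does nothing for the speed problem; it only illustrates the counting rule.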

[thanks Jan, for forwarding that]

Hi Pádraig,

Thanks for investigating that.
That does look like an improvement.
Do you feel like adding a test case in tests/misc/wc?
However, it'll be a little tricky, because you'll need to include
the new test only if there is sufficient multi-byte support and
if you can find a suitable locale to test with.
To set the locale for that one test, put a hashref like
{ENV=>"LC_CTYPE=$locale"} in the test array-ref, where you
detected earlier that $locale is available.
For related examples, run this in your git-cloned coreutils directory:

    git grep 'ENV *=>'

Even if you don't have time to write the test, please resend your
patch in "git format-patch --stdout HEAD~1" format so I don't have
to worry about mangling the "á" in your name ;-)

As for rendering, I see odd things, too.
Using emacs (built from git yesterday) to view these three lines
where the 1st and 3rd are identical:

    canonically équivalent
    canonically équivalent
    canonically équivalent

I get results that depend on the font.
(this is with fonts from debian unstable)
Invoking it to use a nice, anti-aliased font,

    emacs -fn 'Dejavu Sans Mono-18'

it looks pretty good, but the combining accent could be a little
higher above the "e", rather than touching it.
<<attachment: djsm-18.jpg>>
With any "fixed" variant, the accent is so high above the "e"
that it makes that entire line several pixels higher:

    emacs -fn '-*-fixed-*-*-*-*-16-*-*-*-*-*-*-*'
<<attachment: fixed-16.jpg>>
_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils