Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Pádraig Brady wrote:
> Bo Borgerson wrote:
>> I poked around a little in gnulib and found a function for determining
>> the combining class of a Unicode character.
>>
>> I think the attached patch does what you were intending to do, and it
>> also counts all of the stand-alone zero-width characters you found:
> 
> cool, thanks.
> Could you optimize it though and do the following,
> as you've already calculated wcwidth()?
> 
>   if (!width && uc_combining_class(wide_char))
>     chars--;

Nice, good idea.
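
The suggested check could be sketched roughly as follows. This is only an illustration: is_combining_stub is a hypothetical stand-in for gnulib's uc_combining_class that knows nothing beyond the Combining Diacritical Marks block (U+0300..U+036F), and the width values are assumed to be what wcwidth() already returned for each character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for gnulib's uc_combining_class(): it only knows
   the Combining Diacritical Marks block.  U+034F (COMBINING GRAPHEME
   JOINER) sits in that block but has combining class 0.  */
static bool is_combining_stub(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F && cp != 0x034F;
}

/* Decide whether a character should be counted.  WIDTH is assumed to be
   the wcwidth() result for CP; only zero-width characters need the
   combining-class lookup at all.  */
static bool counts_as_char(uint32_t cp, int width)
{
    return !(width == 0 && is_combining_stub(cp));
}

/* Count the characters in CPS[0..N-1], skipping zero-width combining
   marks.  WIDTHS[i] is the precomputed width of CPS[i].  */
size_t count_without_combining(const uint32_t *cps, const int *widths,
                               size_t n)
{
    assert(n == 0 || (cps != NULL && widths != NULL));
    size_t chars = 0;
    for (size_t i = 0; i < n; i++)
        if (counts_as_char(cps[i], widths[i]))
            chars++;
    return chars;
}
```

With "e" followed by U+0301 (widths 1 and 0) this counts one character, while a stand-alone zero-width character such as U+200B is still counted because it is not a combining mark.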

I think I may have worded my previous message in a misleading way.  The
intent of the attached patch was not to be a robust solution to the
problem, but rather a demonstration of the function I noticed in gnulib
in case it might be helpful to you.

You definitely seem to know a whole lot more about what's actually
involved here than I do.  I'm just trying to grease the skids. ;)

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


feature request: error codes for 'rm'

2008-05-08 Thread Danny Rawlins
Hi, I'm quite surprised that 'rm' does not return an error code when
there is no such file. I would like to see at least error code 1 so that
I can use it in a shell script; additional error codes might also be nice.


Regards,
Danny Rawlins

http://crux.nu/Public/DannyRawlins



Re: feature request: error codes for 'rm'

2008-05-08 Thread Danny Rawlins

Danny Rawlins wrote:
> Hi, I'm quite surprised that 'rm' does not return an error code when
> there is no such file. I would like to see at least error code 1 so that
> I can use it in a shell script; additional error codes might also be nice.
>
>
> Regards,
> Danny Rawlins
>
> http://crux.nu/Public/DannyRawlins

Damn it, sorry. I made a mistake and did not check properly. Sorry for
wasting your time.


Regards,
Danny Rawlins



Problem under Linux

2008-05-08 Thread nel natou
Hello.
  I installed Ubuntu on my PC in dual boot with Windows, and I am having
problems with the screen or with the refresh rate.
  What is strange is that it used to affect only Linux; now it happens
under Windows too, and sometimes it lasts a very long time.
  I don't know what to do.
  I installed Debian instead of Ubuntu and it continues.
  What should I do?


Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
Pádraig Brady wrote:
> mbstowcs doesn't canonicalize equivalent multibyte sequences,
> and so therefore functions the same in this regard as our
> processing of each wide character separately.
> This could be considered a bug actually- i.e. should -m give
> the number of wide chars, or the number of multibyte chars?
> With the attached patch, `wc -m` gives 23 chars for both these lines.

The behaviour of "wc -m" is specified by POSIX [1] to output the "number
of characters". And:
  LC_CTYPE
      Determine the locale for the interpretation of sequences of bytes of
      text data as characters (for example, single-byte as opposed to
      multi-byte characters in arguments and input files) and which
      characters are defined as white space characters.

The definition of "Character" in [2] means a multibyte-character. IMO it
cannot be interpreted to mean a glyph, or a grapheme cluster, or a screen
column. Rather, it is the unit that is processed by a call to mbtowc [3] or
mbrtowc [4].

As a consequence:
  - The number of characters is the same as the number of wide characters.
  - "wc -m" must output the number of characters.
  - In a Unicode locale, a precomposed character is one character, and the
    equivalent base character followed by a combining character is two
    characters,
      * even if they are canonically equivalent (because POSIX does not make
        reference to this concept), and
      * even if they render the same on the screen (because except for Curses,
        POSIX does not refer to the rendering of characters).
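
To illustrate the first consequence, here is a minimal sketch that counts the characters of a NUL-terminated string in the current locale by asking mbstowcs how many wide characters it would produce. Passing a null destination is a POSIX feature rather than plain ISO C, and the function name count_chars is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Return the number of characters in S according to the LC_CTYPE
   category of the current locale, or (size_t) -1 if S contains an
   invalid multibyte sequence.  With a null destination, POSIX mbstowcs
   only reports how many wide characters the conversion would yield.  */
size_t count_chars(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

After setlocale (LC_ALL, "") in a UTF-8 locale, this should return 1 for a precomposed accented letter and 2 for the base letter followed by a combining accent, since each encoded code point is one multibyte character.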

If you want wc to count characters after canonicalization, then you can
invent a new wc command-line option for it. But I would find it more useful
to have a filter program that reads from standard input and writes the
canonicalized output to standard output; that would be applicable in many
more situations.

Bruno

[1] http://www.opengroup.org/susv3/utilities/wc.html
[2] http://www.opengroup.org/susv3/basedefs/xbd_chap03.html
[3] http://www.opengroup.org/susv3/functions/mbtowc.html
[4] http://www.opengroup.org/susv3/functions/mbrtowc.html




Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> Is there a good library for combining-character canonicalization
> available?  That seems like something that would be useful to have in a
> lot of text-processing tools.  Also, for Unicode, something to shuffle
> between the normalization forms might be helpful for comparisons.

Such functionality is currently available in IBM's ICU, in GNOME's libunicode,
in Simon's libidn, and should become available in gnulib in some time. Please
contact me if you want to help with the gnulib implementation.

Bruno




Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                           linepos += width;
>                         if (iswspace (wide_char))
>                           goto mb_word_separator;
> +                       else if (uc_combining_class (wide_char) != 0)
> +                         chars--; /* don't count combining chars */
>                         in_word = true;
>                       }
>                     break;

If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m'
is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale. Recall that on some
systems (Solaris, FreeBSD, ...) in EUC-JP locale for example, the wide-character
representation of a double-byte character is unrelated to Unicode: the mbrtowc
routine just combines the two bytes in a single wchar_t with a bit of shifting
and masking; no conversion to Unicode takes place here.

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(using locale_charset() as encoding). It's defined in gnulib's "uniconv.h" file.
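
Whether the wchar_t values produced by mbrtowc may be handed directly to Unicode-based functions can be checked at compile time: ISO C defines the macro __STDC_ISO_10646__ when wchar_t values are ISO 10646 (Unicode) code points in every locale. A small sketch of such a guard follows; the helper names wchar_is_unicode and wchar_to_codepoint are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* True if the implementation promises that every wchar_t value is an
   ISO 10646 (Unicode) code point, whatever the locale's encoding.
   Where this is false, a conversion step is needed before calling
   functions that expect Unicode.  */
bool wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

/* Example guard: only reinterpret WC as a code point when that is safe. */
unsigned long wchar_to_codepoint(wchar_t wc)
{
    assert(wchar_is_unicode());
    return (unsigned long) wc;
}
```

On a system where wchar_is_unicode() returns false, the code would have to go through a real conversion from the locale's encoding instead of this shortcut.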

Bruno




Re: locales for testing

2008-05-08 Thread Bruno Haible
Jim Meyering wrote:
> you'll need to include the new test only if there is
> sufficient multi-byte support and if you can find a suitable locale to
> test with.

gnulib has a few autoconf macros to determine suitable locales:

  gt_LOCALE_FR_UTF8   - French locale with UTF-8 encoding
                      - Use this to verify basic operation in UTF-8 locales.

  gt_LOCALE_TR_UTF8   - Turkish locale with UTF-8 encoding
                      - Use this to verify upcase/downcase operations.

  gt_LOCALE_FR        - French locale with unibyte encoding
                      - Use this to verify classical unibyte locales.

  gt_LOCALE_ZH_CN     - Chinese locale with GB18030 encoding
                      - Use this to verify operation in locales which have
                        Unicode characters but don't use UTF-8.
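
At run time, the same idea (only run the test when a suitable locale exists) might look like the sketch below. The helper name first_usable_locale and any candidate locale names passed to it are assumptions for this example, not something the gnulib macros prescribe.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Return the first name in NAMES[0..N-1] that setlocale() accepts and
   whose character encoding is CODESET (e.g. "UTF-8"), or NULL if there
   is none.  On success the locale stays selected for LC_ALL; on failure
   the "C" locale is restored.  */
const char *first_usable_locale(const char *const *names, size_t n,
                                const char *codeset)
{
    assert(codeset != NULL && (n == 0 || names != NULL));
    for (size_t i = 0; i < n; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL
            && strcmp(nl_langinfo(CODESET), codeset) == 0)
            return names[i];
    }
    setlocale(LC_ALL, "C");
    return NULL;
}
```

A test program could call this with, say, "fr_FR.UTF-8" and "en_US.UTF-8", and exit with status 77 (which Automake test drivers treat as "skipped") when it returns NULL.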

Bruno




Re: BugReport about "ln" command worked in NTFS

2008-05-08 Thread Philip Rowlands

[ re-adding bug-coreutils@gnu.org ]

On Thu, 8 May 2008, [EMAIL PROTECTED] wrote:


> The complete log about running "ln" is in the attachment.


The strace -c output you posted shows 1 successful call to link(2), as 
I'd expect. It then shows further expected output from stat(1) that the 
link count is 2 for both filenames.


Your initial report stated that rm was failing to remove one of the 
links, but your sample output doesn't show any use of rm, so it's 
impossible to see the problem being demonstrated.


Please try running the following commands on the affected filesystem and 
send back the output:


$ touch test1
$ ln test1 test2
$ ls -l
$ strace -e trace=unlink rm test1
$ ls -l
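
For reference, what these commands exercise can also be reproduced in C: create a file, add a second hard link with link(2), and watch st_nlink go from 2 back to 1 after unlink(2). This is a rough sketch; the function name, the temporary file names and the use of long for the counts are choices made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create DIR/linkdemo-<pid>-a, hard-link it to DIR/linkdemo-<pid>-b,
   then remove the first name.  Stores the link count seen after link()
   and after unlink().  Returns 0 on success, -1 if any step fails
   (for instance on a filesystem without hard-link support).  */
int link_count_demo(const char *dir, long *after_link, long *after_unlink)
{
    char a[4096], b[4096];
    struct stat st;
    int rc = -1;

    assert(dir != NULL && after_link != NULL && after_unlink != NULL);
    int la = snprintf(a, sizeof a, "%s/linkdemo-%ld-a", dir, (long) getpid());
    int lb = snprintf(b, sizeof b, "%s/linkdemo-%ld-b", dir, (long) getpid());
    if (la < 0 || lb < 0 || (size_t) la >= sizeof a || (size_t) lb >= sizeof b)
        return -1;

    int fd = open(a, O_WRONLY | O_CREAT | O_EXCL, 0600);  /* like "touch" */
    if (fd < 0)
        return -1;
    close(fd);

    if (link(a, b) == 0) {                                /* like "ln a b" */
        if (stat(b, &st) == 0) {
            *after_link = (long) st.st_nlink;
            if (unlink(a) == 0 && stat(b, &st) == 0) {    /* like "rm a" */
                *after_unlink = (long) st.st_nlink;
                rc = 0;
            }
        }
        unlink(b);
    }
    unlink(a);  /* fails harmlessly if the name is already gone */
    return rc;
}
```

On a filesystem that behaves as expected, the two counts come back as 2 and 1; anything else would point at the filesystem rather than at ln or rm.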


Cheers,
Phil



Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Bruno Haible wrote:
> If you want wc to count characters after canonicalization, then you can
> invent a new wc command-line option for it. But I would find it more useful
> to have a filter program that reads from standard input and writes the
> canonicalized output to standard output; that would be applicable in many
> more situations.


I like the sound of that!

I suppose the not-yet-implemented gnulib Unicode normalization library
you mentioned in another post would be a prerequisite for such a tool.

I'm definitely interested in helping out here, but I think someone with
a more thorough understanding of Unicode would probably be more useful
(Pádraig?)

Bo



Re: horrible utf-8 performace in wc

2008-05-08 Thread Pádraig Brady
Bruno Haible wrote:
> As a consequence:
>   - The number of characters is the same as the number of wide characters.
>   - "wc -m" must output the number of characters.
>   - In a Unicode locale, a precomposed character is one character, and the
>     equivalent base character followed by a combining character is two
>     characters,

Fair enough.

> If you want wc to count characters after canonicalization, then you can
> invent a new wc command-line option for it.

I guess one could possibly have --chars={unicode,glyph,grapheme,column}
with unicode being the default, and how it currently works.

> But I would find it more useful
> to have a filter program that reads from standard input and writes the
> canonicalized output to standard output; that would be applicable in many
> more situations.

That would be _very_ useful, yes.

thanks for all the great info in this thread,
Pádraig.




Re: horrible utf-8 performace in wc

2008-05-08 Thread Bruno Haible
> $ time ./wc -m long_lines.txt
> 13357046 long_lines.txt
> real    0m1.860s

It processes at the speed of 7 million characters per second. I would not call
this a "horrible performance".

> However wc calls mbrtowc() for each multibyte character.

Yes. One could use mbstowcs (or mbsnrtowcs, but that exists in glibc only).
Or one can avoid the calls to mbrtowc() when the character is in the "basic
POSIX character set" (i.e. most of ASCII). This trick comes from Paul Eggert
and is already realized in gnulib's mbiter.h and mbswidth.c. Applied here,
it hardly changes the code but speeds it up by a factor of 3.
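
The idea can be sketched as follows. This is an illustration of the trick, not the patch itself: is_basic below is a simplified stand-in for the one in gnulib's mbchar.h, and count_chars_fast is a name invented for this example. It assumes an ASCII-compatible encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* True for printable members of the C basic character set plus common
   whitespace.  In the initial conversion state each of them is a
   complete one-byte character, so mbrtowc need not be asked.  */
static bool is_basic(char c)
{
    static const char basic[] =
        "\t\n\v\f\r !\"#%&'()*+,-./0123456789:;<=>?"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
        "abcdefghijklmnopqrstuvwxyz{|}~";
    return c != '\0' && strchr(basic, c) != NULL;
}

/* Count the multibyte characters in BUF[0..LEN-1] for the current locale.
   Invalid bytes are skipped without being counted; an incomplete
   character at the end of the buffer is ignored.  */
size_t count_chars_fast(const char *buf, size_t len)
{
    mbstate_t state;
    bool in_shift = false;          /* true while STATE may be non-initial */
    size_t chars = 0;

    assert(buf != NULL || len == 0);
    memset(&state, 0, sizeof state);
    while (len > 0) {
        size_t n;
        if (!in_shift && is_basic(*buf)) {
            n = 1;                  /* fast path: no mbrtowc call */
            chars++;
        } else {
            wchar_t wc;
            in_shift = true;
            n = mbrtowc(&wc, buf, len, &state);
            if (n == (size_t) -2)
                break;              /* incomplete character at the end */
            if (n == (size_t) -1) {
                n = 1;              /* invalid byte: skip it, uncounted */
                memset(&state, 0, sizeof state);
            } else {
                if (n == 0)
                    n = 1;          /* a NUL byte is one character */
                chars++;
            }
            if (mbsinit(&state))
                in_shift = false;
        }
        buf += n;
        len -= n;
    }
    return chars;
}
```

For mostly-ASCII input nearly every byte takes the first branch, which is where the saving comes from; the counts stay the same because mbrtowc is still consulted whenever the conversion state might not be initial.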

Timing with original coreutils-6.11:
$ time wc -w < SuSE-9.0-DVD-ARCHIVES 
6999399

real    2m26.211s
user    2m8.553s
sys     0m1.046s
$ time wc -m < SuSE-9.0-DVD-ARCHIVES 
120602576

real    2m17.754s
user    2m8.164s
sys     0m0.919s

Timing with this patch:
$ time /build/coreutils-6.11/src/wc -w < SuSE-9.0-DVD-ARCHIVES 
6999399

real    0m42.101s
user    0m40.179s
sys     0m0.875s
$ time /build/coreutils-6.11/src/wc -m < SuSE-9.0-DVD-ARCHIVES 
120602576

real    0m41.609s
user    0m40.171s
sys     0m0.908s

So the resulting counts are the same, and the time to process a 120 MB file
is reduced from 128 sec to 40 sec, i.e. the speed increases from 0.94 MB/sec
to 3.0 MB/sec.


2008-05-08  Bruno Haible  <[EMAIL PROTECTED]>

Speed up "wc -m" and "wc -w" in multibyte case.
* src/wc.c: Include mbchar.h.
(wc): New variable in_shift. Use it to avoid calling mbrtowc for most
ASCII characters.

*** coreutils-6.11/src/wc.c.bak 2008-04-19 23:34:23.0 +0200
--- coreutils-6.11/src/wc.c 2008-05-08 16:18:25.0 +0200
***************
*** 1,5 ****
  /* wc - print the number of lines, words, and bytes in files
!Copyright (C) 85, 91, 1995-2007 Free Software Foundation, Inc.
  
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
--- 1,5 ----
  /* wc - print the number of lines, words, and bytes in files
!Copyright (C) 85, 91, 1995-2008 Free Software Foundation, Inc.
  
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
***************
*** 28,33 ****
--- 28,34 ----
  #include "system.h"
  #include "error.h"
  #include "inttostr.h"
+ #include "mbchar.h"
  #include "quote.h"
  #include "readtokens0.h"
  #include "safe-read.h"
***************
*** 274,279 ****
--- 275,281 ----
bool in_word = false;
uintmax_t linepos = 0;
mbstate_t state = { 0, };
+   bool in_shift = false;
  # if SUPPORT_OLD_MBRTOWC
/* Back-up the state before each multibyte character conversion and
 move the last incomplete character of the buffer to the front
***************
*** 308,377 ****
  wchar_t wide_char;
  size_t n;
  
! # if SUPPORT_OLD_MBRTOWC
! backup_state = state;
! # endif
! n = mbrtowc (&wide_char, p, bytes_read, &state);
! if (n == (size_t) -2)
{
! # if SUPPORT_OLD_MBRTOWC
! state = backup_state;
! # endif
! break;
!   }
! if (n == (size_t) -1)
!   {
! /* Remember that we read a byte, but don't complain
!about the error.  Because of the decoding error,
!this is a considered to be byte but not a
!character (that is, chars is not incremented).  */
! p++;
! bytes_read--;
}
  else
{
  if (n == 0)
{
  wide_char = 0;
  n = 1;
}
! p += n;
! bytes_read -= n;
! chars++;
! switch (wide_char)
{
!   case '\n':
! lines++;
! /* Fall through. */
!   case '\r':
!   case '\f':
! if (linepos > linelength)
!   linelength = linepos;
! linepos = 0;
! goto mb_word_separator;
!   case '\t':
! linepos += 8 - (linepos % 8);
! goto mb_word_separator;
!   case ' ':
! linepos++;
! /* Fall through. */
!   case '\v':
!   mb_word_separator:
! words += in_word;
! in_word = false;
! break;
!   default:
! if (iswprint (wide_char))
!   {
! int width = wcwidth (wide_char);
! if (width > 0)
!   linepos += width;
!  

Re: horrible utf-8 performace in wc

2008-05-08 Thread Jim Meyering
Bruno Haible <[EMAIL PROTECTED]> wrote:
> 2008-05-08  Bruno Haible  <[EMAIL PROTECTED]>
>
>   Speed up "wc -m" and "wc -w" in multibyte case.
>   * src/wc.c: Include mbchar.h.
>   (wc): New variable in_shift. Use it to avoid calling mbrtowc for most
>   ASCII characters.

Thanks!
I've applied that.

