On Wed, Jan 15, 2014 at 09:36:07PM +0100, Szabolcs Nagy wrote:
> * Silvan Jegen <[email protected]> [2014-01-15 20:43:54 +0100]:
> > Note, though, that GNU's tr does not seem to handle Unicode at all[1]
> > while this version of tr, according to "perf record/report", seems to
> > spend most of its running time in the Unicode handling functions of glibc.
> 
> multi-byte string decoding is known to be slow in glibc
> 
> eg see the utf8 decoding benchmark in
> http://www.etalabs.net/compare_libcs.html

I installed musl libc and used musl-gcc to compile this tr implementation
(no code changes were necessary). Using the same input file, I get the
following numbers:

real    0m2.690s
user    0m2.597s
sys     0m0.187s

real    0m2.644s
user    0m2.590s
sys     0m0.143s

real    0m2.648s
user    0m2.543s
sys     0m0.200s

That's actually quite impressive.


> > By no means was this any serious benchmarking but eliminating the function
> > pointer did not seem to make an obvious difference.
> 
> note that recent gcc (4.7?) can do function pointer inlining
> if it can infere that the function is in the same tu
> (and with lto it can probably do cross-tu inlining)
> 
> > +void
> > +handleescapes(char *s)
> > +{
> > +   switch(*s) {
> > +   case 'n':
> > +           *s = '\x0A';
> > +           break;
> > +   case 't':
> > +           *s = '\x09';
> > +           break;
> > +   case '\\':
> > +           *s = '\x5c';
> 
> what's wrong with '\n' etc here?

I am not sure what you mean. My interpretations:

1. Why no '\n' in the case statements?

I don't think that's possible but I could be wrong.


2. Why are you escaping '\n'?

Because I assume that the user wants to replace or delete the newlines
(or tabs) in the input when they put '\n' (or '\t') into the first
character set argument.
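
If the question was instead about the assigned values: the case labels
match the character that follows the backslash, so they have to stay 'n',
't' and '\\', but the values written back could be spelled '\n', '\t' and
'\\' rather than as hex codes. A minimal sketch of that idea follows. Note
that unescape() is a hypothetical helper made up for illustration, not the
handleescapes() from the patch, and it rewrites the whole string in place
instead of a single character:

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patch): collapse the two-character
 * sequences \n, \t and \\ in s into the single byte they denote, in place.
 * Unknown escapes and a trailing backslash are copied unchanged.
 * Returns the new length of s.
 */
size_t
unescape(char *s)
{
	char *r = s, *w = s;

	while (*r) {
		if (r[0] == '\\' && r[1] != '\0') {
			/* the label is the character after the backslash */
			switch (r[1]) {
			case 'n':
				*w++ = '\n';
				r += 2;
				continue;
			case 't':
				*w++ = '\t';
				r += 2;
				continue;
			case '\\':
				*w++ = '\\';
				r += 2;
				continue;
			}
		}
		*w++ = *r++;
	}
	*w = '\0';
	return (size_t)(w - s);
}
```

Called on a writable copy of the first set argument, this would turn the
four bytes of a\nb (as typed in the shell) into 'a', newline, 'b'.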


> btw a fully posix conformant tr implementation is available here:
> http://git.musl-libc.org/cgit/noxcuse/tree/src/tr.c

Looks interesting, but I would need to take a longer look (and I caught
a cold, so that has to wait...). I noticed that it uses the thread-safe
version of the mbtowc function. Do you think that is advisable in
general?
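
For context, this is roughly what I understand the difference to be,
assuming the function in question is mbrtowc() (I have not checked that
against the linked source). mbtowc() keeps its conversion state internally
and is not required to be thread-safe, while mbrtowc() takes the state as
an explicit mbstate_t argument. A minimal sketch, where countchars() is a
hypothetical helper made up for illustration:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the multibyte characters in s with
 * mbrtowc() and a caller-owned conversion state. Decoding follows the
 * current LC_CTYPE locale. Returns (size_t)-1 on an invalid or
 * truncated sequence.
 */
size_t
countchars(const char *s)
{
	mbstate_t st;
	wchar_t wc;
	size_t n, len = strlen(s), count = 0;

	memset(&st, 0, sizeof st);	/* initial conversion state */
	while (len > 0) {
		n = mbrtowc(&wc, s, len, &st);
		if (n == (size_t)-1 || n == (size_t)-2)
			return (size_t)-1;	/* invalid or incomplete */
		if (n == 0)
			break;	/* decoded a NUL; guard against looping */
		s += n;
		len -= n;
		count++;
	}
	return count;
}
```

To count UTF-8 characters rather than bytes, the caller would first have
to select a UTF-8 locale with setlocale(LC_CTYPE, "").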

