On Wed, Jan 15, 2014 at 11:27:23AM +0000, sin wrote: > On Tue, Jan 14, 2014 at 09:35:11AM +0100, Silvan Jegen wrote: > > On Tue, Jan 14, 2014 at 12:22 AM, <q...@c9x.me> wrote: > > > On Mon, Jan 13, 2014 at 11:19:49AM -0800, Silvan Jegen wrote: > > >> I have rewritten "tr" to use mmap and the wchar.h functions. It seems > > >> to be quite slow but as far as I can tell it works reasonably well (at > > >> least when using a UTF-8 locale). Comments/review and testing welcome > > >> (I am relatively new to C so beware)! > > >> > > >> + void (*mapfunc) (const wchar_t*, char*); > > >> + > > >> + setlocale(LC_ALL, ""); > > >> + > > >> + mappings = (wchar_t *) mmap(NULL, 0x110000 * sizeof(wchar_t), > > >> PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); > > >> + > > >> + if(argc >= 2) { > > >> + parse_mapping(argv[0], argv[1], mappings); > > >> + mapfunc = map_to_set; > > >> + } else { > > >> + parse_mapping(argv[0], NULL, mappings); > > >> + mapfunc = map_to_null; > > >> + } > > >> + > > >> + while(afgets(&buf, &size, stdin)) > > >> + mapfunc(mappings, buf); > > > > > > Hi, > > > > > > if you want it to be faster I would suggest dropping the function > > > pointer so the compiler can inline instead of doing an indirect > > > call. > > > I wasn't aware of the performance implications when using a function > > pointer. > > How slow is it currently? Can you provide your benchmark measurements?
I only did some superficial testing running 'time tr di DI' on a text file 152MB in size (on my desktop while playing music and running transmission-daemon). Using the function pointer: real 0m6.816s user 0m5.567s sys 0m0.170s real 0m8.070s user 0m5.783s sys 0m0.167s real 0m7.921s user 0m5.697s sys 0m0.183s real 0m8.295s user 0m5.713s sys 0m0.187s real 0m8.533s user 0m5.920s sys 0m0.207s real 0m7.840s user 0m5.683s sys 0m0.167s real 0m7.830s user 0m5.563s sys 0m0.193s real 0m7.862s user 0m5.590s sys 0m0.167s real 0m7.145s user 0m5.743s sys 0m0.163s real 0m7.973s user 0m5.827s sys 0m0.183s Average real: 7.828500000000 No function pointer: real 0m6.859s user 0m5.690s sys 0m0.183s real 0m7.895s user 0m5.647s sys 0m0.207s real 0m8.170s user 0m5.667s sys 0m0.183s real 0m7.872s user 0m5.660s sys 0m0.180s real 0m8.135s user 0m5.713s sys 0m0.173s real 0m8.166s user 0m5.697s sys 0m0.160s real 0m8.073s user 0m5.737s sys 0m0.130s real 0m8.007s user 0m5.693s sys 0m0.217s real 0m7.801s user 0m5.553s sys 0m0.150s real 0m7.858s user 0m5.673s sys 0m0.177s Average real: 7.883600000000 GNU coreutils' tr for comparison: real 0m1.411s user 0m0.123s sys 0m0.167s real 0m2.744s user 0m0.107s sys 0m0.183s real 0m2.618s user 0m0.127s sys 0m0.180s real 0m2.248s user 0m0.093s sys 0m0.167s real 0m2.510s user 0m0.090s sys 0m0.183s real 0m2.234s user 0m0.120s sys 0m0.180s real 0m2.512s user 0m0.127s sys 0m0.157s real 0m2.602s user 0m0.097s sys 0m0.173s real 0m2.517s user 0m0.127s sys 0m0.157s real 0m2.480s user 0m0.103s sys 0m0.157s Note, though, that GNU's tr does not seem to handle Unicode at all[1] while this version of tr, according to "perf record/report", seems to spend most of its running time in the Unicode handling functions of glibc. > Most tools in sbase are really not what I'd call blazingly fast and > that is not the point. I like to keep things simple and reliable > as much as possible. If we can optimize it without sacrificing simplicity > then go for it. > > I do not think inlining makes sense here at all. By no means was this any serious benchmarking but eliminating the function pointer did not seem to make an obvious difference. I would still go for the function-pointer-less version of the code since it actually is one line shorter, I think. The second, function-pointer-less version of the code can be found below. I will start writing a man page (possibly based on the GNU one) as soon as I find the time (hopefully in the next few days). [1] http://en.wikipedia.org/wiki/Tr_(Unix) --->8--- Subject: [PATCH] Add the tr program --- Makefile | 1 + tr.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 141 insertions(+) create mode 100644 tr.c diff --git a/Makefile b/Makefile index 81dfaf6..ee84221 100644 --- a/Makefile +++ b/Makefile @@ -81,6 +81,7 @@ SRC = \ tee.c \ test.c \ touch.c \ + tr.c \ true.c \ tty.c \ uname.c \ diff --git a/tr.c b/tr.c new file mode 100644 index 0000000..b163f56 --- /dev/null +++ b/tr.c @@ -0,0 +1,140 @@ +/* See LICENSE file for copyright and license details. */ +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <locale.h> +#include <wchar.h> +#include "text.h" +#include "util.h" + +static void +usage(void) +{ + eprintf("usage: %s set1 [set2]\n", argv0); +} + +void +handleescapes(char *s) +{ + switch(*s) { + case 'n': + *s = '\x0A'; + break; + case 't': + *s = '\x09'; + break; + case '\\': + *s = '\x5c'; + break; + } +} + +void +parsemapping(const char *set1, const char *set2, wchar_t *mappings) +{ + char *s; + wchar_t runeleft; + wchar_t runeright; + int leftbytes; + int rightbytes; + size_t n = 0; + size_t lset2; + + if(set2) { + lset2 = strnlen(set2, 255 * sizeof(wchar_t)); + } else { + set2 = &set1[0]; + lset2 = 0; + } + + s = (char *) set1; + while(*s) { + if(*s == '\\') { + handleescapes(++s); + } + + leftbytes = mbtowc(&runeleft, s, 4); + if(set2[n] != '\0') + rightbytes = mbtowc(&runeright, set2 + n, 4); + mappings[runeleft] = runeright; + + s += leftbytes; + if(n < lset2) + n += rightbytes; + } +} + +void +maptonull(const wchar_t *mappings, char *in) +{ + const char *s; + wchar_t runeleft; + int leftbytes = 0; + + s = in; + while(*s) { + leftbytes = mbtowc(&runeleft, s, 4); + if(!mappings[runeleft]) + putwchar(runeleft); + s += leftbytes; + } +} + +void +maptoset(const wchar_t *mappings, char *in) +{ + const char *s; + wchar_t runeleft; + int leftbytes = 0; + + s = in; + while(*s) { + leftbytes = mbtowc(&runeleft, s, 4); + if(!mappings[runeleft]) { + putwchar(runeleft); + } else { + putwchar(mappings[runeleft]); + } + s += leftbytes; + } +} + +int +main(int argc, char *argv[]) +{ + wchar_t *mappings; + char *buf = NULL; + size_t size = 0; + + setlocale(LC_ALL, ""); + + mappings = (wchar_t *) mmap(NULL, 0x110000 * sizeof(wchar_t), + PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); + + ARGBEGIN { + default: + usage(); + } ARGEND; + + if(argc == 0) + usage(); + + if(argc >= 2) { + parsemapping(argv[0], argv[1], mappings); + while(afgets(&buf, &size, stdin)) + maptoset(mappings, buf); + } else { + parsemapping(argv[0], NULL, mappings); + while(afgets(&buf, &size, stdin)) + maptonull(mappings, buf); + } + + free(buf); + + if(ferror(stdin)) + eprintf("<stdin>: read error:"); + + return EXIT_SUCCESS; +} -- 1.8.5.2