[dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
Hello fellow hackers,

the current tr(1)-implementation has really been giving me nightmares,
so I rewrote it.
Given POSIX really sucks in some areas, I went off the path at some areas,
but not in a way that it would break scripts.

Here's a comparison and you let me know what you prefer:

1) GNU coreutils:

$ echo "Motörhead" | tr öo oö
M�to�rhead

What happens? coreutils-tr(1) doesn't support multibyte characters and
actually interprets ö as multiple single characters, which is the reason
why it obviously messes it up.

2) old sbase-tr:

$ echo "Motörhead" | ./tr öo oö
Mötorhead
$ echo "xx" | ./tr -s " "
usage: ./tr [-d] [-c] set1 [set2]
$ wc -l tr.c
356 tr.c

Oh geez! You can't squeeze! Well, seems like I have to use coreutils now.

3) new tr:

$ echo "Motörhead" | ./tr öo oö
Mötorhead
$ echo "xx" | ./tr -s " "
x x
$ wc -l tr.c
243 tr.c

Works just fine!

Please test it and let me know what you think!

Cheers

FRIGN

-- 
FRIGN

From 2ff2c365fac5a0c0c0b6ee88cbbb4502a2dcf0a6 Mon Sep 17 00:00:00 2001
From: FRIGN
Date: Fri, 9 Jan 2015 20:36:27 +0100
Subject: [PATCH] Rewrite tr(1) in a sane way

tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:

 - UTF-8: not allowed in POSIX, but in my opinion a must. This
   finally allows you to work with UTF-8 streams without
   problems or unexpected behaviour.
 - Equivalence classes: Left out, even GNU coreutils ignore them
   and depending on LC_COLLATE, which sucks.
 - Character classes: No experiments or environment-variable-trickery.
   Just plain definitions derived from the POSIX-standard,
   working as expected.

I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for
literal expressions ('\n', '\t', '\001', ...) and probably
rethinking the way [_*n] is unnecessarily restricted to string2.
---
 Makefile | 1 +
 tr.c     | 487 ---
 utf.h    | 1 +
 3 files changed, 189 insertions(+), 300 deletions(-)

diff --git a/Makefile b/Makefile
index 14dc982..b4565bf 100644
--- a/Makefile
+++ b/Makefile
@@ -20,6 +20,7 @@ HDR =\

 LIBUTF = libutf.a
 LIBUTFSRC =\
+	libutf/chartorunearr.c\
 	libutf/readrune.c\
 	libutf/rune.c\
 	libutf/runetype.c\
diff --git a/tr.c b/tr.c
index b661048..cfd1c94 100644
--- a/tr.c
+++ b/tr.c
@@ -1,356 +1,243 @@
-/* See LICENSE file for copyright and license details. */
-#include
 #include
 #include
-#include
-#include

-#include "text.h"
+#include "utf.h"
 #include "util.h"

-static void
-usage(void)
-{
-	eprintf("usage: %s [-d] [-c] set1 [set2]\n", argv0);
-}
-
-static int dflag, cflag;
-static wchar_t mappings[0x11];
+static int cflag = 0;
+static int dflag = 0;
+static int sflag = 0;

-struct wset_state {
-	char *s; /* current character */
-	wchar_t rfirst, rlast; /* first and last in range */
-	wchar_t prev; /* previous returned character */
-	int prev_was_range;/* was the previous character part of a c-c range? */
+struct range {
+	Rune start;
+	Rune end;
+	size_t quant;
 };

-struct set_state {
-	char *s, rfirst, rlast, prev;
-	int prev_was_octal; /* was the previous returned character written in octal? */
+#define DIGIT "0-9"
+#define UPPER "A-Z"
+#define LOWER "a-z"
+#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
+#define ALNUM DIGIT UPPER LOWER
+
+struct class {
+	char *name;
+	char *str;
+} classes[] = {
+	{ "alnum", ALNUM },
+	{ "alpha", UPPER LOWER },
+	{ "blank", " \t" },
+	{ "cntrl", "\000-\037\177" },
+	{ "digit", DIGIT },
+	{ "graph", ALNUM PUNCT },
+	{ "lower", LOWER },
+	{ "print", ALNUM PUNCT " " },
+	{ "punct", PUNCT },
+	{ "space", "\t\n\v\f\r"},
+	{ "upper", UPPER },
+	{ "xdigit", DIGIT "A-Fa-f" },
 };

-static void
-set_state_defaults(struct set_state *s)
-{
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_octal = 1;
-}
+struct range *set1 = NULL;
+size_t set1ranges = 0;
+struct range *set2 = NULL;
+size_t set2ranges = 0;

-static void
-wset_state_defaults(struct wset_state *s)
+static size_t
+rangelen(struct range r)
 {
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_range = 1;
+	return (r.end - r.start + 1) * r.quant;
 }

-/* sets *s to the char that was intended to be written.
- * returns how many bytes the s pointer has to advance to skip the
- * escape sequence if it was an octal, always zero otherwise. */
-static int
-resolve_escape(char *s)
+static size_t
+setlen(struct range *set, size_t setra
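To make the `struct range` bookkeeping in the patch concrete, here is a minimal sketch of how set lengths fall out of `rangelen()`. `Rune` is typedef'd locally as a stand-in for the libutf type, and `setlen_sketch()` is only a guess at what the truncated `setlen()` does (summing `rangelen()` over all ranges); it is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune; /* stand-in for libutf's Rune; an assumption */

struct range {
	Rune   start;
	Rune   end;
	size_t quant; /* repetition count, as in the patch */
};

/* Same formula as in the patch: runes covered, times the repeat count. */
static size_t
rangelen(struct range r)
{
	return (r.end - r.start + 1) * r.quant;
}

/* Hypothetical: total length of a set as the sum over its ranges. */
static size_t
setlen_sketch(const struct range *set, size_t nranges)
{
	size_t i, len = 0;

	for (i = 0; i < nranges; i++)
		len += rangelen(set[i]);
	return len;
}
```

Under these assumptions, "a-z" stored as { 'a', 'z', 1 } has length 26, and a single rune repeated three times, stored as { 'x', 'x', 3 }, has length 3.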
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
Quoth FRIGN:
> - UTF-8: not allowed in POSIX, but in my opinion a must. This
>   finally allows you to work with UTF-8 streams without
>   problems or unexpected behaviour.

I fully agree (unsurprisingly). Anything that relies on the POSIX
behaviour to do weird things involving multibyte characters is
insane.
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, Jan 9, 2015, at 16:44, Nick wrote:
> Quoth FRIGN:
> > - UTF-8: not allowed in POSIX, but in my opinion a must. This
> >   finally allows you to work with UTF-8 streams without
> >   problems or unexpected behaviour.
>
> I fully agree (unsurprisingly). Anything that relies on the POSIX
> behaviour to do weird things involving multibyte characters is
> insane.

Er...
http://pubs.opengroup.org/onlinepubs/009696899/utilities/tr.html has
very little mention of the issue one way or another, but does use the
term "characters" rather than "bytes" in all relevant places, and talks
about "multi-byte characters" in a tone that suggests they should be
supported properly when LC_CTYPE has them.

The only _questionable_ bits are some of the language surrounding the
use of octal sequences:

For single characters: "Multi-byte characters require multiple,
concatenated escape sequences of this type, including the leading '\'
for each byte."

I read this as meaning that multi-byte characters are supported, and in
fact that "tr '\303\266o' 'o\303\266'" means that \303\266 [two escape
sequences representing one multi-byte character] and o will be swapped -
and that it is not possible to specify multibyte characters with octal
values in a dash-separated range specification (but they can be included
as literals).

Or, is it possible that FRIGN misinterpreted the prohibition on
"multi-character collating elements" ?
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, 09 Jan 2015 17:41:19 -0500
random...@fastmail.us wrote:

> Or, is it possible that FRIGN misinterpreted the prohibition on
> "multi-character collating elements" ?

Did you read what I said? I explicitly went away from POSIX in this regard,
because no human would write "tr '\303\266o' 'o\303\266'".
The reason why POSIX prohibits collating elements is only because they are
inhibited by their own overload of different character sets and locales.
Given assuming a UTF-8-locale is a very sane way to go (see Plan 9), this
limit can easily be thrown off and makes life easier.

Cheers

FRIGN

-- 
FRIGN
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, Jan 9, 2015, at 17:48, FRIGN wrote:
> Did you read what I said? I explicitly went away from POSIX in this
> regard, because no human would write "tr '\303\266o' 'o\303\266'".

POSIX doesn't require people to write it, it just requires that it
works. POSIX has no problem with also allowing a literally typed
multibyte character to refer to itself. It's basically saying that if
someone _does_ write '\303\266o' 'o\303\266', you have to treat it the
same as öo oö, and not as the individual bytes.

> The reason why POSIX prohibits collating elements is only because they
> are inhibited by their own overload of different character sets and
> locales.
> Given assuming a UTF-8-locale is a very sane way to go (see Plan 9), this
> limit can easily be thrown off and makes life easier.

I don't think you're understanding the difference between
multi-character collating elements and multibyte characters.

Multi-character collating elements are things like "ch" in some Spanish
locales. They have nothing to do with UTF-8.
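The reading of POSIX described in this message can be sketched as a two-step process: expand each `\ooo` escape to one byte, then treat the resulting byte string exactly as if it had been typed literally. The helper name `unescape_octal()` is hypothetical, and the sketch handles only octal escapes of one to three digits; it is not taken from sbase or from any other tr(1) implementation.

```c
#include <stddef.h>

/*
 * Expand \o, \oo and \ooo escapes in `in` to single bytes and copy all
 * other bytes unchanged. Writes at most cap-1 bytes plus a NUL to `out`
 * and returns the number of bytes written (excluding the NUL).
 */
static size_t
unescape_octal(const char *in, unsigned char *out, size_t cap)
{
	size_t n = 0;

	if (cap == 0)
		return 0;
	while (*in && n + 1 < cap) {
		if (in[0] == '\\' && in[1] >= '0' && in[1] <= '7') {
			unsigned v = 0;
			int digits = 0;

			in++; /* skip the backslash */
			while (digits < 3 && *in >= '0' && *in <= '7') {
				v = v * 8 + (unsigned)(*in - '0');
				in++;
				digits++;
			}
			out[n++] = (unsigned char)(v & 0xFF);
		} else {
			out[n++] = (unsigned char)*in++;
		}
	}
	out[n] = '\0';
	return n;
}
```

Feeding it the argument `\303\266o` yields the three bytes 0xC3 0xB6 'o', which is the same byte string as a literally typed "öo" in UTF-8, so a character-aware tr(1) would see two characters in either case.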
Re: [dev] [sbase] [PATCH-UPDATE] Rewrite tr(1) in a sane way
On Fri, 9 Jan 2015 20:39:48 +0100
FRIGN wrote:

sin just told me the patch was missing chartorunearr.c which in fact
is the case.
Here's an updated patch which should cleanly apply to a vanilla codebase
at HEAD.

Cheers

FRIGN

-- 
FRIGN

From f626eecfb757ab46cab7f16dc439258a6a497f1b Mon Sep 17 00:00:00 2001
From: FRIGN
Date: Fri, 9 Jan 2015 20:36:27 +0100
Subject: [PATCH] Rewrite tr(1) in a sane way

tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:

 - UTF-8: not allowed in POSIX, but in my opinion a must. This
   finally allows you to work with UTF-8 streams without
   problems or unexpected behaviour.
 - Equivalence classes: Left out, even GNU coreutils ignore them
   and depending on LC_COLLATE, which sucks.
 - Character classes: No experiments or environment-variable-trickery.
   Just plain definitions derived from the POSIX-standard,
   working as expected.

I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for
literal expressions ('\n', '\t', '\001', ...) and probably
rethinking the way [_*n] is unnecessarily restricted to string2.
---
 Makefile               | 1 +
 libutf/chartorunearr.c | 27 +++
 tr.c                   | 487 +++--
 utf.h                  | 1 +
 4 files changed, 216 insertions(+), 300 deletions(-)
 create mode 100644 libutf/chartorunearr.c

diff --git a/Makefile b/Makefile
index 14dc982..b4565bf 100644
--- a/Makefile
+++ b/Makefile
@@ -20,6 +20,7 @@ HDR =\

 LIBUTF = libutf.a
 LIBUTFSRC =\
+	libutf/chartorunearr.c\
 	libutf/readrune.c\
 	libutf/rune.c\
 	libutf/runetype.c\
diff --git a/libutf/chartorunearr.c b/libutf/chartorunearr.c
new file mode 100644
index 000..8d13e1f
--- /dev/null
+++ b/libutf/chartorunearr.c
@@ -0,0 +1,27 @@
+/* See LICENSE file for copyright and license details. */
+#include
+#include
+
+#include "../util.h"
+#include "../utf.h"
+
+int
+chartorunearr(const char *str, Rune **r)
+{
+	size_t len = strlen(str), rlen, roff, ret, i;
+	Rune s;
+
+	for (rlen = 0, roff = 0; roff < len && ret; rlen++) {
+		ret = charntorune(&s, str + roff, MAX(UTFmax, len - roff));
+		roff += ret;
+	}
+
+	*r = emalloc(rlen * sizeof(Rune) + 1);
+	(*r)[rlen] = 0;
+
+	for (i = 0, roff = 0; roff < len && i < rlen; i++) {
+		roff += charntorune(&(*r)[i], str + roff, MAX(UTFmax, len - roff));
+	}
+
+	return rlen;
+}
diff --git a/tr.c b/tr.c
index b661048..cfd1c94 100644
--- a/tr.c
+++ b/tr.c
@@ -1,356 +1,243 @@
-/* See LICENSE file for copyright and license details. */
-#include
 #include
 #include
-#include
-#include

-#include "text.h"
+#include "utf.h"
 #include "util.h"

-static void
-usage(void)
-{
-	eprintf("usage: %s [-d] [-c] set1 [set2]\n", argv0);
-}
-
-static int dflag, cflag;
-static wchar_t mappings[0x11];
+static int cflag = 0;
+static int dflag = 0;
+static int sflag = 0;

-struct wset_state {
-	char *s; /* current character */
-	wchar_t rfirst, rlast; /* first and last in range */
-	wchar_t prev; /* previous returned character */
-	int prev_was_range;/* was the previous character part of a c-c range? */
+struct range {
+	Rune start;
+	Rune end;
+	size_t quant;
 };

-struct set_state {
-	char *s, rfirst, rlast, prev;
-	int prev_was_octal; /* was the previous returned character written in octal? */
+#define DIGIT "0-9"
+#define UPPER "A-Z"
+#define LOWER "a-z"
+#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
+#define ALNUM DIGIT UPPER LOWER
+
+struct class {
+	char *name;
+	char *str;
+} classes[] = {
+	{ "alnum", ALNUM },
+	{ "alpha", UPPER LOWER },
+	{ "blank", " \t" },
+	{ "cntrl", "\000-\037\177" },
+	{ "digit", DIGIT },
+	{ "graph", ALNUM PUNCT },
+	{ "lower", LOWER },
+	{ "print", ALNUM PUNCT " " },
+	{ "punct", PUNCT },
+	{ "space", "\t\n\v\f\r"},
+	{ "upper", UPPER },
+	{ "xdigit", DIGIT "A-Fa-f" },
 };

-static void
-set_state_defaults(struct set_state *s)
-{
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_octal = 1;
-}
+struct range *set1 = NULL;
+size_t set1ranges = 0;
+struct range *set2 = NULL;
+size_t set2ranges = 0;

-static void
-wset_state_defaults(struct wset_state *s)
+static size_t
+rangelen(struct range r)
 {
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_range = 1;
+	return (r.end - r.start + 1) * r.quant;
 }

-/* sets *s to the char that was intended to be written.
- * returns how many bytes the s pointer has to advance to skip the
- * escape sequence if it was an octal, always zero
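Reading `chartorunearr()` as posted, three details look suspicious: `ret` is tested in the first loop condition before it has been assigned, `MAX(UTFmax, len - roff)` passes at least UTFmax as the byte limit even when fewer bytes remain (MIN would be the expected bound), and `rlen * sizeof(Rune) + 1` reserves one extra byte where the terminating 0 rune needs a whole `Rune`. The following is a minimal sketch of the same two-pass idea with those points addressed. It does not use libutf: `Rune`, `decode_rune()` and `str_to_runes()` are local stand-ins chosen for illustration, and the decoder is deliberately simplified.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Rune;     /* stand-in for libutf's Rune; an assumption */
#define RUNE_ERROR 0xFFFDu /* U+FFFD REPLACEMENT CHARACTER */

/*
 * Decode one UTF-8 sequence from at most n bytes (n >= 1) and return the
 * number of bytes consumed (always >= 1). Malformed or truncated input
 * yields RUNE_ERROR and consumes one byte. Overlong forms and surrogates
 * are not rejected in this sketch.
 */
static size_t
decode_rune(Rune *r, const char *s, size_t n)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t len, i;
	Rune c;

	*r = RUNE_ERROR;
	if (p[0] < 0x80) {
		*r = p[0];
		return 1;
	} else if ((p[0] & 0xE0) == 0xC0) {
		len = 2;
		c = p[0] & 0x1F;
	} else if ((p[0] & 0xF0) == 0xE0) {
		len = 3;
		c = p[0] & 0x0F;
	} else if ((p[0] & 0xF8) == 0xF0) {
		len = 4;
		c = p[0] & 0x07;
	} else {
		return 1; /* stray continuation or invalid lead byte */
	}
	if (len > n)
		return 1; /* sequence cut off by end of input */
	for (i = 1; i < len; i++) {
		if ((p[i] & 0xC0) != 0x80)
			return 1;
		c = (c << 6) | (p[i] & 0x3F);
	}
	*r = c;
	return len;
}

/*
 * Convert a NUL-terminated UTF-8 string to a freshly allocated,
 * 0-terminated Rune array. Returns the rune count, or -1 if the
 * allocation fails. The caller frees *out.
 */
static int
str_to_runes(const char *str, Rune **out)
{
	size_t len = strlen(str), off, rlen, i;
	Rune tmp;

	/* Pass 1: count runes; the byte limit never exceeds what is left. */
	for (rlen = 0, off = 0; off < len; rlen++)
		off += decode_rune(&tmp, str + off, len - off);

	/* Room for rlen runes plus one terminating 0 rune. */
	*out = malloc((rlen + 1) * sizeof(Rune));
	if (*out == NULL)
		return -1;

	/* Pass 2: decode into the array. */
	for (i = 0, off = 0; off < len; i++)
		off += decode_rune(&(*out)[i], str + off, len - off);
	(*out)[rlen] = 0;

	return (int)rlen;
}
```

Because the decoder always consumes at least one byte, both loops terminate without a separate `ret` check, which is the design choice that removes the uninitialized read.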
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, 09 Jan 2015 17:55:04 -0500
random...@fastmail.us wrote:

> POSIX doesn't require people to write it, it just requires that it
> works. POSIX has no problem with also allowing a literally typed
> multibyte character to refer to itself. It's basically saying that if
> someone _does_ write '\303\266o' 'o\303\266', you have to treat it the
> same as öo oö, and not as the individual bytes.

This is madness. If you want the bytes to be collated, you just write the
literal \50102.
POSIX often is a solution to a problem that doesn't exist in the first
place when you just use UTF-8.

> They have nothing to do with UTF-8.

That's exactly the point. Collating elements are depending on the current
locale which is too much of a mess to deal with.
So when the Spanish "ll" collates before "m" and after "l" in a given
locale, we don't give a fuck.
So please give me the point why you are torturing me with this information.
I stated that I did not implement collating elements into this tr(1) at
the beginning and that it's a POSIX-nightmare to do so, bringing harm
to anybody who is interested in a consistent, usable tool.

Cheers

FRIGN

-- 
FRIGN
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, Jan 9, 2015, at 18:08, FRIGN wrote:
> This is madness. If you want the bytes to be collated,

I don't see where you're getting that either of us want the bytes to be
collated. I don't even know what you mean by "collated", since collating
is not what tr does, except when ordering ranges.

> you just write the
> literal \50102.

Even if octal values could be more than three digits, I have no idea
what you think 50102 is. Its decimal value is 20546. Its hex value is
0x5042. I have no idea what it has to do with character U+00F6 whose
UTF-8 representation is 0xC3 0xB6.

I just realized what you're doing, 0xC3B6 has the _decimal_ value 50102,
I have no idea why you would think _that_ is a representation people
would want to use. If you're so pro-unicode, make it accept \u00F6 -
that's a valid extension. But reusing the syntax POSIX uses for
three-digit octal literals, for arbitrarily long decimal literals that
aren't even unicode code points, makes no sense at all. In what universe
is that intuitive?

> POSIX often is a solution to a problem that doesn't exist
> in the first place when you just use UTF-8.
>
> > They have nothing to do with UTF-8.
>
> That's exactly the point. Collating elements are depending on the current
> locale which is too much of a mess to deal with.

Huh?

> So when the Spanish "ll" collates before "m" and after "l" in a given
> locale, we don't give a fuck.
> So please give me the point why you are torturing me with this
> information.

Because collating elements are the thing POSIX forbids which you appear
to have _misinterpreted_ as forbidding multibyte characters. Otherwise I
have _no idea_ what in POSIX you interpret as preventing reasonable
behavior with UTF-8 multibyte characters.

> I stated that I did not implement collating elements into this tr(1) at
> the beginning and that it's a POSIX-nightmare to do so, bringing harm
> to anybody who is interested in a consistent, usable tool.

tl;dr:
Collating elements = POSIX forbids them = You don't want them anyway.
Multibyte characters = POSIX allows/requires them = You like them too.
What is the problem?

I don't know what you want to do that you think POSIX doesn't allow.
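The numbers in this exchange can be checked mechanically. The sketch below encodes a code point as UTF-8 and packs the resulting bytes into one integer, which is how U+00F6 turns into 0xC3B6, i.e. decimal 50102. Both function names are hypothetical helpers written for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp (<= 0x10FFFF) as UTF-8; returns bytes written. */
static size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (cp >> 18));
	out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Pack the encoded bytes big-endian into one integer: C3 B6 -> 0xC3B6. */
static uint32_t
utf8_bytes_as_int(uint32_t cp)
{
	unsigned char buf[4];
	size_t i, n = utf8_encode(cp, buf);
	uint32_t v = 0;

	for (i = 0; i < n; i++)
		v = (v << 8) | buf[i];
	return v;
}
```

`utf8_bytes_as_int(0xF6)` gives 0xC3B6, which equals 50102 in decimal. Read as an octal literal instead, `050102` is 20546, or 0x5042, matching the figures quoted in the message.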
Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way
On Fri, 09 Jan 2015 18:24:46 -0500
random...@fastmail.us wrote:

> Even if octal values could be more than three digits, I have no idea
> what you think 50102 is. Its decimal value is 20546. Its hex value is
> 0x5042. I have no idea what it has to do with character U+00F6 whose
> UTF-8 representation is 0xC3 0xB6. I just realized what you're
> doing, 0xC3B6 has the _decimal_ value 50102, I have no idea why you
> would think _that_ is a representation people would want to use. If
> you're so pro-unicode, make it accept \u00F6 - that's a valid extension.
> But reusing the syntax POSIX uses for three-digit octal literals, for
> arbitrarily long decimal literals that aren't even unicode code points,
> makes no sense at all. In what universe is that intuitive?

C3B6 is 'ö' and makes sense to allow specifying it as \50102 (in the pure
UTF-8-sense of course, nothing to do with collating).

> Collating elements = POSIX forbids them = You don't want them anyway.
> Multibyte characters = POSIX allows/requires them = You like them too.
> What is the problem?
> I don't know what you want to do that you think POSIX doesn't allow.

Well, probably I misunderstood the matter. Sometimes this stuff gets
above my head. ;)

At the end of the day, you want software to work as expected:

GNU tr:
$ echo ελληνική | tr [α-ω] [Α-Ω]
®

our tr:
$ echo ελληνικη | ./tr [α-ω] [Α-Ω]
ΕΛΛΗΝΙΚΗ

Cheers

FRIGN

-- 
FRIGN
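The `[α-ω]` to `[Α-Ω]` example works once ranges are compared and mapped as code points rather than as bytes. Here is a minimal sketch, assuming a purely positional mapping between two ranges of equal length; `map_range()` is a hypothetical helper and `Rune` a local stand-in for the libutf type, neither taken from the patch.

```c
#include <stdint.h>

typedef uint32_t Rune; /* stand-in for libutf's Rune; an assumption */

/*
 * If r lies in [from_start, from_end], return the rune at the same
 * offset in the range beginning at to_start; otherwise return r as is.
 */
static Rune
map_range(Rune r, Rune from_start, Rune from_end, Rune to_start)
{
	if (r >= from_start && r <= from_end)
		return to_start + (r - from_start);
	return r;
}
```

With α = U+03B1, ω = U+03C9 and Α = U+0391, the letter ε (U+03B5) maps to Ε (U+0395). Two caveats of a positional mapping show up in Greek: ή (U+03AE, with tonos) lies outside α-ω and passes through unchanged, and the final sigma ς (U+03C2) would land on U+03A2, which is unassigned.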
Re: [dev] [sbase] [PATCH-UPDATE] Rewrite tr(1) in a sane way
FRIGN said:
> +#define UPPER "A-Z"
> +#define LOWER "a-z"
> +#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

These definitions hugely misrepresent corresponding character classes.

-- 
Dmitrij D. Czarkoff