date:20150109

[dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread FRIGN

Hello fellow hackers,

the current tr(1) implementation has really been giving me nightmares,
so I rewrote it.
Given POSIX really sucks in some areas, I deviated from it in a few
places, but not in a way that would break scripts.
Here's a comparison; let me know which you prefer:

1) GNU coreutils:

$ echo "Motörhead" | tr öo oö
M�to�rhead

What happens? coreutils-tr(1) doesn't support multibyte characters
and actually interprets ö as multiple single characters, which is
the reason why it obviously messes it up.
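
To see where the two stray bytes come from, here is a minimal sketch of a purely byte-oriented translation. It is not the coreutils source, and the name bytetr is made up for illustration; it only assumes that each byte of set1 is mapped to the byte at the same position in set2.

```c
#include <stddef.h>
#include <string.h>

/* Translate buf in place, one byte at a time: the i-th byte of set1 is
 * replaced by the i-th byte of set2. Multibyte characters are not
 * recognised; each of their bytes is mapped on its own. */
static void
bytetr(unsigned char *buf, size_t len, const char *set1, const char *set2)
{
	unsigned char map[256];
	size_t i, n = strlen(set1);

	for (i = 0; i < 256; i++)
		map[i] = (unsigned char)i;          /* identity by default */
	for (i = 0; i < n && set2[i] != '\0'; i++)
		map[(unsigned char)set1[i]] = (unsigned char)set2[i];
	for (i = 0; i < len; i++)
		buf[i] = map[buf[i]];
}
```

With set1 = "öo" (bytes c3 b6 6f) and set2 = "oö" (bytes 6f c3 b6), the table maps c3 to 'o', b6 to c3 and 'o' to b6. "Motörhead" then becomes 'M', b6, 't', 'o', c3, "rhead": two lone bytes that are not valid UTF-8, which is what the terminal shows above.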

2) old sbase-tr:


$ echo "Motörhead" | ./tr öo oö
Mötorhead
$ echo "x    x" | ./tr -s " "
usage: ./tr [-d] [-c] set1 [set2]
$ wc -l tr.c
356 tr.c

Oh geez! You can't squeeze! Well, seems like I have to use coreutils now.

3) new tr:

$ echo "Motörhead" | ./tr öo oö
Mötorhead
$ echo "x    x" | ./tr -s " "
x x
$ wc -l tr.c
243 tr.c

Works just fine!

Please test it and let me know what you think!

Cheers

FRIGN

-- 
FRIGN 
From 2ff2c365fac5a0c0c0b6ee88cbbb4502a2dcf0a6 Mon Sep 17 00:00:00 2001
From: FRIGN 
Date: Fri, 9 Jan 2015 20:36:27 +0100
Subject: [PATCH] Rewrite tr(1) in a sane way

tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:

 - UTF-8: not allowed in POSIX, but in my opinion a must. This
   finally allows you to work with UTF-8 streams without
   problems or unexpected behaviour.
 - Equivalence classes: Left out. Even GNU coreutils ignores them,
   and they depend on LC_COLLATE, which sucks.
 - Character classes: No experiments or environment-variable trickery.
   Just plain definitions derived from the POSIX standard,
   working as expected.

I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for literal
expressions ('\n', '\t', '\001', ...) and probably rethinking
the way [_*n] is unnecessarily restricted to string2.
---
 Makefile |   1 +
 tr.c | 487 ---
 utf.h|   1 +
 3 files changed, 189 insertions(+), 300 deletions(-)

diff --git a/Makefile b/Makefile
index 14dc982..b4565bf 100644
--- a/Makefile
+++ b/Makefile
@@ -20,6 +20,7 @@ HDR =\
 
 LIBUTF = libutf.a
 LIBUTFSRC =\
+	libutf/chartorunearr.c\
 	libutf/readrune.c\
 	libutf/rune.c\
 	libutf/runetype.c\
diff --git a/tr.c b/tr.c
index b661048..cfd1c94 100644
--- a/tr.c
+++ b/tr.c
@@ -1,356 +1,243 @@
-/* See LICENSE file for copyright and license details. */
-#include 
 #include 
 #include 
-#include 
-#include 
 
-#include "text.h"
+#include "utf.h"
 #include "util.h"
 
-static void
-usage(void)
-{
-	eprintf("usage: %s [-d] [-c] set1 [set2]\n", argv0);
-}
-
-static int dflag, cflag;
-static wchar_t mappings[0x11];
+static int cflag = 0;
+static int dflag = 0;
+static int sflag = 0;
 
-struct wset_state {
-	char *s;   /* current character */
-	wchar_t rfirst, rlast; /* first and last in range */
-	wchar_t prev;  /* previous returned character */
-	int prev_was_range;/* was the previous character part of a c-c range? */
+struct range {
+	Rune   start;
+	Rune   end;
+	size_t quant;
 };
 
-struct set_state {
-	char *s, rfirst, rlast, prev;
-	int prev_was_octal; /* was the previous returned character written in octal? */
+#define DIGIT "0-9"
+#define UPPER "A-Z"
+#define LOWER "a-z"
+#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
+#define ALNUM DIGIT UPPER LOWER
+
+struct class {
+	char  *name;
+	char  *str;
+} classes[] = {
+	{ "alnum",  ALNUM   },
+	{ "alpha",  UPPER LOWER },
+	{ "blank",  " \t"   },
+	{ "cntrl",  "\000-\037\177" },
+	{ "digit",  DIGIT   },
+	{ "graph",  ALNUM PUNCT },
+	{ "lower",  LOWER   },
+	{ "print",  ALNUM PUNCT " " },
+	{ "punct",  PUNCT   },
+	{ "space",  " \t\n\v\f\r"   },
+	{ "upper",  UPPER   },
+	{ "xdigit", DIGIT "A-Fa-f"  },
 };
 
-static void
-set_state_defaults(struct set_state *s)
-{
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_octal = 1;
-}
+struct range *set1 = NULL;
+size_t set1ranges  = 0;
+struct range *set2 = NULL;
+size_t set2ranges  = 0;
 
-static void
-wset_state_defaults(struct wset_state *s)
+static size_t
+rangelen(struct range r)
 {
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_range = 1;
+	return (r.end - r.start + 1) * r.quant;
 }
 
-/* sets *s to the char that was intended to be written.
- * returns how many bytes the s pointer has to advance to skip the
- * escape sequence if it was an octal, always zero otherwise. */
-static int
-resolve_escape(char *s)
+static size_t
+setlen(struct range *set, size_t setra

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread Nick

Quoth FRIGN:
>  - UTF-8: not allowed in POSIX, but in my opinion a must. This
>   finally allows you to work with UTF-8 streams without
>   problems or unexpected behaviour.

I fully agree (unsurprisingly). Anything that relies on the POSIX 
behaviour to do weird things involving multibyte characters is 
insane.

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread random832

On Fri, Jan 9, 2015, at 16:44, Nick wrote:
> Quoth FRIGN:
> >  - UTF-8: not allowed in POSIX, but in my opinion a must. This
> >   finally allows you to work with UTF-8 streams without
> >   problems or unexpected behaviour.
> 
> I fully agree (unsurprisingly). Anything that relies on the POSIX 
> behaviour to do weird things involving multibyte characters is 
> insane.

Er... http://pubs.opengroup.org/onlinepubs/009696899/utilities/tr.html
has very little mention of the issue one way or another, but does use
the term "characters" rather than "bytes" in all relevant places, and
talks about "multi-byte characters" in a tone that suggests they should
be supported properly when LC_CTYPE has them.

The only _questionable_ bits are some of the language surrounding the
use of octal sequences:

For single characters: "Multi-byte characters require multiple,
concatenated escape sequences of this type, including the leading '\'
for each byte."

I read this as meaning that multi-byte characters are supported, and in
fact that "tr '\303\266o' 'o\303\266'" means that \303\266 [two escape
sequences representing one multi-byte character] and o will be swapped -
and that it is not possible to specify multibyte characters with octal
values in a dash-separated range specification (but they can be included
as literals).
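
As an illustration of that reading, the sketch below expands \ooo escapes into raw bytes and then decodes a two-byte UTF-8 sequence. The function names are hypothetical and this is not code from sbase or from POSIX; it only shows that \303\266 yields the bytes 0xC3 0xB6, which together encode the single character U+00F6.

```c
#include <stddef.h>

/* Expand \ooo octal escapes (one to three digits) into raw bytes; all
 * other characters are copied unchanged. Returns the number of bytes
 * written to out (at most outsz). */
static size_t
unescape_octal(const char *in, unsigned char *out, size_t outsz)
{
	size_t n = 0;

	while (*in != '\0' && n < outsz) {
		if (in[0] == '\\' && in[1] >= '0' && in[1] <= '7') {
			unsigned v = 0;
			int digits = 0;

			for (in++; digits < 3 && *in >= '0' && *in <= '7'; in++, digits++)
				v = v * 8 + (unsigned)(*in - '0');
			out[n++] = (unsigned char)v;
		} else {
			out[n++] = (unsigned char)*in++;
		}
	}
	return n;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns 0 if p does not start with such a sequence. */
static unsigned long
decode2(const unsigned char *p)
{
	if ((p[0] & 0xE0) != 0xC0 || (p[1] & 0xC0) != 0x80)
		return 0;
	return ((unsigned long)(p[0] & 0x1F) << 6) | (unsigned long)(p[1] & 0x3F);
}
```

Feeding the string \303\266o through unescape_octal gives three bytes (0xC3, 0xB6, 'o'), and decode2 on the first two gives 0xF6, so a character-aware tr would see two characters in that set, not three.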

Or, is it possible that FRIGN misinterpreted the prohibition on
"multi-character collating elements" ?

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread FRIGN

On Fri, 09 Jan 2015 17:41:19 -0500
random...@fastmail.us wrote:

> Or, is it possible that FRIGN misinterpreted the prohibition on
> "multi-character collating elements" ?

Did you read what I said? I explicitly went away from POSIX in this regard,
because no human would write "tr '\303\266o' 'o\303\266'".
The reason why POSIX prohibits collating elements is only because they are
inhibited by their own overload of different character sets and locales.
Given assuming a UTF-8-locale is a very sane way to go (see Plan 9), this
limit can easily be thrown off and makes life easier.

Cheers

FRIGN

-- 
FRIGN

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread random832

On Fri, Jan 9, 2015, at 17:48, FRIGN wrote:
> Did you read what I said? I explicitly went away from POSIX in this
> regard,
> because no human would write "tr '\303\266o' 'o\303\266'".

POSIX doesn't require people to write it, it just requires that it
works. POSIX has no problem with also allowing a literally typed
multibyte character to refer to itself. It's basically saying that if
someone _does_ write '\303\266o' 'o\303\266', you have to treat it the
same as öo oö, and not as the individual bytes.

> The reason why POSIX prohibits collating elements is only because they
> are
> inhibited by their own overload of different character sets and locales.
> Given assuming a UTF-8-locale is a very sane way to go (see Plan 9), this
> limit can easily be thrown off and makes life easier.

I don't think you're understanding the difference between
multi-character collating elements and multibyte characters.

Multi-character collating elements are things like "ch" in some Spanish
locales. They have nothing to do with UTF-8.
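
The distinction can be made concrete with a small sketch (helper names are made up, and valid UTF-8 input is assumed): "ö" is two bytes but one character, whereas "ch" is two bytes and two characters that a locale may merely choose to sort as one unit.

```c
#include <stddef.h>

/* Count bytes up to the terminating NUL. */
static size_t
nbytes(const char *s)
{
	size_t n = 0;

	while (s[n] != '\0')
		n++;
	return n;
}

/* Count UTF-8 characters: every byte that is not a 10xxxxxx
 * continuation byte starts a new character. */
static size_t
nchars(const char *s)
{
	size_t n = 0;

	for (; *s != '\0'; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For "\xc3\xb6" (ö) this gives 2 bytes and 1 character; for "ch" it gives 2 bytes and 2 characters. Nothing at the encoding level marks "ch" as special; treating it as one collating element is purely a matter of the locale's LC_COLLATE rules.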

Re: [dev] [sbase] [PATCH-UPDATE] Rewrite tr(1) in a sane way

2015-01-09 Thread FRIGN

On Fri, 9 Jan 2015 20:39:48 +0100
FRIGN  wrote:

> 

sin just told me the patch was missing chartorunearr.c
which in fact is the case.
Here's an updated patch which should cleanly apply to
a vanilla codebase at HEAD.

Cheers

FRIGN

-- 
FRIGN 
From f626eecfb757ab46cab7f16dc439258a6a497f1b Mon Sep 17 00:00:00 2001
From: FRIGN 
Date: Fri, 9 Jan 2015 20:36:27 +0100
Subject: [PATCH] Rewrite tr(1) in a sane way

tr(1) always used to be a saddening part of sbase, which was
inherently broken and crufted.
But to be fair, the POSIX-standard doesn't make it very simple.
Given the current version was unfixable and broken by design, I
sat down and rewrote tr(1) very close to the concept of set theory
and the POSIX-standard with a few exceptions:

 - UTF-8: not allowed in POSIX, but in my opinion a must. This
   finally allows you to work with UTF-8 streams without
   problems or unexpected behaviour.
 - Equivalence classes: Left out. Even GNU coreutils ignores them,
   and they depend on LC_COLLATE, which sucks.
 - Character classes: No experiments or environment-variable trickery.
   Just plain definitions derived from the POSIX standard,
   working as expected.

I tested this thoroughly, but expect problems to show up in some
way given the wide range of input this program has to handle.
The only thing left on the TODO is to add support for literal
expressions ('\n', '\t', '\001', ...) and probably rethinking
the way [_*n] is unnecessarily restricted to string2.
---
 Makefile   |   1 +
 libutf/chartorunearr.c |  27 +++
 tr.c   | 487 +++--
 utf.h  |   1 +
 4 files changed, 216 insertions(+), 300 deletions(-)
 create mode 100644 libutf/chartorunearr.c

diff --git a/Makefile b/Makefile
index 14dc982..b4565bf 100644
--- a/Makefile
+++ b/Makefile
@@ -20,6 +20,7 @@ HDR =\
 
 LIBUTF = libutf.a
 LIBUTFSRC =\
+	libutf/chartorunearr.c\
 	libutf/readrune.c\
 	libutf/rune.c\
 	libutf/runetype.c\
diff --git a/libutf/chartorunearr.c b/libutf/chartorunearr.c
new file mode 100644
index 0000000..8d13e1f
--- /dev/null
+++ b/libutf/chartorunearr.c
@@ -0,0 +1,27 @@
+/* See LICENSE file for copyright and license details. */
+#include 
+#include 
+
+#include "../util.h"
+#include "../utf.h"
+
+int
+chartorunearr(const char *str, Rune **r)
+{
+	size_t len = strlen(str), rlen, roff, ret = 1, i;
+	Rune s;
+
+	for (rlen = 0, roff = 0; roff < len && ret; rlen++) {
+		ret = charntorune(&s, str + roff, MAX(UTFmax, len - roff));
+		roff += ret;
+	}
+
+	*r = emalloc((rlen + 1) * sizeof(Rune));
+	(*r)[rlen] = 0;
+
+	for (i = 0, roff = 0; roff < len && i < rlen; i++) {
+		roff += charntorune(&(*r)[i], str + roff, MAX(UTFmax, len - roff));
+	}
+
+	return rlen;
+}
diff --git a/tr.c b/tr.c
index b661048..cfd1c94 100644
--- a/tr.c
+++ b/tr.c
@@ -1,356 +1,243 @@
-/* See LICENSE file for copyright and license details. */
-#include 
 #include 
 #include 
-#include 
-#include 
 
-#include "text.h"
+#include "utf.h"
 #include "util.h"
 
-static void
-usage(void)
-{
-	eprintf("usage: %s [-d] [-c] set1 [set2]\n", argv0);
-}
-
-static int dflag, cflag;
-static wchar_t mappings[0x11];
+static int cflag = 0;
+static int dflag = 0;
+static int sflag = 0;
 
-struct wset_state {
-	char *s;   /* current character */
-	wchar_t rfirst, rlast; /* first and last in range */
-	wchar_t prev;  /* previous returned character */
-	int prev_was_range;/* was the previous character part of a c-c range? */
+struct range {
+	Rune   start;
+	Rune   end;
+	size_t quant;
 };
 
-struct set_state {
-	char *s, rfirst, rlast, prev;
-	int prev_was_octal; /* was the previous returned character written in octal? */
+#define DIGIT "0-9"
+#define UPPER "A-Z"
+#define LOWER "a-z"
+#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
+#define ALNUM DIGIT UPPER LOWER
+
+struct class {
+	char  *name;
+	char  *str;
+} classes[] = {
+	{ "alnum",  ALNUM   },
+	{ "alpha",  UPPER LOWER },
+	{ "blank",  " \t"   },
+	{ "cntrl",  "\000-\037\177" },
+	{ "digit",  DIGIT   },
+	{ "graph",  ALNUM PUNCT },
+	{ "lower",  LOWER   },
+	{ "print",  ALNUM PUNCT " " },
+	{ "punct",  PUNCT   },
+	{ "space",  " \t\n\v\f\r"   },
+	{ "upper",  UPPER   },
+	{ "xdigit", DIGIT "A-Fa-f"  },
 };
 
-static void
-set_state_defaults(struct set_state *s)
-{
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_octal = 1;
-}
+struct range *set1 = NULL;
+size_t set1ranges  = 0;
+struct range *set2 = NULL;
+size_t set2ranges  = 0;
 
-static void
-wset_state_defaults(struct wset_state *s)
+static size_t
+rangelen(struct range r)
 {
-	s->rfirst = 1;
-	s->rlast = 0;
-	s->prev_was_range = 1;
+	return (r.end - r.start + 1) * r.quant;
 }
 
-/* sets *s to the char that was intended to be written.
- * returns how many bytes the s pointer has to advance to skip the
- * escape sequence if it was an octal, always zero

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread FRIGN

On Fri, 09 Jan 2015 17:55:04 -0500
random...@fastmail.us wrote:

> POSIX doesn't require people to write it, it just requires that it
> works. POSIX has no problem with also allowing a literally typed
> multibyte character to refer to itself. It's basically saying that if
> someone _does_ write '\303\266o' 'o\303\266', you have to treat it the
> same as öo oö, and not as the individual bytes.

This is madness. If you want the bytes to be collated, you just write the
literal \50102. POSIX often is a solution to a problem that doesn't exist
in the first place when you just use UTF-8.

> They have nothing to do with UTF-8.

That's exactly the point. Collating elements are depending on the current
locale which is too much of a mess to deal with.
So when the Spanish "ll" collates before "m" and after "l" in a given
locale, we don't give a fuck.
So please give me the point why you are torturing me with this information.
I stated that I did not implement collating elements into this tr(1) at
the beginning and that it's a POSIX-nightmare to do so, bringing harm
to anybody who is interested in a consistent, usable tool.

Cheers

FRIGN

-- 
FRIGN

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread random832

On Fri, Jan 9, 2015, at 18:08, FRIGN wrote:
> 
> This is madness. If you want the bytes to be collated,

I don't see where you're getting that either of us want the bytes to be
collated. I don't even know what you mean by "collated", since collating
is not what tr does, except when ordering ranges.

> you just write the
> literal \50102. 

Even if octal values could be more than three digits, I have no idea
what you think 50102 is. Its decimal value is 20546. Its hex value is
0x5042. I have no idea what it has to do with character U+00F6 whose
UTF-8 representation is 0xC3 0xB6. I just realized what you're
doing, 0xC3B6 has the _decimal_ value 50102, I have no idea why you
would think _that_ is a representation people would want to use. If
you're so pro-unicode, make it accept \u00F6 - that's a valid extension.
But reusing the syntax POSIX uses for three-digit octal literals, for
arbitrarily long decimal literals that aren't even unicode code points,
makes no sense at all. In what universe is that intuitive?
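
The numbers in the paragraph above can be checked mechanically. A small sketch follows; the helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Interpret a digit string in the given base (8, 10 or 16). */
static unsigned long
value_in_base(const char *digits, int base)
{
	return strtoul(digits, NULL, base);
}

/* Concatenate the bytes of a sequence into one big-endian integer,
 * e.g. the UTF-8 bytes 0xC3 0xB6 become 0xC3B6. */
static unsigned long
bytes_as_integer(const unsigned char *p, size_t n)
{
	unsigned long v = 0;

	while (n-- > 0)
		v = (v << 8) | *p++;
	return v;
}
```

Read as octal, "50102" is 20546, which is 0x5042. Read as decimal, 50102 equals 0xC3B6, the two UTF-8 bytes of ö glued together. The code point itself, U+00F6, is 246 in decimal and appears in neither reading.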

> POSIX often is a solution to a problem that doesn't exist
> in the first place when you just use UTF-8.
> 
> > They have nothing to do with UTF-8.
> 
> That's exactly the point. Collating elements are depending on the current
> locale which is too much of a mess to deal with.

Huh?

> So when the Spanish "ll" collates before "m" and after "l" in a given
> locale, we don't give a fuck.
> So please give me the point why you are torturing me with this
> information.

Because collating elements are the thing POSIX forbids which you appear
to have _misinterpreted_ as forbidding multibyte characters. Otherwise I
have _no idea_ what in POSIX you interpret as preventing reasonable
behavior with UTF-8 multibyte characters.

> I stated that I did not implement collating elements into this tr(1) at
> the beginning and that it's a POSIX-nightmare to do so, bringing harm
> to anybody who is interested in a consistent, usable tool.

tl;dr:

Collating elements = POSIX forbids them = You don't want them anyway.
Multibyte characters = POSIX allows/requires them = You like them too.
What is the problem?
I don't know what you want to do that you think POSIX doesn't allow.

Re: [dev] [sbase] [PATCH] Rewrite tr(1) in a sane way

2015-01-09 Thread FRIGN

On Fri, 09 Jan 2015 18:24:46 -0500
random...@fastmail.us wrote:

> Even if octal values could be more than three digits, I have no idea
> what you think 50102 is. Its decimal value is 20546. Its hex value is
> 0x5042. I have no idea what it has to do with character U+00F6 whose
> UTF-8 representation is 0xC3 0xB6. I just realized what you're
> doing, 0xC3B6 has the _decimal_ value 50102, I have no idea why you
> would think _that_ is a representation people would want to use. If
> you're so pro-unicode, make it accept \u00F6 - that's a valid extension.
> But reusing the syntax POSIX uses for three-digit octal literals, for
> arbitrarily long decimal literals that aren't even unicode code points,
> makes no sense at all. In what universe is that intuitive?

C3B6 is 'ö', and it makes sense to allow specifying it as \50102 (in the pure
UTF-8 sense of course, nothing to do with collating).

> Collating elements = POSIX forbids them = You don't want them anyway.
> Multibyte characters = POSIX allows/requires them = You like them too.
> What is the problem?
> I don't know what you want to do that you think POSIX doesn't allow.

Well, probably I misunderstood the matter. Sometimes this stuff gets
above my head. ;)
At the end of the day, you want software to work as expected:

GNU tr:
$ echo ελληνική | tr [α-ω] [Α-Ω]
®

our tr:
$ echo ελληνικη | ./tr [α-ω] [Α-Ω]   
ΕΛΛΗΝΙΚΗ
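
What makes the second invocation work is that the ranges are compared as code points rather than bytes. A minimal sketch of such a range mapping follows; the Rune typedef and the name maprange are assumptions of this example, not the code from the patch.

```c
#include <stdint.h>

typedef uint32_t Rune; /* stand-in for libutf's Rune type (assumption) */

/* Map r from the range [from_start, from_end] onto the range beginning
 * at to_start, keeping its offset within the range. Code points outside
 * the source range are returned unchanged. */
static Rune
maprange(Rune r, Rune from_start, Rune from_end, Rune to_start)
{
	if (r < from_start || r > from_end)
		return r;
	return to_start + (r - from_start);
}
```

With α-ω as U+03B1..U+03C9 and Α as U+0391, ε (U+03B5) maps to Ε (U+0395) and ω (U+03C9) to Ω (U+03A9). The accented ή (U+03AE) in the GNU example lies just below α and would pass through unchanged. A pure offset mapping also sends final sigma ς (U+03C2) to U+03A2, which is not an assigned letter, so a range like this is not a full case conversion.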

Cheers

FRIGN

-- 
FRIGN

Re: [dev] [sbase] [PATCH-UPDATE] Rewrite tr(1) in a sane way

2015-01-09 Thread Dmitrij D. Czarkoff

FRIGN said:
> +#define UPPER "A-Z"
> +#define LOWER "a-z"
> +#define PUNCT "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

These definitions hugely misrepresent corresponding character classes.

-- 
Dmitrij D. Czarkoff
