Re: gawk regex stuff you may want

Paul Eggert Thu, 21 Jan 2016 09:09:15 -0800

On 01/20/2016 11:59 AM, Aharon Robbins wrote:


I don't see that. In regex.h:

| struct re_pattern_buffer
| {
|   /* Space that holds the compiled pattern.  It is declared as
|      `unsigned char *' because its elements are sometimes used as
|      array indexes.  */
|   unsigned char *__REPB_PREFIX(buffer);

Ah, that's probably the problem, and is why your version needs extracasts between unsigned char * and re_dfa_t *. In gnulib, that last linereads 'struct re_dfa_t *__REPB_PREFIX(buffer);', which is what thebuffer really is, and avoids the need for casts later.

Since malloc (0) is well-defined to return either NULL or a valid pointer
to a zero-byte object that can be freed, the code is working as-is.

Not on older systems. Although yes, I'm not sure I still support such
systems.  I don't know which checker complained. I think it was valgrind
at some point in the past.

Those older systems are long dead, as this requirement was standardizedin C89 and even the oldest systems we formerly supported (SunOS 4, if Irecall) conformed to the standard long ago. valgrind 3.11 does notcomplain about simple uses like free (malloc (0)) on my platform (Fedora23 x86-64). Perhaps older versions of valgrind complain on someplatforms, but I'd rather not worry about that; it's really a valgrind bug.

I'd rather look for a solution that involves silencing the checking
without making the code bulkier and typically slower.

This is in compilation, not execution; not sure the difference
is really noticable.

There is some runtime slowdown and code bloat. You're right, it's notsignificant, but it's the principle of the thing: I'd rather not bloatthe code just to pacify a busted checker.

If a byte value given inside [...] is greater than 127, it should
be left alone to be matched as-is (no arbitrary limits).

Hah! We recently ran into a similar problem with 'grep'. Previously,part of the grep code assumed that unibyte locales cannot have encodingerrors, so in a unibyte locale one can just use the traditionalbyte-oriented algorithm without worrying about mbrlen and the like.Other parts of the grep code checked for encoding errors in the usualway, e.g., with mbrlen. Unfortunately, the assumption about unibytelocales is incorrect, and the inconsistencies between the differentparts of 'grep' caused confusing and arguably incorrect behavior whichtook a bit of hackery to resolve; see the thread starting at<http://bugs.gnu.org/20526#86>.

As far as 'grep' is concerned, it'll trust what regcomp does here, so wedo have some freedom to change the code in this area. However, it looksto me like your patch would do the wrong thing for unibyte locales wherebtowc (b) returns a value that neither b nor WEOF. Also, the rest thecode assumes that if btowc returns WEOF in a multibyte locale then therewon't be a match (see the setup code in init_dfa, and I have the naggingfeeling that this assumption is embedded elsewhere). So, how about theattached more-conservative patch instead?

        /* Build single byte matching table for this equivalence class.  */
+      char_buf[1] = (unsigned char) '\0';

This should be unnecessary, as the rest of the code shouldn't be looking
at that byte. Is this something to pacify a lint checker?

Or valgrind at some point. I think a static checker that someone submitted
a report from.

It might be helpful to get at the bottom of this. If it's some buggystatic checker then I would rather not worry about it.

-       wc = wc2;
+       wc = (wint_t) wc2;

wc and wc2 are both integers so the cast is unnecessary. Similarly elsewhere.

There was a compiler (maybe VMS) for which this was necessary.

Again, it'd be helpful to know what the problem actually was. I expectit was just a warning, which is fine to ignore. It's valid standard Ccode, for what it's worth.

From 8f17f9b379db88ce39c79e9bb4184a37bd04b363 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Thu, 21 Jan 2016 09:06:20 -0800
Subject: [PATCH] regex: treat [x] as x if x is a unibyte encoding error

Problem reported by Aharon Robbins in:
http://lists.gnu.org/archive/html/bug-gnulib/2016-01/msg00091.html
* lib/regcomp.c (parse_byte) [_LIBC && RE_ENABLE_I18N]: New function.
(build_range_exp) [_LIBC && RE_ENABLE_I18N]: Use it.
---
 ChangeLog     |  8 ++++++++
 lib/regcomp.c | 17 +++++++++++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 181f709..a870e86 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,11 @@
+2016-01-21  Paul Eggert  <eggert@cs.ucla.edu>
+
+	regex: treat [x] as x if x is a unibyte encoding error
+	Problem reported by Aharon Robbins in:
+	http://lists.gnu.org/archive/html/bug-gnulib/2016-01/msg00091.html
+	* lib/regcomp.c (parse_byte) [_LIBC && RE_ENABLE_I18N]: New function.
+	(build_range_exp) [_LIBC && RE_ENABLE_I18N]: Use it.
+
 2016-01-21  Bruno Haible  <bruno@clisp.org>
 
 	hash-pjw-bare: fix comment
diff --git a/lib/regcomp.c b/lib/regcomp.c
index f86ce06..2ba99d7 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -2696,6 +2696,19 @@ parse_dup_op (bin_tree_t *elem, re_string_t *regexp, re_dfa_t *dfa,
 #define BRACKET_NAME_BUF_SIZE 32
 
 #ifndef _LIBC
+
+# ifdef RE_ENABLE_I18N
+/* Convert the byte B to the corresponding wide character.  In a
+   unibyte locale, treat B as itself if it is an encoding error.
+   In a multibyte locale, return WEOF if B is an encoding error.  */
+static wint_t
+parse_byte (unsigned char b, re_charset_t *mbcset)
+{
+  wint_t wc = __btowc (b);
+  return wc == WEOF && !mbcset ? b : wc;
+}
+#endif
+
   /* Local function for parse_bracket_exp only used in case of NOT _LIBC.
      Build the range expression which starts from START_ELEM, and ends
      at END_ELEM.  The result are written to MBCSET and SBCSET.
@@ -2747,9 +2760,9 @@ build_range_exp (const reg_syntax_t syntax,
 	      : ((end_elem->type == COLL_SYM) ? end_elem->opr.name[0]
 		 : 0));
     start_wc = ((start_elem->type == SB_CHAR || start_elem->type == COLL_SYM)
-		? __btowc (start_ch) : start_elem->opr.wch);
+		? parse_byte (start_ch, mbcset) : start_elem->opr.wch);
     end_wc = ((end_elem->type == SB_CHAR || end_elem->type == COLL_SYM)
-	      ? __btowc (end_ch) : end_elem->opr.wch);
+	      ? parse_byte (end_ch, mbcset) : end_elem->opr.wch);
     if (start_wc == WEOF || end_wc == WEOF)
       return REG_ECOLLATE;
     else if (BE ((syntax & RE_NO_EMPTY_RANGES) && start_wc > end_wc, 0))
-- 
2.5.0

Re: gawk regex stuff you may want

Reply via email to